
This method uses "critical tokens", which I don't think you can detect until after you've generated an entire response. Using them as part of the search function seems technically possible but infeasible for long outputs. Using it on single-token outputs seems eminently feasible and like a cool research direction.

I think the paper itself demonstrates that the model has something going on internally that is statistically related to whether its answer is correct on a given benchmark. Obviously, the LLM will not always be perfectly accurate about these things. However, say you are using an LLM to summarize sources. There's no real software system right now that signals whether or not a summary is correct. You could use this technique to train probes that predict whether a human would agree the summary is correct, and then flag outputs where the probe predicts disagreement for human review. That's a much less expensive way to detect issues with your LLM than asking a human to review every single output.
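
To make that concrete, here's a rough sketch of what such a probe could look like: a plain logistic-regression probe on a hidden state, assuming a HuggingFace-style model you can read activations from and a small set of (text, human_agrees) labels you collected yourself. This is generic linear probing, not necessarily the paper's exact setup, and the model name and helper functions are placeholders.

    # Rough sketch: train a linear probe on a hidden state to predict
    # whether a human would agree with a summary. Not the paper's method.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from sklearn.linear_model import LogisticRegression

    model_name = "gpt2"  # stand-in; use whatever model produced the summaries
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
    model.eval()

    def last_token_hidden_state(text, layer=-1):
        """Hidden state of the final token at a chosen layer; used as the probe feature."""
        inputs = tok(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            out = model(**inputs)
        return out.hidden_states[layer][0, -1]  # shape: (hidden_dim,)

    def train_probe(labeled_examples):
        """labeled_examples: list of (source + summary text, label) pairs,
        label=1 if a human agreed the summary was correct. Collecting
        these labels is the up-front cost."""
        X = torch.stack([last_token_hidden_state(t) for t, _ in labeled_examples]).numpy()
        y = [label for _, label in labeled_examples]
        return LogisticRegression(max_iter=1000).fit(X, y)

    def needs_review(probe, text, threshold=0.5):
        """Flag outputs where the probe thinks a human would likely disagree."""
        feats = last_token_hidden_state(text).numpy().reshape(1, -1)
        p_agree = probe.predict_proba(feats)[0, 1]
        return p_agree < threshold

The point is that once the probe is trained, routing only the flagged outputs to reviewers is far cheaper than reviewing everything.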

While we don't have great methods for "curing it", we do have some. As I mentioned in a sibling post, contextual calibration and adding or adjusting training data are both options. If you figure out the bug was due to RAG doing something weird, you could adjust your RAG sources or chunking. Regardless, you can't put any human thought into curing bugs that you haven't detected.
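
For what contextual calibration looks like in practice (in the "Calibrate Before Use" sense), here's a tiny sketch for a classification-style prompt. The label set and numbers are made up; the idea is just to rescale by the model's output on a content-free input.

    # Minimal sketch of contextual calibration; probabilities are invented.
    import numpy as np

    def calibrate(label_probs, content_free_probs):
        """Rescale label probabilities by the model's bias on a content-free
        input (e.g. the same prompt with "N/A" in place of the real text),
        then renormalize."""
        scaled = np.asarray(label_probs) / np.asarray(content_free_probs)
        return scaled / scaled.sum()

    p_real = np.array([0.62, 0.38])  # P(positive), P(negative) on the real input
    p_cf = np.array([0.70, 0.30])    # same prompt with "N/A" as the input text
    print(calibrate(p_real, p_cf))   # ~[0.41, 0.59]: the built-in bias toward "positive" is removed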


