LLMs Know More Than They Show: A Fascinating Look Inside
What if the secret behind hallucinations lies within the model itself?
A collaboration among researchers from Apple, Technion, and Google Research has arrived at a striking discovery: Large Language Models (LLMs) often “know” a truthful response yet provide a different answer. In other words, they may hold more accurate information internally than their final outputs reveal.
This insight has major implications for how society perceives LLMs, and it points to new techniques that might coax models into producing the “truth” they appear to contain. In particular, such a method for reducing hallucinations could soon become vital for organizations that rely on AI.
In Pursuit of Truth
Hallucinations—where a model outputs incorrect information—are among the biggest shortcomings of LLMs today.
Uncertainty Modeling and Hallucinations
Hallucinations often stem from a model’s uncertainty about its own predictions. LLMs generate not only words but also internal probabilities for each token choice. For instance, a model might state that “London” is the capital of England with 99.9% likelihood and, oddly, also claim it is the city with the highest tea consumption with 75% likelihood. The first statement is correct; the second is not (Turkey, not the UK, leads the world in per-capita tea consumption).
Until now, a prevailing approach to reducing hallucinations has been to examine the model’s confidence or uncertainty: when confidence is too low, the system might refuse to produce an answer. But what if the model’s stated token probabilities aren’t actually the best measure of whether its response is correct?
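To make that concrete, here is a minimal sketch of the confidence-based approach, assuming a Hugging Face causal language model (the “gpt2” checkpoint and the 0.5 threshold are illustrative choices, not anything from the paper): read the probability the model assigns to its own next token and abstain when it falls below a threshold.

```python
# Minimal sketch of confidence-based filtering, assuming a Hugging Face
# causal LM ("gpt2" is just a small illustrative stand-in, not the paper's model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def answer_if_confident(prompt: str, threshold: float = 0.5) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits            # shape: (1, seq_len, vocab_size)
    next_token_probs = torch.softmax(logits[0, -1], dim=-1)
    top_prob, top_id = next_token_probs.max(dim=-1)
    if top_prob.item() < threshold:                # too uncertain: abstain
        return "[abstained: low confidence]"
    return tokenizer.decode([top_id.item()])

print(answer_if_confident("The capital of England is"))
```

The weakness of this recipe is exactly what the researchers question: the probability above measures how likely a token is, not how true it is.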
An Introspective Look at Hallucinations
A new group of researchers suspects the real key may lie deeper inside the model’s internal representations. Their reasoning is straightforward: maybe the model is concealing the correct response—or at least a sense of it—even as it provides a flawed final answer.
They investigated whether internal vector representations within the LLM, rather than its final stated confidence, can indicate when it’s about to produce an incorrect statement.
A Quick Refresher on How LLMs Work
Consider a prompt like “The cat climbed the tall…”: the expected continuation might be “tree” or “table,” because both make sense in context. A continuation like “sea” is illogical there, so the model assigns it a very low probability.
LLMs function similarly, except they use numeric embeddings to represent words. During text processing:
- Each word in the sequence becomes a vector.
- The model updates these vectors based on the context—essentially “re-reading” the text multiple times to refine its understanding.
Bigger models (e.g., “Llama 3.1 405B” vs. “Llama 3.1 8B”) differ mainly in how many of these refinement layers they stack and how large the internal vectors are.
This internal procedure accumulates semantic meaning that helps the model pick the “next token.” The question is: Does the model’s internal state reflect knowledge of the correct answer, even if it later outputs something else?
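To make these internal vector representations tangible, here is a minimal sketch, again assuming a Hugging Face causal LM (the “gpt2” checkpoint is only a small stand-in), that pulls out the hidden vector of the last token at every layer:

```python
# Minimal sketch of inspecting a model's internal token vectors,
# assuming a Hugging Face causal LM ("gpt2" is an illustrative stand-in).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

inputs = tokenizer("The cat climbed the tall", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One tensor per layer (plus the embedding layer), shape (1, seq_len, hidden_dim).
hidden_states = outputs.hidden_states
last_token_vectors = [layer[0, -1] for layer in hidden_states]
print(f"{len(hidden_states)} layers, vector size {last_token_vectors[-1].shape[0]}")
```

These per-layer vectors are the “internal state” the researchers probe, rather than the final probabilities the model reports.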
Identifying Truth Within
Researchers employed a probing classifier: a secondary system that reads the model’s hidden representations and predicts if the next token will be correct or not. This classifier is trained on examples of correct and incorrect outputs, essentially learning patterns that occur when the model “knows” it’s right versus when it’s about to hallucinate.
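As a rough sketch of what such a probe can look like in practice (the placeholder data and the choice of a logistic regression are assumptions for illustration, not the authors’ exact recipe), one can fit a simple classifier on hidden-state vectors labeled by whether the model’s answer turned out to be correct:

```python
# Minimal probing-classifier sketch: a logistic regression trained on
# hidden-state vectors, labeled by whether the model's answer was correct.
# The data here is random placeholder data, not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Suppose we already collected, for many prompts:
#   X: the hidden vector of the answer token at some chosen layer
#   y: 1 if the model's answer was factually correct, 0 if it hallucinated
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))      # placeholder for real hidden states
y = rng.integers(0, 2, size=1000)     # placeholder for real correctness labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```

In a real setup, X would hold actual hidden states collected from the LLM and y the verified correctness of each answer; the key point is that the probe is a small, cheap model sitting on top of a frozen LLM.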
Why Not Just Look at the Model’s Own Probabilities?
LLMs produce probability estimates for each token. However, these probabilities measure likelihood, not truth. Because training data can contain misinformation or biased contexts, the “most likely” token might be untrue. In fact, the entire premise of LLM training is to pick the statistically probable word, not necessarily the verifiable truth. Therefore, the classifier aims to detect a separate signal: Does the model realize that the probable token might be factually incorrect?
The Results
In every tested scenario, the classifier successfully anticipated whether the model’s eventual output would be right or wrong. That means there’s some consistent signature in the model’s hidden layers when the LLM “internally recognizes” it’s making an error—even though it still may produce that erroneous token.
A caveat: these classifiers don’t generalize across tasks or data sets. One must build a dedicated classifier for each unique context, which in practice may be acceptable since organizations typically use LLMs for specific tasks.
Toward a Richer Understanding of LLMs
This research is a powerful example of how much remains unknown about how LLMs operate. Even something as fundamental as whether they track “truth” remains partially mysterious. That’s both unsettling—given the enormous investments in AI—and promising, because new discoveries may refine how these models are used.
Moreover, the study offers a novel route to handle hallucinations: adding a specialized classifier that halts or corrects an LLM before it issues a faulty response.
For businesses, this tactic could become essential, ensuring output accuracy in automated processes. The concept is simple: “If the classifier warns that the next answer is likely wrong, don’t publish it.” This approach could significantly reduce the detrimental effects of incorrect LLM responses.
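A hypothetical guardrail along those lines might look like the sketch below; generate_answer, extract_hidden_vector, and probe are placeholder names for whatever generation, feature-extraction, and probing components an organization already has in place:

```python
# Hypothetical guardrail: only release an answer if a trained probe
# judges it likely correct. All helper names are placeholders.
def guarded_answer(prompt, generate_answer, extract_hidden_vector, probe,
                   min_correctness_prob=0.8):
    answer = generate_answer(prompt)
    features = extract_hidden_vector(prompt, answer)
    p_correct = probe.predict_proba([features])[0, 1]
    if p_correct < min_correctness_prob:
        return "I'm not confident enough to answer that."  # abstain or escalate
    return answer
```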
Companies and industries anxious about adopting AI—due to risks of misinformation—may find renewed confidence through such advanced error-checking mechanisms. The closer we come to neutralizing hallucinations, the more viable LLMs become for mission-critical tasks.
Conclusion: LLMs Remain Crucial—But We’re Still Figuring Them Out
Large Language Models undeniably have limitations and are, in many ways, less capable than some initial hype suggested. Yet ongoing research continues to demystify their internal workings, revealing methods for making them more reliable. LLMs are here to stay and will likely play an increasingly important role in society and the global economy.
We just need better insight into how they function—and how to harness the unspoken truths they already carry within.