New research from Technion, Google Research, and Apple shows that large language models (LLMs) encode far more information about the correctness of their outputs than previously assumed.
A major problem with large language models (LLMs) is their tendency to produce misleading or nonsensical outputs, often referred to as “hallucinations.” The term has no universally accepted definition and covers a wide range of LLM failures.
In this study, the researchers adopted a broad interpretation, treating hallucinations as all errors produced by an LLM, including factual inaccuracies, biases, and other real-world errors.
Most previous research has focused on analyzing the external behavior of LLMs and how users perceive these errors. This new research instead investigates the internal workings of LLMs, specifically the “exact answer tokens” – the response tokens that, if modified, would change the correctness of the answer – to assess whether an output is correct.
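As a rough illustration of the idea (a sketch, not the paper's code), the snippet below uses a Hugging Face fast tokenizer to find which tokens of a generated response cover a given exact-answer string. The model choice and the helper's name are illustrative assumptions.

```python
# Sketch: locate the "exact answer" tokens inside a generated response.
# Assumes a Hugging Face fast tokenizer; `response` and `exact_answer` are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def exact_answer_token_positions(response: str, exact_answer: str) -> list[int]:
    """Return indices of the response tokens whose character span overlaps the exact answer."""
    start = response.find(exact_answer)
    if start == -1:
        return []
    end = start + len(exact_answer)
    enc = tokenizer(response, return_offsets_mapping=True, add_special_tokens=False)
    return [
        i
        for i, (tok_start, tok_end) in enumerate(enc["offset_mapping"])
        if tok_start < end and tok_end > start  # character spans overlap
    ]

# For the response "The capital of France is Paris." with exact answer "Paris",
# this returns the position(s) of the token(s) that spell out "Paris".
```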
The researchers conducted experiments on four variants of the Mistral 7B and Llama 2 models across 10 datasets, showing that information related to correctness is concentrated in the exact answer tokens. They found that training classifiers to predict correctness-related features of the outputs improves error detection.
“These patterns are consistent across nearly all datasets and models, suggesting a common mechanism by which LLMs encode and process correctness during text generation,” the researchers said.
To detect hallucinations, the researchers trained “probing classifiers,” models that predict features related to the correctness of the generated outputs from the LLM's internal activations. Training these classifiers on the exact answer tokens significantly improved error detection.
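To make the setup concrete, here is a minimal sketch of such a probing classifier, assuming you have already extracted one hidden-state vector per generated answer (taken at an exact-answer token position) along with a binary correctness label. The random arrays are placeholders for real activations, and the linear probe is an illustrative choice rather than the paper's exact classifier.

```python
# Minimal probing-classifier sketch (placeholder data, not the paper's implementation).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_size = 4096                          # hidden size of Mistral 7B / Llama 2 7B
X = rng.normal(size=(2000, hidden_size))    # placeholder for activations at exact answer tokens
y = rng.integers(0, 2, size=2000)           # placeholder labels: 1 = correct answer, 0 = error

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# A simple linear probe over the hidden states.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

scores = probe.predict_proba(X_test)[:, 1]  # predicted probability that the answer is correct
print("error-detection AUC:", roc_auc_score(y_test, scores))
```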
They also investigated whether a probing classifier trained on one dataset could detect errors in other datasets, and found that these classifiers did not generalize across different tasks, though they did generalize to tasks requiring similar skills.
Additional experiments showed that the probing classifiers can predict not only the presence of errors but also the types of errors the model is likely to make. They also suggest that a model's internal workings can identify the correct answer even when the model frequently generates an incorrect one.
Finally, the findings suggest that current assessment methods may not accurately reflect the true capabilities of LLMs. Better understanding and leveraging the internal knowledge of these models could significantly reduce errors.
The study's findings could help in designing better hallucination-mitigation systems. However, the techniques it uses require access to the LLM's internal representations, which is mostly feasible only with open-source models.
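For context, this is the kind of access involved: open-source models served through Hugging Face Transformers expose their per-layer hidden states, as in the sketch below. The model choice, layer index, and prompt are illustrative, and loading a 7B model this way requires a suitably large GPU.

```python
# Sketch: reading internal representations from an open-weight model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-v0.1"          # illustrative open-weight model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple: (embedding layer, layer 1, ..., layer N).
# Take, e.g., a middle layer's activation at the last token as a probe feature.
middle_layer = len(out.hidden_states) // 2
feature = out.hidden_states[middle_layer][0, -1]    # shape: (hidden_size,)
print(feature.shape)
```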
Leading AI labs like OpenAI, Anthropic, and Google DeepMind have been working on various techniques to interpret the inner workings of language models. This research could help build more reliable systems.