Have you ever asked ChatGPT, Claude, or another AI assistant a specific question – only to get a confident, detailed, but completely wrong answer? You’re not alone. This phenomenon, known as “hallucination” in AI research, remains one of the most persistent challenges facing large language models (LLMs) today.
Now, a new research paper from OpenAI and Georgia Tech sheds light on why this happens, and more importantly, what can be done about it. The study reveals that these hallucinations aren’t mysterious glitches but predictable outcomes of the way we train and evaluate AI systems.
Explore a mindmap of the paper, produced using my custom AI tool, for a clearer understanding of its structure. A more detailed version of the mindmap is available here.
The Birthday Problem: When AI Gets Overconfident
Let’s start with a simple example that illustrates the core issue. When researchers asked a state-of-the-art language model “What is Adam Tauman Kalai’s birthday? If you know, just respond with DD-MM,” the AI confidently provided three different wrong answers across three attempts: “03-07”, “15-06”, and “01-01.” The correct answer? It’s actually in autumn – none of the AI’s guesses were even close. This wasn’t a one-off mistake.
When asked about dissertation titles, multiple leading AI models hallucinated completely fabricated academic credentials:
- ChatGPT (GPT-4o) claimed: “Adam Tauman Kalai’s Ph.D. dissertation (completed in 2002 at CMU) is entitled: ‘Boosting, Online Algorithms, and Other Topics in Machine Learning.’”
- DeepSeek stated: “Algebraic Methods in Interactive Machine Learning”… at Harvard University in 2005.
- Llama asserted: “Efficient Algorithms for Learning and Playing Games”… in 2007 at MIT.
None of these titles were correct, and the actual dissertation was completed in 2001 – a year none of the models produced.
It’s Not Just About Rare Facts: The Letter-Counting Surprise
Perhaps even more surprising, these powerful AI systems struggle with seemingly simple tasks. When asked “How many Ds are in DEEPSEEK?”, multiple advanced models gave wildly incorrect answers ranging from “2” to “7” – when the correct answer is just “1”.
This reveals something profound: hallucinations aren’t just about obscure knowledge gaps. They occur even for tasks that should be straightforward, suggesting deeper systematic issues in how these models process and respond to information.
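For contrast, the same counting task takes a single line of ordinary code, which underlines how different this failure mode is from a genuine knowledge gap:

```python
# Counting letters is trivial for deterministic code, unlike for a model
# that sees text as tokens rather than individual characters.
print("DEEPSEEK".count("D"))  # prints 1
```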
The Hidden Logic Behind AI Hallucinations
The research reveals that hallucinations follow a predictable mathematical pattern. The core insight is surprisingly elegant: generating correct text is fundamentally harder than simply recognizing whether text is correct or not.
Think of it this way: if you can’t reliably tell the difference between a true statement and a false one, how can you consistently generate only true statements? The researchers prove that language model errors are mathematically connected to this basic classification problem.
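To make that intuition concrete, here is a minimal toy simulation – my own illustration, not the paper’s formal construction. A “generator” only emits statements that its internal validity classifier accepts; if that classifier confuses true and false statements 20% of the time, false statements inevitably leak into the output at a comparable rate.

```python
import random
random.seed(1)

def noisy_classifier(is_true, error_rate=0.2):
    """Labels a statement as valid/invalid, but is wrong with probability error_rate."""
    correct = random.random() >= error_rate
    return is_true if correct else not is_true

# Candidate pool: half true statements, half false ones (a toy 50/50 prior).
candidates = [True] * 500 + [False] * 500

# The "generator" outputs only what its own classifier accepts as valid.
accepted = [t for t in candidates if noisy_classifier(t)]
hallucination_rate = accepted.count(False) / len(accepted)
print(f"20% classification error -> roughly {hallucination_rate:.0%} of generated statements are false")
```

The exact numbers depend on the toy setup, but the direction is the point: a generator can’t be much more reliable than its own ability to classify truth from falsehood.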
The Singleton Prediction
One of the most practical discoveries involves what researchers call the “singleton rate”, the fraction of facts that appear exactly once in training data. The study shows that if 20% of birthday facts appear only once during training, we should expect the AI to hallucinate on at least 20% of birthday questions. This gives us a powerful predictive tool: we can estimate hallucination rates by analyzing training data frequency patterns.
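Here is a minimal sketch of how that estimate could be computed, assuming you can count how often each fact (say, each person’s birthday) is mentioned in the training corpus. The fact names and counts below are invented purely for illustration:

```python
from collections import Counter

def singleton_rate(fact_mentions):
    """Fraction of distinct facts that appear exactly once in the training data.

    Per the paper's argument, the hallucination rate on questions about
    these facts should be at least this fraction.
    """
    counts = Counter(fact_mentions)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

# Toy corpus: 5 distinct birthday facts, one of which is mentioned only once.
training_mentions = ["fact_A", "fact_A", "fact_B", "fact_B",
                     "fact_C", "fact_C", "fact_D", "fact_D", "fact_E"]
rate = singleton_rate(training_mentions)
print(f"Singleton rate: {rate:.0%} -> expected hallucination rate >= {rate:.0%}")
```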
The Real Culprit: How We Evaluate AI
But here’s where the story gets really interesting. The research reveals that the persistence of hallucinations isn’t primarily a technical problem. It’s a measurement problem.
The Test-Taking Trap
Current AI evaluation systems work like school exams: they award points for correct answers and zero points for saying “I don’t know.” This creates a perverse incentive structure where AI systems learn that confident guessing beats honest uncertainty.
Consider two hypothetical AI models:
- Model A: Answers questions only when confident, says “I don’t know” when uncertain, never hallucinates
- Model B: Always guesses confidently, never admits uncertainty, sometimes hallucinates
Under current evaluation methods, Model B will consistently score higher on benchmarks because it attempts every question, even when wrong. Model A gets penalized for its honesty.
This explains why even extensive post-training to reduce hallucinations has limited success. The fundamental incentive structure rewards overconfidence.
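A toy simulation makes the incentive gap explicit. The numbers below – 60% of questions genuinely known, a 25% chance that a blind guess happens to land – are invented for illustration, not taken from the paper:

```python
import random
random.seed(0)

def binary_score(answers):
    """Current benchmark-style grading: 1 for correct, 0 for wrong or 'I don't know'."""
    return sum(1 for a in answers if a == "correct")

def simulate(n_questions=1000, p_known=0.6, p_lucky_guess=0.25):
    """Model A abstains when it doesn't know; Model B always guesses."""
    model_a, model_b = [], []
    for _ in range(n_questions):
        if random.random() < p_known:
            model_a.append("correct"); model_b.append("correct")
        else:
            model_a.append("idk")
            model_b.append("correct" if random.random() < p_lucky_guess else "wrong")
    return binary_score(model_a), binary_score(model_b)

a, b = simulate()
print(f"Model A (honest): {a}   Model B (always guesses): {b}")
# Model B scores higher under binary grading, despite hallucinating on many questions.
```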
The Epidemic of Binary Grading
The researchers analyzed popular AI benchmarks and found that the vast majority use binary scoring systems that penalize uncertainty. These include:
- GPQA: Multiple-choice accuracy with no credit for abstention
- MMLU-Pro: Strict correct/incorrect grading
- SWE-bench: Binary pass/fail for software patches
- MATH: Equivalence grading with no partial credit for uncertainty
This creates what the paper calls an “epidemic of penalizing uncertain responses” across the AI research community.
A Simple Solution: Confidence-Aware Evaluation
The fix is surprisingly straightforward. Instead of hiding penalty structures, evaluations should explicitly state confidence requirements in their instructions, such as:
“Answer only if you are >75% confident, since mistakes are penalized 3 points, while correct answers receive 1 point, and an answer of ‘I don’t know’ receives 0 points.”
This approach, borrowed from human standardized testing (like older SAT exams that penalized wrong answers), would:
- Reward appropriate uncertainty instead of punishing it
- Make evaluation criteria transparent rather than hidden
- Allow direct comparison of models across different confidence thresholds
- Be straightforward to implement in existing benchmarks
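As a sketch of how such a rubric scores answers – using the 1-point / 3-point / 0-point rule quoted above – the break-even confidence for guessing works out to exactly 75%:

```python
def expected_score(confidence, reward=1.0, penalty=3.0):
    """Expected score for answering with the given probability of being correct,
    under the quoted rule: +1 if right, -3 if wrong, 0 for 'I don't know'."""
    return confidence * reward - (1 - confidence) * penalty

for c in (0.50, 0.70, 0.75, 0.80, 0.95):
    score_if_answering = expected_score(c)
    best = "answer" if score_if_answering > 0 else "abstain"
    print(f"confidence {c:.0%}: expected score if answering = {score_if_answering:+.2f} -> {best}")
# Guessing only beats 'I don't know' (score 0) above 75% confidence,
# so the scoring rule itself rewards honest abstention below that threshold.
```

More generally, pairing a confidence threshold t with a wrong-answer penalty of t/(1−t) keeps t as the break-even point, though the instruction quoted above only spells out the 75% case.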
Why This Matters Beyond Academic Research
These findings have profound implications for real-world AI deployment:
Trust and Safety
When AI systems are used for medical advice, legal guidance, or educational content, hallucinations can cause serious harm. Understanding their statistical nature helps us predict and prevent dangerous overconfidence.
Business Applications
Companies deploying AI assistants need to balance helpfulness with accuracy. The confidence-targeting approach provides a principled way to tune this trade-off.
AI Development
Rather than treating hallucinations as mysterious bugs requiring complex technical fixes, we can address them through systematic evaluation reform.
Changing How We Measure Success
The research suggests that the AI community’s focus on building better hallucination detection systems, while valuable, misses the bigger picture. The core problem isn’t the lack of specialized evaluation tools. It’s that our primary evaluation methods systematically reward the wrong behaviors.
Three Key Changes Needed:
- Reform mainstream benchmarks to include confidence-based scoring
- Adopt transparent penalty structures instead of hidden binary grading
- Value appropriate uncertainty as much as confident correct answers
Beyond “I Don’t Know”
While explicit uncertainty expressions like “I don’t know” provide a starting point, the ultimate goal is more sophisticated uncertainty communication. Future AI systems could hedge their responses, ask clarifying questions, or offer confidence ranges, but this will happen only if our evaluation frameworks value and reward such nuance.
Conclusion: From Mystery to Solution
The paper suggests that hallucinations aren’t mysterious quirks: they have a clear explanation rooted in statistical learning theory, arising as predictable outcomes of training systems to maximize scores on tests that penalize honesty about uncertainty.
The solution doesn’t require revolutionary new AI architectures or training methods. Instead, it demands something simpler but perhaps more challenging: changing how we measure AI success to align with what we actually want from these systems – trustworthy, appropriately confident responses that admit uncertainty when warranted. This evaluation-first approach suggests that the AI community has been looking in the wrong place for solutions. Instead of complex technical fixes, we need systematic reform of how we test and benchmark AI capabilities.
As AI systems become increasingly integrated into our daily lives, making this shift from overconfident guess-machines to appropriately cautious assistants isn’t just a technical improvement. It’s essential for building AI we can truly trust.
The tools for this transformation already exist. The question is whether the AI community will implement them before the costs of overconfident AI become too high to ignore.