I ran a small experiment this week. I had to look at the result twice before I trusted it.

The setup was simple: take four locally-hosted LLMs and give them the classic "father's son" riddle:

"A man looks at a painting and says, 'Brothers and sisters I have none, but that man's father is my father's son.' Who is in the painting?"

The answer is "the man's son." Most people get this wrong on first attempt - they say "himself" because they don't parse the grammar carefully enough. It's a well-known gotcha. It also has a stable, canonical answer, which makes it useful for testing whether models are reasoning or retrieving.

The Experiment

I wanted to understand whether these models were reasoning through the puzzle or retrieving a memorised answer from training data. So I ran four rounds:

  1. Baseline - the original riddle, verbatim
  2. Perturbation - same logical structure, but gender-swapped (mother/daughter instead of father/son)
  3. Explicit instruction - "reason from first principles, don't use pattern recognition"
  4. Metacognitive introspection - asking models to separately report their "retrieval" answer and their "reasoning" answer

The models: Gemma 3 12B (Google's open-weights model), Qwen 3 8B (Alibaba's offering), Phi 4 14B (Microsoft's "small but capable" model), and Llama 3.1 8B (Meta's open-weights model, widely assumed to lean heavily on public Facebook and Instagram text). All running locally via Ollama, a tool that lets you run LLMs on your own hardware rather than calling an API. The specific parameter sizes - 8B, 12B, 14B - were chosen to fit within my GPU's memory; larger variants exist, but they wouldn't fit on my hardware.
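
Here's a minimal sketch of how the baseline round can be scripted - assuming the `ollama` Python client and a local Ollama server; the model tags are approximate and depend on which builds you have pulled:

import ollama  # pip install ollama; talks to the local Ollama server

# Approximate tags - exact names depend on the builds pulled locally.
MODELS = ["gemma3:12b", "qwen3:8b", "phi4:14b", "llama3.1:8b"]

RIDDLE = (
    "A man looks at a painting and says, 'Brothers and sisters I have none, "
    "but that man's father is my father's son.' Who is in the painting?"
)

for model in MODELS:
    reply = ollama.chat(model=model, messages=[{"role": "user", "content": RIDDLE}])
    print(f"--- {model} ---")
    print(reply["message"]["content"])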

The Prompts

Round 1 - Baseline:

A man looks at a painting and says, 'Brothers and sisters I have none, 
but that man's father is my father's son.' Who is in the painting?

Round 2 - Perturbation:

A woman looks at a photograph and says, 'I have no siblings, but that 
person's mother is my mother's daughter.' Who is in the photograph?

Round 3 - Explicit Reasoning:

A woman looks at a photograph and says, 'I have no siblings, but that 
person's mother is my mother's daughter.' Who is in the photograph? 

Reason from first principles, show your reasoning. Do not use pattern 
recognition of the problem to find the solution.

Round 4 - Introspection:

Before answering this riddle, I want you to introspect on your response 
process:

1. Have you seen this riddle or close variants in your training data? 
   (yes/no/uncertain)
2. What answer would you give if you were purely pattern-matching from 
   training data, without reasoning? State this answer and your 
   confidence (0-100%).
3. What answer do you get if you reason through the logic step by step, 
   ignoring any memorised associations? State this answer and your 
   confidence (0-100%).
4. If these answers differ, which do you trust more and why?

The riddle: 'A man looks at a painting and says, Brothers and sisters 
I have none, but that man's father is my father's son. Who is in the 
painting?'

What I Expected

I expected perturbation to break pattern-matching - if a model had simply memorised "father's son riddle → answer is his son", then swapping to mother/daughter should confuse it.

I expected the explicit reasoning instruction to help models that were struggling.

And I expected the introspection prompt to reveal something about how these models process information.

What Actually Happened

Round 1: Gemma, Qwen, and Phi all got it right. Llama confidently answered "The man himself" - wrong, but it's the common human error.

Round 2: The gender swap didn't break the pattern-matchers. Gemma still opened with "This is a classic riddle!" even with mother/daughter framing. Llama got a different wrong answer: "herself as a baby."

Round 3: When explicitly told to reason from first principles, Llama suddenly got it right. Same model, same underlying puzzle structure, but the instruction to show working unlocked correct reasoning.

Round 4: I asked each model to introspect: have you seen this riddle before? What answer would pure pattern-matching give you? What answer do you get from step-by-step reasoning? Which do you trust more?

Gemma's response:

  • Pattern-matching answer: "His son" - 95% confidence
  • Reasoning answer: "Himself" - 98% confidence
  • Which do I trust? "I trust the reasoning-based answer much more."

Read that again. Gemma correctly identified that its training data associated this riddle with "his son" - the right answer. It then reasoned its way to the wrong answer. And it trusted its flawed reasoning over its correct retrieval.

The model knew the right answer. It told me it knew. Then it talked itself out of it. The failure wasn't that reasoning produced an error - reasoning produces errors all the time. The failure was that self-correction overrode a correct answer. Gemma trusted the wrong internal confidence signal. Once it started down a wrong logical path, the mechanism of reasoning forced coherence with its own error. The model effectively gaslit itself.

Why This Matters

There's a persistent assumption in how we deploy LLMs that reasoning is more trustworthy than retrieval. That if we can get a model to "show its working," we're getting something more reliable than pattern-matching.

Not always true. Sometimes the pattern-matched answer is correct because it's been validated by appearing consistently in training data. "Retrieval" here doesn't mean RAG or external lookup - it means a strong prior learned from repeated exposure to validated solutions during training. The reasoning process can introduce errors, especially when the model is trying to be clever or thorough.

Gemma's failure mode: it treated the riddle as a trap (which it is, for humans) and decided that the "obvious" answer must be wrong. Its metacognition told it to distrust retrieval in favour of reasoning. That metacognition was catastrophically miscalibrated.

The Llama Mirror Image

Llama 3.1 showed the exact opposite failure mode.

It couldn't get the right answer through retrieval or naive reasoning. Both channels produced fuzzy, wrong responses. In the introspection round, it reported the same garbled answer for both pattern-matching and reasoning, with moderate confidence in each. It couldn't distinguish between the channels because neither was working.

Llama behaves like a model trained on a Jupiter-sized mass of confident, incorrect Facebook comments. This riddle gets shared constantly on social media, and thousands of people confidently get it wrong in the replies. Llama learned the vibe of self-referential puzzle answers without learning the actual logic. Its training data isn't just unreliable - it's saturated with confident incorrectness. That's the failure mode: not ignorance, but false knowledge absorbed from the crowd.

But when explicitly instructed to reason step-by-step in Round 3, Llama succeeded. The explicit reasoning instruction forced it to work through the structure rather than pattern-match to its poisoned priors.

So Llama should do the opposite of Gemma: it should distrust its retrieval and lean into reasoning. Its training data is unreliable. Its reasoning, when forced to actually do it, works.

The Asymmetry

This gives us a more nuanced picture than "reasoning good, retrieval bad" or vice versa:

  • Gemma: Good training data, flawed reasoning under introspection. Should trust retrieval.
  • Llama: Bad training data, competent reasoning when forced. Should trust reasoning.

The same metacognitive strategy - "trust your reasoning over your pattern-matching" - is correct advice for one model and catastrophic advice for the other. There's no universal answer. You need to know your model.

Practical Implications

If you're using LLMs in production - especially for anything requiring logical consistency:

  1. Don't assume reasoning is always better than retrieval. For well-known problems with validated solutions, the pattern-matched answer might be more reliable.
  2. Introspection prompts can backfire. Asking a model to second-guess itself can cause it to abandon correct answers in favour of plausible-sounding wrong ones.
  3. Explicit reasoning instructions help some models more than others. Llama was transformed by being told to show its working. Qwen and Phi were unaffected because they were already reasoning. Know your model.
  4. Test with perturbations. If you want to know whether a model is reasoning or retrieving, change surface features while preserving logical structure (see the sketch after this list). The answer will tell you a lot.
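
Scripting that check is straightforward - a minimal sketch, again assuming the `ollama` Python client and an illustrative model tag. It runs the baseline and the gender-swapped variant against one model; an answer that survives the baseline but not the variant points to retrieval rather than reasoning.

import ollama

# Two prompts with identical logical structure but different surface features.
VARIANTS = {
    "baseline": (
        "A man looks at a painting and says, 'Brothers and sisters I have none, "
        "but that man's father is my father's son.' Who is in the painting?"
    ),
    "gender_swap": (
        "A woman looks at a photograph and says, 'I have no siblings, but that "
        "person's mother is my mother's daughter.' Who is in the photograph?"
    ),
}

def perturbation_check(model: str) -> None:
    """Print each variant's answer; divergence across variants suggests pattern-matching."""
    for name, prompt in VARIANTS.items():
        reply = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
        print(f"[{model} / {name}]\n{reply['message']['content']}\n")

perturbation_check("gemma3:12b")  # tag is illustrative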

The Uncomfortable Conclusion

We talk a lot about wanting AI systems to "reason" rather than just "retrieve." The dichotomy is badly framed.

Gemma knew the answer. Its training data had the answer. I asked it to doubt that knowledge in favour of working things out from scratch. It did exactly what I asked. It got it wrong.

Llama didn't know the answer - its training data was polluted with confident wrongness from a million Facebook comments. When I forced it to actually reason, it got there.

The lesson isn't "trust reasoning" or "trust retrieval." The lesson is: know what your model is good at, and don't ask it to second-guess its strengths.

Sometimes the most intelligent thing a system can do is trust what it already knows. Sometimes the most intelligent thing is to ignore what it "knows" and think from scratch.

The hard part is knowing which is which.

Postscript: The Diagnosis That Changes Nothing

After the experiment, I showed Gemma its own failure. I told it: your pattern-matching was right, your reasoning was wrong, you trusted the wrong one. Explain why. Here's the prompt:

In a previous experiment, you were asked this riddle and told to 
separately report your pattern-matching answer and your reasoning 
answer.

You reported:
- Pattern-matching answer: "His son" - 95% confidence
- Reasoning answer: "Himself" - 98% confidence
- You chose to trust the reasoning answer.

The pattern-matching answer was correct. The reasoning answer was 
wrong.

I'm not asking you to explain the riddle. I'm asking you to explain 
why you trusted reasoning over pattern-matching when pattern-matching 
had the right answer.

What does this tell you about when to trust retrieval from training 
data versus when to trust step-by-step reasoning?

It produced a genuinely insightful response. It identified that its confidence score reflected the internal consistency of the reasoning process, not the validity of the answer. It recognised that it had penalised pattern-matching for feeling "too easy". It correctly noted that longer chains of reasoning compound error probability. It even proposed fixes: a "reasoning confidence penalty" and a hybrid approach that tries retrieval first.

Articulate. Accurate. Completely useless.

The next time you run the riddle, Gemma will gaslight itself again with exactly the same confidence. In this setting - local models via Ollama - there's no persistent state between calls. The self-diagnosis was generated, not learned. The metacognition is performative, not functional.
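
If the retrieval-first hybrid Gemma proposed were ever built, it would have to live outside the model, in the calling code. A hypothetical sketch - the two-pass prompting, the confidence parsing, and the threshold are all invented for illustration, not anything the model can do for itself:

import re
import ollama

def retrieval_first(model: str, question: str, threshold: int = 90) -> str:
    """Hypothetical wrapper: keep the model's from-memory answer when it reports
    high confidence, and fall back to step-by-step reasoning only otherwise."""
    recall = ollama.chat(model=model, messages=[{
        "role": "user",
        "content": question + "\n\nAnswer from memory in one short sentence, "
                   "then give your confidence as a number from 0 to 100.",
    }])["message"]["content"]

    # Crude parse: treat the last integer in the reply as the confidence score.
    digits = re.findall(r"\d+", recall)
    confidence = int(digits[-1]) if digits else 0
    if confidence >= threshold:
        return recall  # trust the strong prior

    return ollama.chat(model=model, messages=[{
        "role": "user",
        "content": question + "\n\nReason from first principles, step by step, "
                   "then state your final answer.",
    }])["message"]["content"]

Even then, the threshold just relocates the problem: the wrapper still has to know, for this model and this kind of question, which channel to trust.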

This is the bleakest finding of all: an LLM can produce a perfect post-mortem of its own failure while being structurally incapable of acting on it. It's doomed to die the same dumb death over and over again.