arXiv 2606.02907 · 2026-06-07

Your "Reasoning Probe" May Just Be Reading the Format: A Cautionary arXiv Result

Subramanyam Sahoo, Vinija Jain, Aman Chadha, Divya Chaudhary

A new arXiv paper (2606.02907) reports that linear probes reach 100% accuracy at separating deductive, inductive, and abductive reasoning in Qwen3-14B hidden states, then collapse to chance once task-format confounds (source dataset, option count, and response length) are controlled for.

arxiv.org/abs/2606.02907

What the paper claims

A common move in interpretability research: train a small linear classifier (“linear probe”) on a frozen LLM’s hidden states, show it separates concept A from concept B with high accuracy, and conclude the model “represents” that distinction internally. The arXiv preprint “Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States” (2606.02907, submitted June 1, 2026, cs.CL) argues this conclusion is often unearned.
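To make the probing recipe concrete, here is a minimal sketch in Python with NumPy. It is not the authors' code: the "hidden states" are synthetic Gaussian vectors with a per-class offset, the probe is a ridge least-squares classifier chosen for brevity, and every name (fit_probe, probe_predict, class_means) is hypothetical.

```python
import numpy as np

def fit_probe(X, y, n_classes, ridge=1e-3):
    """Fit a linear map from activations to one-hot labels (ridge least squares)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])      # append a bias column
    Y = np.eye(n_classes)[y]                       # one-hot targets
    return np.linalg.solve(Xb.T @ Xb + ridge * np.eye(Xb.shape[1]), Xb.T @ Y)

def probe_predict(W, X):
    """Predict the class whose linear score is largest."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.argmax(Xb @ W, axis=1)

# Synthetic stand-in for frozen hidden states (assumption: not real model data).
rng = np.random.default_rng(0)
n_per_class, dim, n_classes = 200, 64, 3
y = np.repeat(np.arange(n_classes), n_per_class)
class_means = 2.0 * rng.normal(size=(n_classes, dim))
X = class_means[y] + rng.normal(size=(len(y), dim))

# Hold out 25% of the items; only the probe is trained, never the "model".
order = rng.permutation(len(y))
split = int(0.75 * len(y))
train, test = order[:split], order[split:]

W = fit_probe(X[train], y[train], n_classes)
test_acc = float(np.mean(probe_predict(W, X[test]) == y[test]))
print(f"held-out probe accuracy: {test_acc:.2f}")
```

The point of the sketch is how little the probe needs: any linear offset between classes is enough for high held-out accuracy, whatever produced that offset.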

The authors probe Qwen3-14B across three reasoning families — deductive (LogiQA 2.0), inductive (ARC-Challenge), and abductive (alpha-NLI). At layer 32 of 40, a linear probe hits 100% cross-validated accuracy with cleanly separated geometry. It looks like a textbook “the model encodes reasoning mode” result.

Then they stress-test it. The three datasets differ not just in reasoning type but in surface format: which corpus the item came from (source identity), how many answer options it has (option count), and how long the response runs (response length). When the authors residualize the hidden states against those three confounds — removing the variance that format alone can explain — probe accuracy drops to chance. Causal steering along the “reasoning-mode” direction produces no functional effect (reported p = 0.286). Their conclusion: the geometry was tracking task format, not a computational reasoning mode, and the underlying reasoning representations appear largely shared across the three tasks.
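The residualization step can be sketched as follows, under loud assumptions: the data are synthetic, the confound values (option counts, token lengths) are illustrative and not taken from the paper, and the helper names residualize and probe_cv_accuracy are hypothetical. The toy hidden states contain only noise plus linear leakage from the format features.

```python
import numpy as np

def residualize(H, C):
    """Subtract from H the part linearly predictable from confounds C (plus an intercept)."""
    Cb = np.hstack([C, np.ones((len(C), 1))])
    beta, *_ = np.linalg.lstsq(Cb, H, rcond=None)  # tolerates collinear confound columns
    return H - Cb @ beta

def probe_cv_accuracy(H, y, n_classes=3, k=5, ridge=1e-3, seed=0):
    """k-fold accuracy of a ridge least-squares probe on one-hot labels."""
    Hb = np.hstack([H, np.ones((len(H), 1))])
    Y = np.eye(n_classes)[y]
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    correct = 0
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        A = Hb[train].T @ Hb[train] + ridge * np.eye(Hb.shape[1])
        W = np.linalg.solve(A, Hb[train].T @ Y[train])
        correct += int(np.sum(np.argmax(Hb[test] @ W, axis=1) == y[test]))
    return correct / len(y)

# Toy data (all values hypothetical): each label comes from exactly one corpus.
rng = np.random.default_rng(0)
n_per_class, dim = 100, 32
y = np.repeat(np.arange(3), n_per_class)                      # three reasoning labels
source = np.eye(3)[y]                                         # one-hot source identity
option_count = np.array([4.0, 4.0, 2.0])[y]                   # illustrative option counts
length = rng.normal(np.array([60.0, 90.0, 120.0])[y], 10.0)   # illustrative token lengths
C = np.column_stack([source, option_count, length])

H = (rng.normal(size=(len(y), dim))                      # noise with no label signal
     + 4.0 * source @ rng.normal(size=(3, dim))          # corpus identity leaks in
     + 0.05 * np.outer(length, rng.normal(size=dim)))    # response length leaks in

H_res = residualize(H, C)
naive_acc = probe_cv_accuracy(H, y)
resid_acc = probe_cv_accuracy(H_res, y)
print(f"naive probe: {naive_acc:.2f}, after residualizing: {resid_acc:.2f}")
```

In this toy data the source corpus coincides with the label, so removing source identity also removes every class-mean difference by construction. The sketch shows the mechanics of the check, not evidence about any real model.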

Why it matters

This is a methodological landmine for any interpretability result that rests on probe accuracy alone. The pattern “probe accuracy is high, therefore the concept is represented” appears in claims about truthfulness directions, sentiment neurons, refusal directions, and reasoning-type encodings. If your positive and negative examples differ in any incidental way (length, formatting, which benchmark they came from, even tokenization quirks), a linear probe can latch onto that shortcut and still post near-perfect numbers.

The fix the authors push is simple to state and uncomfortable to adopt: routine format deconfounding. Before believing a probe, residualize against obvious nuisance variables and re-check. They also lean on causal steering as a sanity test: if intervening on the supposed concept direction does not change behavior, the direction probably is not the concept. The intrinsic-dimensionality numbers they report per task (20.6, 28.5, 33.6) and the near-chance trace-anchor agreement (42.5% vs 33.3% chance) both point the same way: the separability came from dataset structure, not from reasoning semantics.

A useful mental model: a probe measures decodability, not use. Information being linearly recoverable from activations says nothing about whether the model relies on it, or whether it reflects a clean internal concept versus a correlated artifact of how you built your dataset.
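The gap between decodability and use can be illustrated with a toy steering check. Everything here is an assumption made for illustration: the "model" is a single random linear readout built to ignore one direction (fmt_dir), and the names answers, steer, and used_dir are hypothetical. It is not the paper's steering procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_items, n_answers = 16, 200, 4

# Hypothetical "format" direction: written into the states, but the toy
# readout is constructed to be blind to it.
fmt_dir = rng.normal(size=dim)
fmt_dir /= np.linalg.norm(fmt_dir)
P_orth = np.eye(dim) - np.outer(fmt_dir, fmt_dir)     # projects fmt_dir out
W_out = P_orth @ rng.normal(size=(dim, n_answers))    # readout ignores fmt_dir

fmt_label = rng.integers(0, 2, size=n_items)          # binary format feature
H = rng.normal(size=(n_items, dim)) + np.outer(4.0 * (2 * fmt_label - 1), fmt_dir)

def answers(states):
    """Toy model behavior: pick the answer with the largest logit."""
    return np.argmax(states @ W_out, axis=1)

def steer(states, direction, alpha):
    """Add alpha times a direction to every hidden state."""
    return states + alpha * direction

# Decodability: projecting onto fmt_dir recovers the format label.
decode_acc = float(np.mean((H @ fmt_dir > 0) == (fmt_label == 1)))

# Use: steering along fmt_dir should leave the toy model's answers unchanged.
changed_fmt = float(np.mean(answers(steer(H, fmt_dir, 10.0)) != answers(H)))

# Contrast: a direction the readout does read from changes the answers.
used_dir = W_out[:, 0] / np.linalg.norm(W_out[:, 0])
changed_used = float(np.mean(answers(steer(H, used_dir, 10.0)) != answers(H)))

print(f"decodable: {decode_acc:.2f}")
print(f"answers changed steering fmt_dir: {changed_fmt:.2f}")
print(f"answers changed steering used_dir: {changed_used:.2f}")
```

A feature can be almost perfectly decodable while steering along it does nothing, which is why a steering check carries information that probe accuracy cannot.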

Stage	What they did	Result
Naive probe	Linear classifier at layer 32, three reasoning datasets	100% accuracy, well-separated
Deconfounded	Residualize on source, option count, response length	Drops to chance
Causal check	Steer along the recovered direction	No effect (p = 0.286)

Practitioner note

If you run probes — for interpretability, for a “lie detector” classifier, or for routing — treat a high accuracy number as a hypothesis, not a finding. Build a confound checklist for your dataset (source, length, option count, label-balanced templates), residualize against it, and only trust the residual signal. Pair every probe claim with a causal intervention: if steering the direction does not move behavior, you have a correlation, not a mechanism. And prefer matched-format negatives — same template, same length distribution, same corpus — so the only thing that varies is the concept you care about.
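One cheap way to start the confound checklist is to ask how well each format feature predicts the label on its own, before touching any activations. The sketch below assumes a small hypothetical manifest of (source, option count, label) rows; the option counts are illustrative, the 0.1 margin is arbitrary, and confound_only_accuracy is a made-up helper.

```python
from collections import Counter, defaultdict

def confound_only_accuracy(confound_values, labels):
    """In-sample accuracy of guessing the majority label within each confound value."""
    by_value = defaultdict(Counter)
    for value, label in zip(confound_values, labels):
        by_value[value][label] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return correct / len(labels)

# Hypothetical dataset manifest: (source corpus, option count, label).
items = [
    ("LogiQA 2.0", 4, "deductive"),
    ("LogiQA 2.0", 4, "deductive"),
    ("ARC-Challenge", 4, "inductive"),
    ("ARC-Challenge", 4, "inductive"),
    ("alpha-NLI", 2, "abductive"),
    ("alpha-NLI", 2, "abductive"),
]
labels = [label for _, _, label in items]

# Baseline: always guess the most common label.
chance = Counter(labels).most_common(1)[0][1] / len(labels)

report = {
    "source": confound_only_accuracy([s for s, _, _ in items], labels),
    "option_count": confound_only_accuracy([k for _, k, _ in items], labels),
}
for name, acc in report.items():
    # The 0.1 margin is an arbitrary illustrative threshold.
    flag = "CONFOUNDED" if acc > chance + 0.1 else "ok"
    print(f"{name}: {acc:.2f} (majority baseline {chance:.2f}) {flag}")
```

If a format feature alone predicts the label well above the majority baseline, a probe on activations cannot tell the concept apart from that feature without matched-format data or residualization.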

Under-considered angle

The deeper unease is that this failure mode may scale with model capability, not against it. If larger models encode richer surface statistics, incidental format features become more linearly separable, and spurious probes look most convincing on exactly the frontier systems people most want to interpret. The result summarized here covers one model (Qwen3-14B), so this is a conjecture rather than a finding, but it would invert the usual intuition that bigger, better models are easier to study. It also raises a quiet question for safety tooling built on probes, such as truthfulness or deception classifiers: a clean accuracy curve might be reading the prompt’s shape rather than the model’s intent, and would fail silently the moment an adversary controls the format.