Builder Daily

arXiv 2604.22074 · 2026-04-23

Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning

Qinan Yu, Alexa Tartaglini, Peter Hase

Introduces CIR (Causal Importance of Reasoning) and SR (Sufficiency of Reasoning) metrics. RLVR improves accuracy but does not reliably improve CIR or SR.

arxiv.org/abs/2604.22074 ↗


The paper introduces two metrics, Causal Importance of Reasoning (CIR) and Sufficiency of Reasoning (SR), to test whether the chain-of-thought produced by models trained with reinforcement learning with verifiable rewards (RLVR) actually drives the final answer.

Finding: RLVR improves task accuracy but does not reliably improve CIR or SR. To fix this without losing accuracy, the authors propose either supervised fine-tuning combined with outcome rewards, or auxiliary CIR/SR rewards.

Practitioner note (mine)

This is a direct warning for anyone training reasoning models with RLVR. Your model can get more right answers while its reasoning trace becomes less causally faithful. Two implications:

  1. Interpretability: if you rely on chain-of-thought for debugging or auditing, RLVR-trained traces may have become less reliable without any visible signal.
  2. Agent safety: in agentic systems where tool-use authority is granted based on the model's stated reasoning, post-hoc rationalization (a trace written to justify an answer the model reached by other means) becomes a real failure mode.

Adding CIR/SR-style checks to your eval suite is cheap insurance.
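As a concrete starting point, here is a minimal sketch of a CIR-style check: corrupt the reasoning trace and measure how often the final answer changes. This is illustrative only; the `cir_score` function, the `model(prompt, reasoning)` interface, and the corruption strategy are my assumptions, not the paper's actual protocol.

```python
def cir_score(model, examples, corrupt):
    """Fraction of examples where corrupting the reasoning trace changes
    the final answer. Higher means the trace is more causally important.

    model(prompt, reasoning) -> answer   (hypothetical interface)
    corrupt(reasoning) -> perturbed trace
    """
    changed = 0
    for prompt, reasoning in examples:
        if model(prompt, corrupt(reasoning)) != model(prompt, reasoning):
            changed += 1
    return changed / len(examples)


# Two toy models illustrating the extremes:
# model_a ignores its trace entirely (pure post-hoc rationalization), so CIR = 0.
model_a = lambda prompt, reasoning: "42"
# model_b reads its answer off the end of the trace, so corruption flips it.
model_b = lambda prompt, reasoning: reasoning.split()[-1]

corrupt = lambda reasoning: "irrelevant steps 0"  # swap in an unrelated trace

examples = [("6*7?", "6 times 7 is 42"), ("40+2?", "40 plus 2 is 42")]
print(cir_score(model_a, examples, corrupt))  # 0.0
print(cir_score(model_b, examples, corrupt))  # 1.0
```

A real harness would run this over held-out prompts before and after RLVR training and alert on a drop in the score, the same way you would track accuracy regressions.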
