arXiv 2604.19295 · 2026-04-21
TEMPO: Scaling Test-time Training for Large Reasoning Models
Qingyang Zhang, Xinke Kong, Haitao Wu
Test-time training framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration. OLMO3-7B 33.0% → 51.1% on AIME 2024.
TEMPO is a test-time training (TTT) framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset, formalized as an expectation-maximization (EM) procedure.
Reported numbers: OLMO3-7B goes from 33.0% to 51.1% on AIME 2024 and Qwen3-14B from 42.3% to 65.8%, with output diversity preserved.
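The abstract names the moving parts without the mechanics, but the interleaving is concrete enough to sketch. Below is a minimal Python skeleton of what such a loop could look like, assuming the E-step-like phase refines the policy against critic-weighted samples and the M-step-like phase re-anchors the critic on labeled data. Every name and hyperparameter here (policy_step, critic_score, critic_recalibrate, sample_answers, recalibrate_every) is my assumption, not the paper's API.

```python
# Hypothetical sketch of a TEMPO-style test-time training loop.
# Structure inferred from the abstract only; all names are mine.

from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class TTTConfig:
    samples_per_question: int = 8   # candidate answers drawn per unlabeled question
    recalibrate_every: int = 50     # policy steps between critic recalibrations

def tempo_ttt_loop(
    policy_step: Callable[[str, Sequence[str], Sequence[float]], None],
    critic_score: Callable[[str, str], float],
    critic_recalibrate: Callable[[Sequence[tuple[str, str]]], None],
    sample_answers: Callable[[str, int], list[str]],
    unlabeled_questions: Sequence[str],
    labeled_data: Sequence[tuple[str, str]],
    cfg: TTTConfig = TTTConfig(),
) -> None:
    """Interleave policy refinement on unlabeled questions with
    periodic critic recalibration on labeled (question, answer) pairs."""
    for step, question in enumerate(unlabeled_questions, start=1):
        # E-step analogue: draw candidate answers, weight them by critic score.
        candidates = sample_answers(question, cfg.samples_per_question)
        weights = [critic_score(question, a) for a in candidates]
        # Policy update: push the policy toward high-scoring candidates.
        policy_step(question, candidates, weights)
        # Periodically re-anchor the critic on labeled data so it doesn't
        # drift as the policy's output distribution shifts under TTT updates.
        if step % cfg.recalibrate_every == 0:
            critic_recalibrate(labeled_data)
```

The periodic recalibration is the interesting design choice in this reading: without it, the critic's scores go stale as the policy's output distribution shifts under test-time updates, and the loop ends up optimizing against a drifting reward.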
Practitioner note (mine)
TTT (actually updating model parameters at inference time) has been a research curiosity for years. With numbers like these, it's becoming a real lever for reasoning gains.
For builders this matters mostly as a strategic question: when does it make more sense to fine-tune offline versus adapt online in production? TEMPO suggests the answer is shifting toward online adaptation for hard reasoning tasks where a wrong answer is costly (math, theorem proving, complex code review). For low-stakes throughput work, offline still wins on cost.