arXiv 2604.19295 · 2026-04-21
TEMPO: Scaling Test-time Training for Large Reasoning Models
Qingyang Zhang, Xinke Kong, Haitao Wu
Test-time training framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration. OLMO3-7B 33.0% → 51.1% on AIME 2024.
TEMPO is a test-time training (TTT) framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset, formalized as an expectation-maximization (EM) procedure.
Reported numbers: OLMO3-7B goes from 33.0% to 51.1% on AIME 2024 and Qwen3-14B from 42.3% to 65.8%, with output diversity preserved.
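The abstract names the moving parts without the mechanics, but the interleaving is concrete enough to sketch. Below is a minimal Python skeleton of what such a loop could look like, assuming the E-step-like phase refines the policy against critic-weighted samples and the M-step-like phase re-anchors the critic on labeled data. Every name and hyperparameter here (policy_step, critic_score, critic_recalibrate, sample_answers, recalibrate_every) is my assumption, not the paper's API.

```python
# Hypothetical sketch of a TEMPO-style test-time training loop.
# Structure inferred from the abstract only; all names are mine.

from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class TTTConfig:
    samples_per_question: int = 8   # candidate answers drawn per unlabeled question
    recalibrate_every: int = 50     # policy steps between critic recalibrations

def tempo_ttt_loop(
    policy_step: Callable[[str, Sequence[str], Sequence[float]], None],
    critic_score: Callable[[str, str], float],
    critic_recalibrate: Callable[[Sequence[tuple[str, str]]], None],
    sample_answers: Callable[[str, int], list[str]],
    unlabeled_questions: Sequence[str],
    labeled_data: Sequence[tuple[str, str]],
    cfg: TTTConfig = TTTConfig(),
) -> None:
    """Interleave policy refinement on unlabeled questions with
    periodic critic recalibration on labeled (question, answer) pairs."""
    for step, question in enumerate(unlabeled_questions, start=1):
        # E-step analogue: draw candidate answers, weight them by critic score.
        candidates = sample_answers(question, cfg.samples_per_question)
        weights = [critic_score(question, a) for a in candidates]
        # Policy update: push the policy toward high-scoring candidates.
        policy_step(question, candidates, weights)
        # Periodically re-anchor the critic on labeled data so it doesn't
        # drift as the policy's output distribution shifts under TTT updates.
        if step % cfg.recalibrate_every == 0:
            critic_recalibrate(labeled_data)
```

The periodic recalibration is the interesting design choice in this reading: without it, the critic's scores go stale as the policy's output distribution shifts under test-time updates, and the loop ends up optimizing against a drifting reward.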
Practitioner note (mine)
TTT (actually updating model parameters at inference time) has been a research curiosity for years. With numbers like these, it's becoming a real lever for reasoning gains.
For builders this matters mostly as a strategic question: when does it make more sense to fine-tune offline versus adapt online in production? TEMPO suggests the answer is shifting toward online adaptation for hard reasoning tasks where a wrong answer is costly (math, theorem proving, complex code review). For low-stakes throughput work, offline still wins on cost.