arXiv 2605.08083 · 2026-05-21
AutoTTS — LLM agents discover their own test-time scaling strategies for $39.90
Tong Zheng, Haolin Liu, Chengsong Huang, Sheng Zhang, Hongming Zhang, Heng Huang · University of Maryland et al.
AutoTTS reframes test-time scaling as a controller-synthesis problem: an LLM agent discovers when to branch, continue, probe, prune, or stop — instead of hand-tuned best-of-N. The full discovery loop costs $39.90 / 160 min and beats manual baselines on the accuracy-cost tradeoff.
AutoTTS (arXiv 2605.08083) attacks a problem every team running reasoning models hits: how much test-time compute should you spend per query, and how? Today that’s hand-tuned — fixed best-of-N sampling, static tree search, a magic number someone picked. AutoTTS makes the recipe itself something an LLM agent discovers automatically.
The reframe
Test-time scaling (TTS) is reframed as a controller-synthesis problem over pre-collected reasoning trajectories. A learned controller decides, at each step, whether to:
- branch (explore multiple continuations)
- continue (extend the current path)
- probe (cheap lookahead)
- prune (kill a weak branch)
- stop (commit to an answer)
A beta-parameterization plus execution-trace feedback makes the discovery loop efficient.
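To make the five-way decision concrete, here is a minimal sketch of what a controller interface could look like. It is not the paper's controller: the source does not specify the state features or the beta-parameterization, so `BranchState`, its fields, and the threshold rule in `ThresholdController` are all hypothetical. The point is only that once the policy is reduced to a handful of numbers, those numbers become something a search loop can tune.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    BRANCH = auto()    # explore multiple continuations
    CONTINUE = auto()  # extend the current path
    PROBE = auto()     # cheap lookahead
    PRUNE = auto()     # kill a weak branch
    STOP = auto()      # commit to an answer


@dataclass
class BranchState:
    """Hypothetical per-branch summary a controller might observe."""
    confidence: float  # assumed score in [0, 1], e.g. produced by a probe
    tokens_used: int   # compute already spent on this branch
    probed: bool       # whether a probe has been run on this branch yet


@dataclass
class ThresholdController:
    """Toy rule-based policy; the four fields are the searchable parameters."""
    stop_at: float = 0.9
    prune_below: float = 0.2
    branch_below: float = 0.5
    token_budget: int = 4096

    def decide(self, state: BranchState) -> Action:
        # Out of budget: commit to whatever answer we have.
        if state.tokens_used >= self.token_budget:
            return Action.STOP
        # No confidence estimate yet: buy one with a cheap lookahead.
        if not state.probed:
            return Action.PROBE
        if state.confidence >= self.stop_at:
            return Action.STOP
        if state.confidence < self.prune_below:
            return Action.PRUNE
        # Uncertain but not hopeless: spread the bet across continuations.
        if state.confidence < self.branch_below:
            return Action.BRANCH
        return Action.CONTINUE
```

A discovery loop in this framing would replay such a policy against pre-collected trajectories, score the resulting accuracy and cost, and propose new parameter values or new rules.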
The headline number
The entire strategy-discovery process costs $39.90 and takes 160 minutes of compute. The discovered controllers show the following:
| Dimension | Result |
|---|---|
| vs. hand-designed baselines | beat them on the accuracy-cost tradeoff |
| Generalization | transfers to held-out math benchmarks |
| Model scales | works across different model sizes |
Why it matters
Most test-time-compute work hand-tunes one inference recipe per task. AutoTTS treats orchestration logic as searchable, pointing toward self-improving inference pipelines where the how-much-to-think policy is learned, not authored.
The near-trivial discovery cost is the real signal: at ~$40 a run, this kind of meta-optimization is cheap enough to run routinely — per product, per task class, per model upgrade — rather than as a one-off research artifact. It connects directly to the Recursive Superintelligence thesis: AI optimizing its own inference loop, at a cost that makes it a default rather than an experiment.
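As a back-of-the-envelope check on "cheap enough to run routinely," the sketch below scales the two reported figures linearly. It assumes every rerun costs the same as the reported one and that runs happen one after another; neither assumption comes from the paper, and the run count in the example is purely illustrative.

```python
COST_CENTS_PER_RUN = 3990  # $39.90, the reported discovery cost
MINUTES_PER_RUN = 160      # the reported discovery time


def discovery_budget(runs: int) -> tuple[float, float]:
    """Return (total dollars, total hours) for `runs` sequential discovery runs.

    Assumes each run costs exactly what the reported run did.
    """
    dollars = runs * COST_CENTS_PER_RUN / 100
    hours = runs * MINUTES_PER_RUN / 60
    return dollars, hours


# Illustrative cadence: 4 task classes, rerun on 3 model upgrades.
dollars, hours = discovery_budget(4 * 3)
print(f"${dollars:.2f} over {hours:.0f} compute-hours")
```

Under those assumptions, twelve reruns come to $478.80 and 32 hours of compute.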
Practitioner note
For teams running reasoning models in production:
- Audit whether your test-time compute is hand-tuned. If you picked best-of-8 because it “felt right,” you’re likely on the wrong point of the accuracy-cost curve. AutoTTS-style discovery finds the point empirically.
- The cost lever is per-query compute, and it’s underexploited. Most teams optimize the model choice and the prompt, then leave the inference orchestration static. That’s the layer AutoTTS shows is learnable — and where the cost savings hide.
- Watch for this landing in inference frameworks. A $40 discovery loop is cheap enough that vLLM/SGLang-style serving stacks could ship learned TTS controllers as a feature. If they do, hand-tuned best-of-N becomes legacy.
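The audit in the first bullet can start with data you already have. The sketch below replays pre-collected samples to trace an accuracy-cost curve for best-of-N and then picks the N with the best trade-off. It is a manual sweep, not AutoTTS. Majority voting stands in for whatever answer selector you actually use, and the data layout, the flat `tokens_per_sample` cost, and the `penalty_per_1k_tokens` exchange rate are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(answers):
    """Most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def sweep_best_of_n(samples, gold, tokens_per_sample, n_values):
    """Accuracy and per-query token cost of majority-vote best-of-N, for each N.

    samples: one list of pre-collected answers per query (hypothetical layout)
    gold: the reference answer for each query
    tokens_per_sample: assumed flat token cost of a single sample
    """
    curve = []
    for n in n_values:
        correct = sum(majority_vote(s[:n]) == g for s, g in zip(samples, gold))
        curve.append({
            "n": n,
            "accuracy": correct / len(gold),
            "tokens": n * tokens_per_sample,
        })
    return curve


def pick_operating_point(curve, penalty_per_1k_tokens):
    """Point maximizing accuracy minus a linear token penalty."""
    return max(
        curve,
        key=lambda p: p["accuracy"] - penalty_per_1k_tokens * p["tokens"] / 1000,
    )


# Toy data: three queries, eight pre-collected answers each.
samples = [
    ["4", "4", "5", "4", "4", "4", "4", "4"],  # right from the first sample
    ["7", "9", "9", "9", "7", "9", "9", "9"],  # needs a few votes to flip
    ["1", "2", "3", "1", "2", "3", "1", "2"],  # more samples never help
]
gold = ["4", "9", "2"]

curve = sweep_best_of_n(samples, gold, tokens_per_sample=500, n_values=[1, 2, 4, 8])
best = pick_operating_point(curve, penalty_per_1k_tokens=0.05)
for point in curve:
    print(point)
print("chosen N:", best["n"])
```

On this toy data, accuracy plateaus at N=4, so going to N=8 doubles the token spend for no gain. That is the kind of flat region a hand-picked best-of-8 can sit on without anyone noticing.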
The under-considered angle: the orchestration layer is the next thing to be automated, after the model and the prompt. We’ve watched prompt engineering get systematized and model selection become routing. Test-time compute orchestration is the last hand-tuned layer in the inference stack — and AutoTTS is an early signal that it, too, becomes a learned policy rather than a human guess.