arXiv 2605.08083 · 2026-05-21

AutoTTS — LLM agents discover their own test-time scaling strategies for $39.90

Tong Zheng, Haolin Liu, Chengsong Huang, Sheng Zhang, Hongming Zhang, Heng Huang · University of Maryland et al.

AutoTTS reframes test-time scaling as a controller-synthesis problem: an LLM agent discovers when to branch, continue, probe, prune, or stop — instead of hand-tuned best-of-N. The full discovery loop costs $39.90 / 160 min and beats manual baselines on the accuracy-cost tradeoff.

arxiv.org/abs/2605.08083

AutoTTS (arXiv 2605.08083) attacks a problem every team running reasoning models hits: how much test-time compute should you spend per query, and how? Today that’s hand-tuned — fixed best-of-N sampling, static tree search, a magic number someone picked. AutoTTS makes the recipe itself something an LLM agent discovers automatically.

The reframe

Test-time scaling (TTS) is reframed as a controller-synthesis problem over pre-collected reasoning trajectories. A learned controller decides, at each step, whether to:

branch (explore multiple continuations)
continue (extend the current path)
probe (cheap lookahead)
prune (kill a weak branch)
stop (commit to an answer)

A beta-parameterization plus execution-trace feedback makes the discovery loop efficient.
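To make the five-way decision concrete, here is a minimal Python sketch of the controller interface. It is not the paper's method: in AutoTTS the policy is discovered by an agent, whereas `toy_controller` below is a hand-written placeholder. The `BranchState` fields, the `score` signal, and every threshold are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    BRANCH = auto()    # explore multiple continuations
    CONTINUE = auto()  # extend the current path
    PROBE = auto()     # cheap lookahead
    PRUNE = auto()     # kill a weak branch
    STOP = auto()      # commit to an answer


@dataclass
class BranchState:
    # All fields are hypothetical; the article does not specify the state.
    score: float          # assumed confidence estimate in [0, 1]
    steps_used: int       # reasoning steps spent on this branch so far
    probed: bool = False  # whether a cheap lookahead has already run


def toy_controller(state: BranchState, n_branches: int, budget_left: int) -> Action:
    """Hand-written placeholder policy with arbitrary thresholds."""
    if budget_left <= 0 or state.score >= 0.9:
        return Action.STOP       # out of budget, or confident enough to answer
    if state.score < 0.2 and n_branches > 1:
        return Action.PRUNE      # drop a weak branch while others remain
    if not state.probed:
        return Action.PROBE      # look ahead cheaply before spending more
    if state.score < 0.5 and n_branches < 4:
        return Action.BRANCH     # uncertain: explore alternatives
    return Action.CONTINUE       # otherwise extend the current path


# Example: an uncertain, already-probed branch triggers exploration.
print(toy_controller(BranchState(score=0.4, steps_used=3, probed=True), 2, 10))
# prints Action.BRANCH
```

The point of the sketch is the shape of the search space: a discovery loop would replace the body of `toy_controller`, not the action set.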

The headline number

The entire strategy-discovery process costs $39.90 and 160 minutes of compute — and the discovered controllers:

Aspect	Result
vs. hand-designed baselines	beat them on the accuracy-cost tradeoff
Generalization	transfer to held-out math benchmarks
Model scales	work across different model sizes
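One common way to read "beats on the accuracy-cost tradeoff" is Pareto dominance: a strategy wins if it is no worse on both cost and accuracy and strictly better on at least one. The sketch below assumes that reading. The `Point` type, the strategy names, and all numbers are made up for illustration and are not results from the paper.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    cost: float      # e.g. average dollars or tokens per query
    accuracy: float  # fraction of queries answered correctly


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.cost <= b.cost and a.accuracy >= b.accuracy
    strictly_better = a.cost < b.cost or a.accuracy > b.accuracy
    return no_worse and strictly_better


def pareto_front(points: list[Point]) -> list[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Invented numbers, purely to exercise the functions.
points = [
    Point("strategy_a", 8.0, 0.70),
    Point("strategy_b", 2.0, 0.62),
    Point("strategy_c", 3.0, 0.71),
    Point("strategy_d", 4.0, 0.66),
]
print([p.name for p in pareto_front(points)])
# prints ['strategy_b', 'strategy_c']
```

Here `strategy_c` dominates both `strategy_a` and `strategy_d`, while `strategy_b` survives because it is cheaper than everything else.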

Why it matters

Most test-time-compute work hand-tunes one inference recipe per task. AutoTTS treats orchestration logic as searchable, pointing toward self-improving inference pipelines where the how-much-to-think policy is learned, not authored.

The near-trivial discovery cost is the real signal: at ~$40 a run, this kind of meta-optimization is cheap enough to run routinely — per product, per task class, per model upgrade — rather than as a one-off research artifact. It connects directly to the Recursive Superintelligence thesis: AI optimizing its own inference loop, at a cost that makes it a default rather than an experiment.
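The "cheap enough to run routinely" argument is a break-even calculation: a one-off discovery cost pays for itself once cumulative per-query savings exceed it. A small sketch follows, assuming savings are constant per query. The `saving_per_query_usd` parameter is hypothetical; the article reports only the $39.90 discovery cost, not per-query savings.

```python
import math

DISCOVERY_COST_USD = 39.90  # one-off discovery cost reported in the article


def break_even_queries(saving_per_query_usd: float,
                       discovery_cost_usd: float = DISCOVERY_COST_USD) -> int:
    """Smallest number of queries whose cumulative savings cover discovery."""
    if saving_per_query_usd <= 0:
        raise ValueError("saving_per_query_usd must be positive")
    return math.ceil(discovery_cost_usd / saving_per_query_usd)


# Hypothetical saving of $0.50 per query, chosen only for illustration.
print(break_even_queries(0.50))
# prints 80
```

The same function can be rerun per product, task class, or model upgrade, since each rerun of discovery resets the one-off cost.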

Practitioner note

For teams running reasoning models in production:

Audit whether your test-time compute is hand-tuned. If you picked best-of-8 because it “felt right,” you’re likely at the wrong point on the accuracy-cost curve. AutoTTS-style discovery finds the point empirically.
The cost lever is per-query compute, and it’s underexploited. Most teams optimize the model choice and the prompt, then leave the inference orchestration static. That’s the layer AutoTTS shows is learnable — and where the cost savings hide.
Watch for this landing in inference frameworks. A $40 discovery loop is cheap enough that vLLM/SGLang-style serving stacks could ship learned TTS controllers as a feature. If they do, hand-tuned best-of-N becomes legacy.
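The audit suggested in the first point can start from logged samples, without any learned controller. The sketch below assumes you have several sampled answers per query plus gold answers, and it uses majority voting as the selection rule, which is only one variant of best-of-N. The function names and the toy data are hypothetical.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n_curve(samples: dict[str, list[str]],
                    gold: dict[str, str],
                    ns: list[int]) -> list[tuple[int, float]]:
    """Majority-vote accuracy using the first n logged samples per query."""
    curve = []
    for n in ns:
        correct = sum(
            majority_vote(answers[:n]) == gold[qid]
            for qid, answers in samples.items()
        )
        curve.append((n, correct / len(samples)))
    return curve


# Toy data, invented for illustration only.
samples = {
    "q1": ["4", "4", "5", "4"],
    "q2": ["7", "9", "9", "9"],
    "q3": ["1", "2", "3", "3"],
}
gold = {"q1": "4", "q2": "9", "q3": "3"}

for n, acc in best_of_n_curve(samples, gold, ns=[1, 2, 4]):
    print(f"N={n}: accuracy={acc:.2f}")
```

Plotting accuracy against N (as a proxy for per-query cost) shows where extra samples stop paying off, which is the baseline any learned controller would need to beat.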

The under-considered angle: the orchestration layer is the next thing to be automated, after the model and the prompt. We’ve watched prompt engineering get systematized and model selection become routing. Test-time compute orchestration is the last hand-tuned layer in the inference stack — and AutoTTS is an early signal that it, too, becomes a learned policy rather than a human guess.