Skip to content
AI-Daily-Builder

arXiv 2605.08083·2026-05-21 views

AutoTTS — LLM agents discover their own test-time scaling strategies for $39.90

Tong Zheng, Haolin Liu, Chengsong Huang, Sheng Zhang, Hongming Zhang, Heng Huang · University of Maryland et al.

AutoTTS reframes test-time scaling as a controller-synthesis problem: an LLM agent discovers when to branch, continue, probe, prune, or stop — instead of hand-tuned best-of-N. The full discovery loop costs $39.90 / 160 min and beats manual baselines on the accuracy-cost tradeoff.

arxiv.org/abs/2605.08083 ↗


AutoTTS (arXiv 2605.08083) attacks a problem every team running reasoning models hits: how much test-time compute should you spend per query, and how? Today that’s hand-tuned — fixed best-of-N sampling, static tree search, a magic number someone picked. AutoTTS makes the recipe itself something an LLM agent discovers automatically.

The reframe

Test-time scaling (TTS) is reframed as a controller-synthesis problem over pre-collected reasoning trajectories. A learned controller decides, at each step, whether to:

A beta-parameterization plus execution-trace feedback makes the discovery loop efficient.

The headline number

The entire strategy-discovery process costs $39.90 and 160 minutes of compute — and the discovered controllers:

Result
vs. hand-designed baselinesbeat them on the accuracy-cost tradeoff
Generalizationtransfers to held-out math benchmarks
Model scalesworks across different model sizes

Why it matters

Most test-time-compute work hand-tunes one inference recipe per task. AutoTTS treats orchestration logic as searchable, pointing toward self-improving inference pipelines where the how-much-to-think policy is learned, not authored.

The near-trivial discovery cost is the real signal: at ~$40 a run, this kind of meta-optimization is cheap enough to run routinely — per product, per task class, per model upgrade — rather than as a one-off research artifact. It connects directly to the Recursive Superintelligence thesis: AI optimizing its own inference loop, at a cost that makes it a default rather than an experiment.

Practitioner note

For teams running reasoning models in production:

The under-considered angle: the orchestration layer is the next thing to be automated, after the model and the prompt. We’ve watched prompt engineering get systematized and model selection become routing. Test-time compute orchestration is the last hand-tuned layer in the inference stack — and AutoTTS is an early signal that it, too, becomes a learned policy rather than a human guess.

Tip