arXiv 2604.16529 · 2026-04-16

Scaling test-time compute for agentic coding

Joongwon Kim, Wannan Yang, Kelvin Niu

Argues that test-time scaling for long-horizon coding agents is a representation problem, not a sampling problem. Claude Opus 4.5 rises from 70.9% to 77.6% on SWE-Bench Verified.

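The paper reframes test-time compute (TTC) for agentic coding as a representation problem rather than a sampling problem. Each rollout is converted into a structured summary (hypotheses, progress, failure modes); multiple rollouts are then combined using Recursive Tournament Voting (parallel) and Parallel-Distill-Refine (sequential).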
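To make the "structured summary" idea concrete, here is a minimal sketch of what a per-rollout record could look like. Only the three field names (hypotheses, progress, failure modes) come from the description above; the class name `RolloutSummary`, the `rollout_id` field, and the plain-text rendering are my own assumptions, not the paper's schema.

```python
from dataclasses import dataclass, field


@dataclass
class RolloutSummary:
    """Hypothetical structured summary of a single agent rollout."""

    rollout_id: str
    hypotheses: list[str] = field(default_factory=list)     # what the agent believed the cause was
    progress: list[str] = field(default_factory=list)       # what it actually accomplished
    failure_modes: list[str] = field(default_factory=list)  # what went wrong or stayed unresolved

    def render(self) -> str:
        """Render the summary as compact plain text for use in a later prompt."""

        def section(title: str, items: list[str]) -> str:
            body = "\n".join(f"- {item}" for item in items) or "- (none)"
            return f"{title}:\n{body}"

        return "\n".join([
            f"Rollout {self.rollout_id}",
            section("Hypotheses", self.hypotheses),
            section("Progress", self.progress),
            section("Failure modes", self.failure_modes),
        ])


# Example record (illustrative content, not taken from any real run).
summary = RolloutSummary(
    rollout_id="r1",
    hypotheses=["Off-by-one in pagination cursor"],
    progress=["Reproduced the failing test", "Patched cursor increment"],
)
print(summary.render())
```

The point of the fixed shape is that every rollout, however long its transcript, collapses to a record of comparable size that later stages can read side by side.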

Numbers (as reported in the abstract): with Claude Opus 4.5 on mini-SWE-agent, SWE-Bench Verified goes from 70.9% to 77.6%, and Terminal-Bench v2.0 from 46.9% to 59.1%.

Practical notes (mine)

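"Structured summaries as the unit of compute" is the part I would actually adopt. Most TTC recipes amount to "sample N, then vote"; this paper formalizes a way to compress past trajectories so that later rollouts build on what has already been explored. If you run a multi-rollout agent harness, replacing raw transcripts with structured per-rollout summaries before voting is a low-effort upgrade with a measurable payoff.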
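As a rough illustration of voting over summaries instead of raw transcripts, the sketch below builds one ballot from already-rendered summaries and takes a majority over repeated judge calls. Everything here is an assumption of mine: the `Judge` interface, the function names, and the tie-break rule are hypothetical, and `stub_judge` merely stands in for an LLM call.

```python
from collections import Counter
from typing import Callable

# Hypothetical interface: a judge reads the ballot text and returns the
# index of the candidate it prefers. In a real harness this wraps an LLM call.
Judge = Callable[[str], int]


def build_ballot(summaries: list[str]) -> str:
    """Concatenate per-rollout summaries into one numbered ballot."""
    return "\n\n".join(f"[Candidate {i}]\n{s}" for i, s in enumerate(summaries))


def vote_on_summaries(summaries: list[str], judge: Judge, n_votes: int = 5) -> int:
    """Query the judge n_votes times and return the most frequently chosen index."""
    if not summaries:
        raise ValueError("need at least one rollout summary")
    ballot = build_ballot(summaries)
    tally = Counter()
    for _ in range(n_votes):
        choice = judge(ballot)
        if 0 <= choice < len(summaries):  # silently drop out-of-range votes
            tally[choice] += 1
    if not tally:
        raise RuntimeError("judge returned no valid votes")
    # Ties break toward the lowest index so the result is deterministic.
    return min(tally, key=lambda i: (-tally[i], i))


def stub_judge(ballot: str) -> int:
    """Stand-in for an LLM: prefer the first candidate that reports passing tests."""
    for i, block in enumerate(ballot.split("\n\n")):
        if "tests pass" in block:
            return i
    return 0


winner = vote_on_summaries(
    ["Progress: patch written, tests fail", "Progress: patch written, tests pass"],
    stub_judge,
    n_votes=3,
)
```

The ballot stays short because it contains summaries only, which is what makes repeated judge calls affordable compared with feeding full transcripts.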

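Recursive Tournament Voting is the more engineering-heavy piece and needs careful implementation; the structured summary as an intermediate form, however, can ship in a day.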
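For orientation only, here is a generic recursive tournament over candidates: split into small groups, let a judge pick one winner per group, and recurse on the winners. This is my reading of the name, not the paper's procedure, which may group, judge, or recurse differently. `tournament_select`, `group_size`, and the judge signature are all hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def tournament_select(
    candidates: Sequence[T],
    judge: Callable[[Sequence[T]], int],  # hypothetical: returns winner's index within the group
    group_size: int = 2,
) -> T:
    """Reduce candidates to a single winner through repeated group-wise judging."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")

    pool = list(candidates)
    if len(pool) == 1:
        return pool[0]

    winners = []
    for start in range(0, len(pool), group_size):
        group = pool[start:start + group_size]
        if len(group) == 1:
            winners.append(group[0])  # odd one out gets a bye to the next round
            continue
        idx = judge(group)
        if not 0 <= idx < len(group):
            raise ValueError("judge returned an out-of-range index")
        winners.append(group[idx])

    # Each round shrinks the pool, so the recursion terminates.
    return tournament_select(winners, judge, group_size)


def pick_highest(group: Sequence[int]) -> int:
    """Toy judge for the demo: the candidate with the largest score wins."""
    return max(range(len(group)), key=lambda i: group[i])


best = tournament_select([3, 9, 4, 7, 1], pick_highest)
```

Each group's judging is independent of the others within a round, so a round can be run in parallel. The parts that need care in practice are the ones this sketch hides inside `judge`: prompt design, position bias, and handling malformed responses.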