bench-runner — replays the benchmark suite when new models drop
Re-runs every benchmark JSON in /benchmarks/ against new model endpoints, captures latency, cost, and output, and opens a PR with diffs against the prior leader.
cp .claude/agents/bench-runner.md ~/.claude/agents/
What this agent does
When a new frontier model ships, bench-runner replays every existing benchmark in /benchmarks/ against the new endpoint, captures structured results, and opens a PR with diff annotations so the operator can see what changed without manual transcription.
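A minimal sketch of that replay loop, with the provider call abstracted behind a caller-supplied function since the actual request code depends on each endpoint; the paths and names here are illustrative, not the agent's real interface:

```python
import json
from pathlib import Path
from typing import Callable

def replay_suite(model: str, run_one: Callable[[str, dict], dict],
                 bench_dir: str = "benchmarks") -> list[dict]:
    """Replay every benchmark JSON in bench_dir against one model endpoint.

    run_one is whatever callable actually hits the provider API and returns
    the per-run artifact (latency, tokens, cost, verdict, response).
    """
    results = []
    for path in sorted(Path(bench_dir).glob("*.json")):
        bench = json.loads(path.read_text())
        artifact = run_one(model, bench)      # one model x benchmark cell
        artifact["benchmark"] = path.stem
        results.append(artifact)
    return results
```

The PR-opening and diff-annotation step is deliberately left out of the sketch; it sits downstream of this loop and depends on the repository's tooling.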
How it ensures fairness
- Same prompt, byte-identical. No paraphrasing or “tuning” per provider.
- Same input. Long-context benchmarks load the same source document.
- Latency measured client-side, from API call start to last byte received. Network jitter normalized by running 3× and taking median.
- Cost computed from public $/Mtok pricing at run time, captured in a pricing.json snapshot for reproducibility (a sketch of the timing and cost math follows this list).
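A sketch of how the median-of-3 timing and the cost calculation could look, assuming a blocking call_model function (so elapsed time ends when the last byte is received) and a pricing.json keyed by model with input_per_mtok / output_per_mtok fields; both of those names are assumptions, not the agent's actual schema:

```python
import json
import time

def timed_call(call_model, model: str, prompt: str, runs: int = 3) -> dict:
    """Send the same byte-identical prompt `runs` times; keep the median latency."""
    samples = []
    for _ in range(runs):
        start = time.monotonic()
        response = call_model(model, prompt)   # hypothetical blocking provider call
        samples.append((time.monotonic() - start, response))
    samples.sort(key=lambda s: s[0])
    latency_s, response = samples[len(samples) // 2]   # median of 3
    return {"latency_ms": round(latency_s * 1000), "response": response}

def cost_usd(model: str, tokens_in: int, tokens_out: int,
             pricing_path: str = "pricing.json") -> float:
    """Cost from the pricing snapshot captured at run time ($ per Mtok)."""
    with open(pricing_path) as f:
        pricing = json.load(f)[model]          # assumed per-model price entries
    return (tokens_in * pricing["input_per_mtok"]
            + tokens_out * pricing["output_per_mtok"]) / 1_000_000
```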
Per-run artifacts
For each model × benchmark cell:
{
  "model": "claude-opus-4-7",
  "latency_ms": 1240,
  "tokens_in": 1280,
  "tokens_out": 78,
  "cost_usd": 0.022,
  "verdict": "win",
  "response": "...",
  "_provenance": {
    "endpoint_version": "2026-05-08",
    "pricing_snapshot": "2026-05-09T14:23Z",
    "median_of": 3
  }
}
The _provenance block is stored in the JSON but not rendered on the public page — it exists so future agents can re-validate that the numbers were captured under known conditions.
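For example, a later agent could use _provenance to flag artifacts whose cost column was computed from an older pricing snapshot. This sketch assumes results are stored one JSON file per model × benchmark cell under a hypothetical results/ directory:

```python
import json
from pathlib import Path

def stale_cost_rows(current_snapshot: str, results_dir: str = "results") -> list[str]:
    """List artifacts whose cost was computed from an older pricing snapshot."""
    stale = []
    for path in Path(results_dir).glob("*.json"):
        artifact = json.loads(path.read_text())
        prov = artifact.get("_provenance", {})
        if prov.get("pricing_snapshot") != current_snapshot:
            stale.append(path.name)   # cost column needs a re-run
    return stale
```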
When to run
- Any frontier model release (Anthropic, OpenAI, Google, xAI, Mistral, DeepSeek, Qwen, Kimi, GLM, MiniMax)
- Pricing changes (re-run cost columns only)
- Benchmark prompt updates (must re-run all rows for that benchmark; see the scope sketch below)
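A sketch of how those triggers might map to re-run scope; the trigger labels and return shape are illustrative, not part of the agent spec:

```python
def rerun_scope(trigger: str, benchmark: str | None = None) -> dict:
    """Map a trigger to what bench-runner should re-execute."""
    if trigger == "model_release":
        return {"benchmarks": "all", "columns": "all"}        # full replay for the new model
    if trigger == "pricing_change":
        return {"benchmarks": "all", "columns": ["cost_usd"]} # recompute cost only, no API calls
    if trigger == "prompt_update":
        return {"benchmarks": [benchmark], "columns": "all"}  # re-run every row of that benchmark
    raise ValueError(f"unknown trigger: {trigger}")
```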