bench-runner — replays the benchmark suite when new models drop
Re-runs every benchmark JSON in /benchmarks/ against new model endpoints, captures latency, cost, and output, and opens a PR with diffs against the prior leader.
cp .claude/agents/bench-runner.md ~/.claude/agents/
What this agent does
When a new frontier model ships, bench-runner replays every existing benchmark in /benchmarks/ against the new endpoint, captures structured results, and opens a PR with diff annotations so the operator can see what changed without manual transcription.
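A minimal sketch of that replay loop, with the provider call abstracted behind a caller-supplied function since the actual request code depends on each endpoint; the paths and names here are illustrative, not the agent's real interface:

```python
import json
from pathlib import Path
from typing import Callable

def replay_suite(model: str, run_one: Callable[[str, dict], dict],
                 bench_dir: str = "benchmarks") -> list[dict]:
    """Replay every benchmark JSON in bench_dir against one model endpoint.

    run_one is whatever callable actually hits the provider API and returns
    the per-run artifact (latency, tokens, cost, verdict, response).
    """
    results = []
    for path in sorted(Path(bench_dir).glob("*.json")):
        bench = json.loads(path.read_text())
        artifact = run_one(model, bench)      # one model x benchmark cell
        artifact["benchmark"] = path.stem
        results.append(artifact)
    return results
```

The PR-opening and diff-annotation step is deliberately left out of the sketch; it sits downstream of this loop and depends on the repository's tooling.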
How it ensures fairness
- Same prompt, byte-identical. No paraphrasing or “tuning” per provider.
- Same input. Long-context benchmarks load the same source document.
- Latency measured client-side, from API call start to last byte received. Network jitter normalized by running 3× and taking median.
- Cost computed from public $/Mtok pricing at run time, captured in a pricing.json snapshot for reproducibility (a sketch of the timing and cost math follows this list).
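A sketch of how the median-of-3 timing and the cost calculation could look, assuming a blocking call_model function (so elapsed time ends when the last byte is received) and a pricing.json keyed by model with input_per_mtok / output_per_mtok fields; both of those names are assumptions, not the agent's actual schema:

```python
import json
import time

def timed_call(call_model, model: str, prompt: str, runs: int = 3) -> dict:
    """Send the same byte-identical prompt `runs` times; keep the median latency."""
    samples = []
    for _ in range(runs):
        start = time.monotonic()
        response = call_model(model, prompt)   # hypothetical blocking provider call
        samples.append((time.monotonic() - start, response))
    samples.sort(key=lambda s: s[0])
    latency_s, response = samples[len(samples) // 2]   # median of 3
    return {"latency_ms": round(latency_s * 1000), "response": response}

def cost_usd(model: str, tokens_in: int, tokens_out: int,
             pricing_path: str = "pricing.json") -> float:
    """Cost from the pricing snapshot captured at run time ($ per Mtok)."""
    with open(pricing_path) as f:
        pricing = json.load(f)[model]          # assumed per-model price entries
    return (tokens_in * pricing["input_per_mtok"]
            + tokens_out * pricing["output_per_mtok"]) / 1_000_000
```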
Per-run artifacts
For each model × benchmark cell:
{
  "model": "claude-opus-4-7",
  "latency_ms": 1240,
  "tokens_in": 1280,
  "tokens_out": 78,
  "cost_usd": 0.022,
  "verdict": "win",
  "response": "...",
  "_provenance": {
    "endpoint_version": "2026-05-08",
    "pricing_snapshot": "2026-05-09T14:23Z",
    "median_of": 3
  }
}
The _provenance block is stored in the JSON but not rendered on the public page — it exists so future agents can re-validate that the numbers were captured under known conditions.
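For example, a later agent could use _provenance to flag artifacts whose cost column was computed from an older pricing snapshot. This sketch assumes results are stored one JSON file per model × benchmark cell under a hypothetical results/ directory:

```python
import json
from pathlib import Path

def stale_cost_rows(current_snapshot: str, results_dir: str = "results") -> list[str]:
    """List artifacts whose cost was computed from an older pricing snapshot."""
    stale = []
    for path in Path(results_dir).glob("*.json"):
        artifact = json.loads(path.read_text())
        prov = artifact.get("_provenance", {})
        if prov.get("pricing_snapshot") != current_snapshot:
            stale.append(path.name)   # cost column needs a re-run
    return stale
```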
When to run
- Any frontier model release (Anthropic, OpenAI, Google, xAI, Mistral, DeepSeek, Qwen, Kimi, GLM, MiniMax)
- Pricing changes (re-run cost columns only)
- Benchmark prompt updates (must re-run all rows for that benchmark; see the scope sketch below)
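A sketch of how those triggers might map to re-run scope; the trigger labels and return shape are illustrative, not part of the agent spec:

```python
def rerun_scope(trigger: str, benchmark: str | None = None) -> dict:
    """Map a trigger to what bench-runner should re-execute."""
    if trigger == "model_release":
        return {"benchmarks": "all", "columns": "all"}        # full replay for the new model
    if trigger == "pricing_change":
        return {"benchmarks": "all", "columns": ["cost_usd"]} # recompute cost only, no API calls
    if trigger == "prompt_update":
        return {"benchmarks": [benchmark], "columns": "all"}  # re-run every row of that benchmark
    raise ValueError(f"unknown trigger: {trigger}")
```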