Builder Daily

bench-runner — replays the benchmark suite when new models drop

Re-runs every benchmark JSON in /benchmarks/ against new model endpoints, captures latency/cost/output, and opens a PR with diffs vs the prior leader.

cp .claude/agents/bench-runner.md ~/.claude/agents/

What this agent does

When a new frontier model ships, bench-runner replays every existing benchmark in /benchmarks/ against the new endpoint, captures structured results, and opens a PR with diff annotations so the operator can see what changed without manual transcription.
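A minimal sketch of that replay loop, assuming each benchmark file carries a single "prompt" field, a hypothetical call_model client, and a hand-rolled pricing table; the real agent uses whatever endpoint SDK and pricing snapshot the operator has configured.

import json
import time
from pathlib import Path

# Assumed per-token prices in USD; the agent reads these from a dated pricing snapshot.
PRICING = {"claude-opus-4-7": {"in": 15e-6, "out": 75e-6}}

def call_model(model: str, prompt: str) -> dict:
    # Placeholder for the real endpoint call; expected to return
    # {"text": ..., "tokens_in": ..., "tokens_out": ...}.
    raise NotImplementedError

def replay_benchmark(model: str, bench_path: Path) -> dict:
    bench = json.loads(bench_path.read_text())  # assumed shape: {"prompt": "..."}
    start = time.monotonic()
    reply = call_model(model, bench["prompt"])
    latency_ms = int((time.monotonic() - start) * 1000)
    price = PRICING[model]
    cost = reply["tokens_in"] * price["in"] + reply["tokens_out"] * price["out"]
    return {
        "model": model,
        "benchmark": bench_path.stem,
        "latency_ms": latency_ms,
        "tokens_in": reply["tokens_in"],
        "tokens_out": reply["tokens_out"],
        "cost_usd": round(cost, 4),
        "response": reply["text"],
    }

def replay_all(model: str, bench_dir: str = "benchmarks") -> list[dict]:
    # One raw result per benchmark file; the PR step diffs these against the prior leader.
    return [replay_benchmark(model, p) for p in sorted(Path(bench_dir).glob("*.json"))]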

How it ensures fairness

Per-run artifacts

For each model × benchmark cell:

{
  "model": "claude-opus-4-7",
  "latency_ms": 1240,
  "tokens_in": 1280,
  "tokens_out": 78,
  "cost_usd": 0.022,
  "verdict": "win",
  "response": "...",
  "_provenance": {
    "endpoint_version": "2026-05-08",
    "pricing_snapshot": "2026-05-09T14:23Z",
    "median_of": 3
  }
}
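A sketch of how one cell could be assembled from its raw runs. The assemble_cell name, the samples shape, and the passed-in endpoint version are assumptions; the median-of-3 and _provenance fields mirror the artifact above, and verdict is left for the later diff against the prior leader.

from datetime import datetime, timezone
from statistics import median

def assemble_cell(samples: list[dict], endpoint_version: str) -> dict:
    # samples: one dict per run in the shape shown above, without _provenance.
    samples = sorted(samples, key=lambda s: s["latency_ms"])
    cell = dict(samples[len(samples) // 2])  # keep the median-latency run's output
    cell["latency_ms"] = int(median(s["latency_ms"] for s in samples))
    cell["_provenance"] = {
        "endpoint_version": endpoint_version,  # as reported by the provider
        "pricing_snapshot": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%MZ"),
        "median_of": len(samples),
    }
    return cell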

The _provenance block is stored in the JSON but not rendered on the public page — it exists so future agents can re-validate that the numbers were captured under known conditions.
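Keeping the block off the public page, and letting a later agent re-check it, can both stay small; a sketch assuming cells are plain dicts like the one above.

def public_view(cell: dict) -> dict:
    # Drop machine-only keys (anything underscore-prefixed) before rendering.
    return {k: v for k, v in cell.items() if not k.startswith("_")}

def still_valid(cell: dict, current_endpoint_version: str) -> bool:
    # A future agent can confirm the numbers were captured against the same endpoint build.
    return cell.get("_provenance", {}).get("endpoint_version") == current_endpoint_version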

When to run

Tip