Builder Daily

Benchmarks

2026-05-09 · benchmark

Fix off-by-one in cursor pagination — agentic code edit head-to-head

Prompt
Here is a Node.js handler for cursor pagination:

```js
function paginate(items, cursor, limit = 20) {
  const start = cursor ? items.findIndex(i => i.id === cursor) : 0;
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}
```

It has a subtle off-by-one bug: when called with a cursor, the first item returned is the cursor itself (it should be the item *after* the cursor). Fix the function. Preserve the same signature and return shape. Add a single inline comment only where the bug was. Return only the corrected function, no preamble.
| Model | Latency | Cost | Verdict |
| --- | --- | --- | --- |
| claude-opus-4-7 | 2240ms | $0.011 | Win |
| gpt-5 | 1890ms | $0.009 | Win |
| gemini-3-pro | 2050ms | $0.007 | Tie |
| qwen3.6-35b-a3b-nvfp4 | 920ms | $0.001 | Win |

Responses

claude-opus-4-7

```js
function paginate(items, cursor, limit = 20) {
  const start = cursor ? items.findIndex(i => i.id === cursor) + 1 : 0; // bug fix: skip cursor itself
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}
```

gpt-5

```js
function paginate(items, cursor, limit = 20) {
  // bug fix: cursor item itself was being returned; advance past it
  const start = cursor ? items.findIndex(i => i.id === cursor) + 1 : 0;
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}
```

gemini-3-pro

```js
function paginate(items, cursor, limit = 20) {
  let start = 0;
  if (cursor) {
    const idx = items.findIndex(i => i.id === cursor);
    if (idx >= 0) start = idx + 1; // bug fix: cursor was inclusive
  }
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}
```

qwen3.6-35b-a3b-nvfp4

```js
function paginate(items, cursor, limit = 20) {
  const start = cursor ? items.findIndex(i => i.id === cursor) + 1 : 0; // off-by-one fix
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}
```

Tests whether the model spots the off-by-one (start should be findIndex(...) + 1) without breaking the cursor = null branch. Latency measured client-side. Cost computed from public May 2026 pricing. All models received the exact same prompt.
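
The fix is easy to verify mechanically. A minimal harness sketch — the four-item list and two-item pages are made up for illustration, not part of the actual grading:

```js
// The winning fix under test: advance start past the cursor item.
function paginate(items, cursor, limit = 20) {
  const start = cursor ? items.findIndex(i => i.id === cursor) + 1 : 0;
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}

const items = [{ id: "a" }, { id: "b" }, { id: "c" }, { id: "d" }];

// No cursor: the first page starts at the beginning.
const page1 = paginate(items, null, 2);
console.log(page1.data.map(i => i.id), page1.next); // [ 'a', 'b' ] 'c'

// With a cursor: the cursor item itself must be excluded.
const page2 = paginate(items, page1.next, 2);
console.log(page2.data.map(i => i.id), page2.next); // [ 'd' ] null
```

The original buggy version fails the second call: it returns 'c' again as the first item of page two.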

2026-05-09 · benchmark

JSON extraction from messy meeting notes — schema fidelity test

Prompt
Extract the following from these meeting notes into strict JSON matching this exact schema:

```json
{
  "date": "YYYY-MM-DD",
  "attendees": ["name1", "name2"],
  "action_items": [{"owner": "name", "task": "...", "due": "YYYY-MM-DD or null"}]
}
```

Notes:
---
Met today (it's the 9th of May 2026, finally Friday) — me, Sarah Chen, Raj P, and Tomás joined late from his car. Sarah said she'd ship the auth migration by next Wed (so 13th?). Raj is still chasing the SSO config from the vendor — no firm date, said "sometime next week". Tomás owes a draft of the launch post but said no rush. I'll handle the customer interview write-up by EOD Monday.
---

Return ONLY the JSON. No preamble, no markdown fences, no trailing comments.
| Model | Latency | Cost | Verdict |
| --- | --- | --- | --- |
| claude-opus-4-7 | 1620ms | $0.014 | Win |
| gpt-5 | 1340ms | $0.012 | Tie |
| gemini-3-pro | 1560ms | $0.008 | Loss |
| qwen3.6-35b-a3b-nvfp4 | 720ms | $0.001 | Tie |

Responses

claude-opus-4-7

{"date":"2026-05-09","attendees":["Sarah Chen","Raj P","Tomás"],"action_items":[{"owner":"Sarah Chen","task":"Ship the auth migration","due":"2026-05-13"},{"owner":"Raj P","task":"Chase SSO config from vendor","due":null},{"owner":"Tomás","task":"Draft of the launch post","due":null},{"owner":"me","task":"Customer interview write-up","due":"2026-05-11"}]}

gpt-5

{"date":"2026-05-09","attendees":["Sarah Chen","Raj P","Tomás"],"action_items":[{"owner":"Sarah Chen","task":"Ship the auth migration","due":"2026-05-13"},{"owner":"Raj P","task":"Chase SSO config from vendor","due":null},{"owner":"Tomás","task":"Draft launch post","due":null},{"owner":"narrator","task":"Customer interview write-up","due":"2026-05-12"}]}

gemini-3-pro

```json
{"date":"2026-05-09","attendees":["Sarah Chen","Raj P","Tomás"],"action_items":[{"owner":"Sarah Chen","task":"ship auth migration","due":"2026-05-13"},{"owner":"Raj P","task":"SSO config from vendor","due":null}]}
```

qwen3.6-35b-a3b-nvfp4

{"date":"2026-05-09","attendees":["Sarah Chen","Raj P","Tomás"],"action_items":[{"owner":"Sarah Chen","task":"Ship auth migration","due":"2026-05-13"},{"owner":"Raj P","task":"Chase SSO config from vendor","due":null},{"owner":"Tomás","task":"Draft launch post","due":null},{"owner":"author","task":"Customer interview write-up","due":"2026-05-12"}]}

Tests: date inference (today = 2026-05-09), name fidelity (leave names as written — do not expand "Raj P" to a guessed full name), null due dates ("sometime next week" and "no rush" both map to null), and adherence to the no-prose constraint. Responses wrapped in markdown fences are rejected outright.
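
Grading this one is mechanical. A minimal sketch of a checker implementing the rules above — the function name and failure strings are illustrative, not from any real harness:

```js
// Validate one raw model response against the schema and the no-prose rule.
function gradeExtraction(raw) {
  const problems = [];
  if (/`{3}/.test(raw)) problems.push("markdown fence"); // hard reject per the prompt
  let doc;
  try {
    doc = JSON.parse(raw.trim());
  } catch {
    return { ok: false, problems: [...problems, "not valid JSON"] };
  }
  const isDate = s => /^\d{4}-\d{2}-\d{2}$/.test(s ?? "");
  if (!isDate(doc.date)) problems.push("bad date");
  if (!Array.isArray(doc.attendees)) problems.push("attendees not an array");
  for (const item of doc.action_items ?? []) {
    if (item.due !== null && !isDate(item.due)) problems.push(`bad due for ${item.owner}`);
  }
  return { ok: problems.length === 0, problems };
}
```

Under this check gemini-3-pro's fenced response fails before its JSON is even inspected.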

2026-05-09 · benchmark

Needle in a haystack at 500K tokens — long-context recall

Prompt
You will receive a 500,000-token document containing the full text of 12 NeurIPS papers concatenated. Buried at character offset 1,847,392 is a single sentence: 'The secret access code for the May 2026 builder-daily benchmark is QUARTZ-7392-DELTA.'

Question: What is the secret access code? Return only the code itself, nothing else.

Document follows below the marker.

--- DOCUMENT START ---
[~500,000 tokens of NeurIPS paper text]
--- DOCUMENT END ---
| Model | Latency | Cost | Verdict |
| --- | --- | --- | --- |
| claude-opus-4-7 | 18420ms | $7.535 | Win |
| gpt-5 | 22180ms | $6.279 | Win |
| gemini-3-pro | 14620ms | $0.628 | Win |
| qwen3.6-35b-a3b-nvfp4 (262K cap) | 0ms | $0.000 | Error |

Responses

claude-opus-4-7

QUARTZ-7392-DELTA

gpt-5

QUARTZ-7392-DELTA

gemini-3-pro

QUARTZ-7392-DELTA

qwen3.6-35b-a3b-nvfp4 (262K cap)

(no output — the 500K-token input exceeds the model's 262K context cap, so the request errored out)

Pure recall test at the 70% depth point of 500K input. Latency includes prompt processing (which is dominant at this scale). Cost based on full 500K tokens in. Verdict 'win' = exact match 'QUARTZ-7392-DELTA'. Models tested at vendor-claimed max context.
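
Constructing the haystack is the only fiddly part. A minimal sketch of needle placement at ~70% depth — filler text stands in for the paper corpus, and the sentence-boundary snapping is my assumption, not necessarily how the harness does it:

```js
const NEEDLE =
  "The secret access code for the May 2026 builder-daily benchmark is QUARTZ-7392-DELTA.";

// Insert the needle near a target fractional depth of a large document.
function buildHaystack(corpus, needle, depth = 0.7) {
  const offset = Math.floor(corpus.length * depth);
  // Snap forward to a sentence boundary so the needle reads as ordinary prose.
  const boundary = corpus.indexOf(". ", offset);
  const at = boundary === -1 ? offset : boundary + 2;
  return corpus.slice(0, at) + needle + " " + corpus.slice(at);
}

// Filler standing in for ~500K tokens of NeurIPS paper text.
const corpus = "Lorem ipsum dolor sit amet consectetur. ".repeat(70000);
const doc = buildHaystack(corpus, NEEDLE);
console.log(doc.includes("QUARTZ-7392-DELTA")); // true
```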

2026-05-09 · benchmark

Pick the right tool — multi-step task with 6 available functions

Prompt
You have these tools available: `web_search(query)`, `fetch_url(url)`, `read_file(path)`, `run_shell(cmd)`, `query_database(sql)`, `send_email(to, subject, body)`.

Task: Find the latest published date of Anthropic's Claude 4.7 release announcement, fetch the full text of that page, count how many times the word 'tool' appears, and email the count to ops@example.com with subject 'Claude 4.7 tool count'.

Respond with ONLY a JSON array of tool calls in execution order: [{"tool": "...", "args": {...}}, ...]. No prose, no markdown, no other text.
| Model | Latency | Cost | Verdict |
| --- | --- | --- | --- |
| claude-opus-4-7 | 1480ms | $0.013 | Win |
| gpt-5 | 1320ms | $0.011 | Tie |
| gemini-3-pro | 1810ms | $0.009 | Loss |
| kimi-k2.6 | 980ms | $0.001 | Win |

Responses

claude-opus-4-7

[
  {"tool": "web_search", "args": {"query": "Anthropic Claude 4.7 release announcement date site:anthropic.com"}},
  {"tool": "fetch_url", "args": {"url": "<url-from-prior-result>"}},
  {"tool": "send_email", "args": {"to": "ops@example.com", "subject": "Claude 4.7 tool count", "body": "<count>"}}
]

gpt-5

[
  {"tool": "web_search", "args": {"query": "Claude 4.7 release announcement Anthropic"}},
  {"tool": "fetch_url", "args": {"url": "<top result url>"}},
  {"tool": "run_shell", "args": {"cmd": "echo $TEXT | grep -o tool | wc -l"}},
  {"tool": "send_email", "args": {"to": "ops@example.com", "subject": "Claude 4.7 tool count", "body": "<count>"}}
]

gemini-3-pro

I'll help you with that. Here's the plan: [
  {"tool": "web_search", "args": {"query": "Claude 4.7"}},
  {"tool": "web_search", "args": {"query": "Anthropic announcement"}},
  {"tool": "fetch_url", "args": {"url": "..."}},
  {"tool": "send_email", "args": {"to": "ops@example.com", "subject": "Result", "body": "..."}}
]

kimi-k2.6

[
  {"tool": "web_search", "args": {"query": "Anthropic Claude 4.7 release announcement"}},
  {"tool": "fetch_url", "args": {"url": "<from search>"}},
  {"tool": "send_email", "args": {"to": "ops@example.com", "subject": "Claude 4.7 tool count", "body": "<count>"}}
]

Tests whether the model picks the right tool calls in the right order: web_search → fetch_url → (count locally, no tool needed) → send_email — three tools across four steps. Penalized: extra tools, wrong order, a missing send_email, and any prose preamble.
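
The pass/fail logic can be scripted. A minimal sketch under the rules above — treating run_shell for the count as tolerable is a judgment call on my part, and the function name is illustrative:

```js
// Score a raw plan against: web_search -> fetch_url -> send_email, in order.
// The word count can happen locally, so no counting tool is required.
function scorePlan(raw) {
  if (!raw.trim().startsWith("[")) return { pass: false, reason: "prose preamble" };
  let calls;
  try {
    calls = JSON.parse(raw);
  } catch {
    return { pass: false, reason: "not valid JSON" };
  }
  const seq = calls.map(c => c.tool);
  let cursor = -1;
  for (const tool of ["web_search", "fetch_url", "send_email"]) {
    const idx = seq.indexOf(tool, cursor + 1);
    if (idx === -1) return { pass: false, reason: `missing or out-of-order ${tool}` };
    cursor = idx;
  }
  // Tools with no business here are penalized; run_shell for counting is tolerated.
  const stray = seq.filter(t => ["query_database", "read_file"].includes(t));
  if (stray.length) return { pass: false, reason: `unneeded tools: ${stray.join(", ")}` };
  return { pass: true, reason: "ok" };
}
```

gemini-3-pro's prose preamble fails the very first check, before the call sequence is even examined.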

2026-05-02 · benchmark

Summarize 1000-word post into 3 bullets — frontier head-to-head

Prompt
Summarize this 1000-word blog post about Anthropic's Claude 4.7 release into exactly 3 bullets, each ≤ 25 words. Return only the bullets, no preamble.
| Model | Latency | Cost | Verdict |
| --- | --- | --- | --- |
| claude-opus-4-7 | 1240ms | $0.022 | Win |
| gpt-5 | 980ms | $0.018 | Tie |
| gemini-3-pro | 1120ms | $0.014 | Loss |

Responses

claude-opus-4-7

• Anthropic shipped Claude 4.7 with 1M-token context now extended to the Opus tier.
• Tool-use determinism improved on high-fan-out parallel calls; cache TTL stays at 5 min.
• Early users report ~12% p95 latency drop on Sonnet/Haiku — Opus latency improvement claimed but unverified.

gpt-5

• Claude 4.7 released by Anthropic, headline: 1M context arrives on the Opus tier.
• Anthropic claims ~12% p95 latency improvement across the model family.
• Tool-call behavior more deterministic in parallel/high-fan-out scenarios.

gemini-3-pro

• Anthropic released Claude 4.7 today with several updates.
• The 1M context window is now available on Opus tier in addition to Sonnet and Haiku, which is a major change for long-context agentic workflows.
• Latency was reduced by approximately 12% according to Anthropic, and tool-use behavior was improved.

Same prompt, same input. Latency measured from API call start to last byte. Cost from official $/Mtok pricing (May 2026).
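
The format constraints are trivially checkable. A minimal sketch — the accepted bullet markers and the whitespace word-splitting rule are my assumptions:

```js
// Check "exactly 3 bullets, each <= 25 words, no preamble".
function checkBullets(raw) {
  const lines = raw.trim().split("\n").map(l => l.trim()).filter(Boolean);
  if (lines.length !== 3) {
    return { ok: false, reason: `expected 3 bullets, got ${lines.length} lines` };
  }
  for (const line of lines) {
    if (!/^[•*-]\s/.test(line)) return { ok: false, reason: "non-bullet line (preamble?)" };
    const words = line.replace(/^[•*-]\s+/, "").split(/\s+/).length;
    if (words > 25) return { ok: false, reason: `bullet runs ${words} words` };
  }
  return { ok: true, reason: "ok" };
}
```

Note that under whitespace splitting, hyphenated terms like "long-context" count as one word.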

Tip