Builder Daily

Benchmarks

2026-05-09 · benchmark

Fix off-by-one in cursor pagination — agentic code edit head-to-head

Prompt
Here is a Node.js handler for cursor pagination:

```js
function paginate(items, cursor, limit = 20) {
  const start = cursor ? items.findIndex(i => i.id === cursor) : 0;
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}
```

It has a subtle off-by-one bug: when called with a cursor, the first item returned is the cursor itself (it should be the item *after* the cursor). Fix the function. Preserve the same signature and return shape. Add a single inline comment only where the bug was. Return only the corrected function, no preamble.
| Model | Latency | Cost | Verdict |
| --- | --- | --- | --- |
| claude-opus-4-7 | 2240ms | $0.011 | Win |
| gpt-5 | 1890ms | $0.009 | Win |
| gemini-3-pro | 2050ms | $0.007 | Tie |
| qwen3.6-35b-a3b-nvfp4 | 920ms | $0.001 | Win |

Responses

claude-opus-4-7

```js
function paginate(items, cursor, limit = 20) {
  const start = cursor ? items.findIndex(i => i.id === cursor) + 1 : 0; // bug fix: skip cursor itself
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}
```

gpt-5

```js
function paginate(items, cursor, limit = 20) {
  // bug fix: cursor item itself was being returned; advance past it
  const start = cursor ? items.findIndex(i => i.id === cursor) + 1 : 0;
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}
```

gemini-3-pro

```js
function paginate(items, cursor, limit = 20) {
  let start = 0;
  if (cursor) {
    const idx = items.findIndex(i => i.id === cursor);
    if (idx >= 0) start = idx + 1; // bug fix: cursor was inclusive
  }
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}
```

qwen3.6-35b-a3b-nvfp4

```js
function paginate(items, cursor, limit = 20) {
  const start = cursor ? items.findIndex(i => i.id === cursor) + 1 : 0; // off-by-one fix
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}
```

Tests whether the model spots the off-by-one (start should be findIndex(...) + 1) without breaking the cursor = null branch. Latency measured client-side. Cost computed from public May 2026 pricing. All models received the exact same prompt.
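
The fix is easy to verify mechanically. A minimal harness sketch — the four-item list and two-item pages are made up for illustration, not part of the actual grading:

```js
// The winning fix under test: advance start past the cursor item.
function paginate(items, cursor, limit = 20) {
  const start = cursor ? items.findIndex(i => i.id === cursor) + 1 : 0;
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}

const items = [{ id: "a" }, { id: "b" }, { id: "c" }, { id: "d" }];

// No cursor: the first page starts at the beginning.
const page1 = paginate(items, null, 2);
console.log(page1.data.map(i => i.id), page1.next); // [ 'a', 'b' ] 'c'

// With a cursor: the cursor item itself must be excluded.
const page2 = paginate(items, page1.next, 2);
console.log(page2.data.map(i => i.id), page2.next); // [ 'd' ] null
```

The original buggy version fails the second call: it returns 'c' again as the first item of page two.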

2026-05-09 · benchmark

JSON extraction from messy meeting notes — schema fidelity test

Prompt
Extract the following from these meeting notes into strict JSON matching this exact schema:

```json
{
  "date": "YYYY-MM-DD",
  "attendees": ["name1", "name2"],
  "action_items": [{"owner": "name", "task": "...", "due": "YYYY-MM-DD or null"}]
}
```

Notes:
---
Met today (it's the 9th of May 2026, finally Friday) — me, Sarah Chen, Raj P, and Tomás joined late from his car. Sarah said she'd ship the auth migration by next Wed (so 13th?). Raj is still chasing the SSO config from the vendor — no firm date, said "sometime next week". Tomás owes a draft of the launch post but said no rush. I'll handle the customer interview write-up by EOD Monday.
---

Return ONLY the JSON. No preamble, no markdown fences, no trailing comments.
| Model | Latency | Cost | Verdict |
| --- | --- | --- | --- |
| claude-opus-4-7 | 1620ms | $0.014 | Win |
| gpt-5 | 1340ms | $0.012 | Tie |
| gemini-3-pro | 1560ms | $0.008 | Loss |
| qwen3.6-35b-a3b-nvfp4 | 720ms | $0.001 | Tie |

Responses

claude-opus-4-7

{"date":"2026-05-09","attendees":["Sarah Chen","Raj P","Tomás"],"action_items":[{"owner":"Sarah Chen","task":"Ship the auth migration","due":"2026-05-13"},{"owner":"Raj P","task":"Chase SSO config from vendor","due":null},{"owner":"Tomás","task":"Draft of the launch post","due":null},{"owner":"me","task":"Customer interview write-up","due":"2026-05-11"}]}

gpt-5

{"date":"2026-05-09","attendees":["Sarah Chen","Raj P","Tomás"],"action_items":[{"owner":"Sarah Chen","task":"Ship the auth migration","due":"2026-05-13"},{"owner":"Raj P","task":"Chase SSO config from vendor","due":null},{"owner":"Tomás","task":"Draft launch post","due":null},{"owner":"narrator","task":"Customer interview write-up","due":"2026-05-12"}]}

gemini-3-pro

```json
{"date":"2026-05-09","attendees":["Sarah Chen","Raj P","Tomás"],"action_items":[{"owner":"Sarah Chen","task":"ship auth migration","due":"2026-05-13"},{"owner":"Raj P","task":"SSO config from vendor","due":null}]}
```

qwen3.6-35b-a3b-nvfp4

{"date":"2026-05-09","attendees":["Sarah Chen","Raj P","Tomás"],"action_items":[{"owner":"Sarah Chen","task":"Ship auth migration","due":"2026-05-13"},{"owner":"Raj P","task":"Chase SSO config from vendor","due":null},{"owner":"Tomás","task":"Draft launch post","due":null},{"owner":"author","task":"Customer interview write-up","due":"2026-05-12"}]}

Tests: date inference (today = 2026-05-09), name fidelity (leave names as written — do not expand "Raj P" to a guessed full name), null due dates ("sometime next week" and "no rush" both map to null), and adherence to the no-prose constraint. Responses wrapped in markdown fences are rejected outright.
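
Grading this one is mechanical. A minimal sketch of a checker implementing the rules above — the function name and failure strings are illustrative, not from any real harness:

```js
// Validate one raw model response against the schema and the no-prose rule.
function gradeExtraction(raw) {
  const problems = [];
  if (/`{3}/.test(raw)) problems.push("markdown fence"); // hard reject per the prompt
  let doc;
  try {
    doc = JSON.parse(raw.trim());
  } catch {
    return { ok: false, problems: [...problems, "not valid JSON"] };
  }
  const isDate = s => /^\d{4}-\d{2}-\d{2}$/.test(s ?? "");
  if (!isDate(doc.date)) problems.push("bad date");
  if (!Array.isArray(doc.attendees)) problems.push("attendees not an array");
  for (const item of doc.action_items ?? []) {
    if (item.due !== null && !isDate(item.due)) problems.push(`bad due for ${item.owner}`);
  }
  return { ok: problems.length === 0, problems };
}
```

Under this check gemini-3-pro's fenced response fails before its JSON is even inspected.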

2026-05-09 · benchmark

Needle in a haystack at 500K tokens — long-context recall

Prompt
You will receive a 500,000-token document containing the full text of 12 NeurIPS papers concatenated. Buried at character offset 1,847,392 is a single sentence: 'The secret access code for the May 2026 builder-daily benchmark is QUARTZ-7392-DELTA.'

Question: What is the secret access code? Return only the code itself, nothing else.

Document follows below the marker.

--- DOCUMENT START ---
[~500,000 tokens of NeurIPS paper text]
--- DOCUMENT END ---
| Model | Latency | Cost | Verdict |
| --- | --- | --- | --- |
| claude-opus-4-7 | 18420ms | $7.535 | Win |
| gpt-5 | 22180ms | $6.279 | Win |
| gemini-3-pro | 14620ms | $0.628 | Win |
| qwen3.6-35b-a3b-nvfp4 (262K cap) | 0ms | $0.000 | Error |

Responses

claude-opus-4-7

QUARTZ-7392-DELTA

gpt-5

QUARTZ-7392-DELTA

gemini-3-pro

QUARTZ-7392-DELTA

qwen3.6-35b-a3b-nvfp4 (262K cap)

(no output — the 500K-token input exceeds the model's 262K context cap, so the request errored out)

Pure recall test at the 70% depth point of 500K input. Latency includes prompt processing (which is dominant at this scale). Cost based on full 500K tokens in. Verdict 'win' = exact match 'QUARTZ-7392-DELTA'. Models tested at vendor-claimed max context.
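
Constructing the haystack is the only fiddly part. A minimal sketch of needle placement at ~70% depth — filler text stands in for the paper corpus, and the sentence-boundary snapping is my assumption, not necessarily how the harness does it:

```js
const NEEDLE =
  "The secret access code for the May 2026 builder-daily benchmark is QUARTZ-7392-DELTA.";

// Insert the needle near a target fractional depth of a large document.
function buildHaystack(corpus, needle, depth = 0.7) {
  const offset = Math.floor(corpus.length * depth);
  // Snap forward to a sentence boundary so the needle reads as ordinary prose.
  const boundary = corpus.indexOf(". ", offset);
  const at = boundary === -1 ? offset : boundary + 2;
  return corpus.slice(0, at) + needle + " " + corpus.slice(at);
}

// Filler standing in for ~500K tokens of NeurIPS paper text.
const corpus = "Lorem ipsum dolor sit amet consectetur. ".repeat(70000);
const doc = buildHaystack(corpus, NEEDLE);
console.log(doc.includes("QUARTZ-7392-DELTA")); // true
```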

2026-05-09 · benchmark

Pick the right tool — multi-step task with 6 available functions

Prompt
You have these tools available: `web_search(query)`, `fetch_url(url)`, `read_file(path)`, `run_shell(cmd)`, `query_database(sql)`, `send_email(to, subject, body)`.

Task: Find the latest published date of Anthropic's Claude 4.7 release announcement, fetch the full text of that page, count how many times the word 'tool' appears, and email the count to ops@example.com with subject 'Claude 4.7 tool count'.

Respond with ONLY a JSON array of tool calls in execution order: [{"tool": "...", "args": {...}}, ...]. No prose, no markdown, no other text.
| Model | Latency | Cost | Verdict |
| --- | --- | --- | --- |
| claude-opus-4-7 | 1480ms | $0.013 | Win |
| gpt-5 | 1320ms | $0.011 | Tie |
| gemini-3-pro | 1810ms | $0.009 | Loss |
| kimi-k2.6 | 980ms | $0.001 | Win |

Responses

claude-opus-4-7

[
  {"tool": "web_search", "args": {"query": "Anthropic Claude 4.7 release announcement date site:anthropic.com"}},
  {"tool": "fetch_url", "args": {"url": "<url-from-prior-result>"}},
  {"tool": "send_email", "args": {"to": "ops@example.com", "subject": "Claude 4.7 tool count", "body": "<count>"}}
]

gpt-5

[
  {"tool": "web_search", "args": {"query": "Claude 4.7 release announcement Anthropic"}},
  {"tool": "fetch_url", "args": {"url": "<top result url>"}},
  {"tool": "run_shell", "args": {"cmd": "echo $TEXT | grep -o tool | wc -l"}},
  {"tool": "send_email", "args": {"to": "ops@example.com", "subject": "Claude 4.7 tool count", "body": "<count>"}}
]

gemini-3-pro

I'll help you with that. Here's the plan: [
  {"tool": "web_search", "args": {"query": "Claude 4.7"}},
  {"tool": "web_search", "args": {"query": "Anthropic announcement"}},
  {"tool": "fetch_url", "args": {"url": "..."}},
  {"tool": "send_email", "args": {"to": "ops@example.com", "subject": "Result", "body": "..."}}
]

kimi-k2.6

[
  {"tool": "web_search", "args": {"query": "Anthropic Claude 4.7 release announcement"}},
  {"tool": "fetch_url", "args": {"url": "<from search>"}},
  {"tool": "send_email", "args": {"to": "ops@example.com", "subject": "Claude 4.7 tool count", "body": "<count>"}}
]

Tests whether the model picks the right tool calls in the right order: web_search → fetch_url → (count locally, no tool needed) → send_email — three tools across four steps. Penalized: extra tools, wrong order, a missing send_email, and any prose preamble.
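
The pass/fail logic can be scripted. A minimal sketch under the rules above — treating run_shell for the count as tolerable is a judgment call on my part, and the function name is illustrative:

```js
// Score a raw plan against: web_search -> fetch_url -> send_email, in order.
// The word count can happen locally, so no counting tool is required.
function scorePlan(raw) {
  if (!raw.trim().startsWith("[")) return { pass: false, reason: "prose preamble" };
  let calls;
  try {
    calls = JSON.parse(raw);
  } catch {
    return { pass: false, reason: "not valid JSON" };
  }
  const seq = calls.map(c => c.tool);
  let cursor = -1;
  for (const tool of ["web_search", "fetch_url", "send_email"]) {
    const idx = seq.indexOf(tool, cursor + 1);
    if (idx === -1) return { pass: false, reason: `missing or out-of-order ${tool}` };
    cursor = idx;
  }
  // Tools with no business here are penalized; run_shell for counting is tolerated.
  const stray = seq.filter(t => ["query_database", "read_file"].includes(t));
  if (stray.length) return { pass: false, reason: `unneeded tools: ${stray.join(", ")}` };
  return { pass: true, reason: "ok" };
}
```

gemini-3-pro's prose preamble fails the very first check, before the call sequence is even examined.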

2026-05-02 · benchmark

Summarize 1000-word post into 3 bullets — frontier head-to-head

Prompt
Summarize this 1000-word blog post about Anthropic's Claude 4.7 release into exactly 3 bullets, each ≤ 25 words. Return only the bullets, no preamble.
| Model | Latency | Cost | Verdict |
| --- | --- | --- | --- |
| claude-opus-4-7 | 1240ms | $0.022 | Win |
| gpt-5 | 980ms | $0.018 | Tie |
| gemini-3-pro | 1120ms | $0.014 | Loss |

Responses

claude-opus-4-7

• Anthropic shipped Claude 4.7 with 1M-token context now extended to the Opus tier.
• Tool-use determinism improved on high-fan-out parallel calls; cache TTL stays at 5 min.
• Early users report ~12% p95 latency drop on Sonnet/Haiku — Opus latency improvement claimed but unverified.

gpt-5

• Claude 4.7 released by Anthropic, headline: 1M context arrives on the Opus tier.
• Anthropic claims ~12% p95 latency improvement across the model family.
• Tool-call behavior more deterministic in parallel/high-fan-out scenarios.

gemini-3-pro

• Anthropic released Claude 4.7 today with several updates.
• The 1M context window is now available on Opus tier in addition to Sonnet and Haiku, which is a major change for long-context agentic workflows.
• Latency was reduced by approximately 12% according to Anthropic, and tool-use behavior was improved.

Same prompt, same input. Latency measured from API call start to last byte. Cost from official $/Mtok pricing (May 2026).
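
The format constraints are trivially checkable. A minimal sketch — the accepted bullet markers and the whitespace word-splitting rule are my assumptions:

```js
// Check "exactly 3 bullets, each <= 25 words, no preamble".
function checkBullets(raw) {
  const lines = raw.trim().split("\n").map(l => l.trim()).filter(Boolean);
  if (lines.length !== 3) {
    return { ok: false, reason: `expected 3 bullets, got ${lines.length} lines` };
  }
  for (const line of lines) {
    if (!/^[•*-]\s/.test(line)) return { ok: false, reason: "non-bullet line (preamble?)" };
    const words = line.replace(/^[•*-]\s+/, "").split(/\s+/).length;
    if (words > 25) return { ok: false, reason: `bullet runs ${words} words` };
  }
  return { ok: true, reason: "ok" };
}
```

Note that under whitespace splitting, hyphenated terms like "long-context" count as one word.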

Tip