Benchmarks

2026-05-09 · benchmark

Fix off-by-one in cursor pagination — agentic code edit head-to-head

Show prompt

Here is a Node.js handler for cursor pagination:

```js
function paginate(items, cursor, limit = 20) {
  const start = cursor ? items.findIndex(i => i.id === cursor) : 0;
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}
```

It has a subtle off-by-one bug: when called with a cursor, the first item returned is the cursor itself (it should be the item *after* the cursor). Fix the function. Preserve the same signature and return shape. Add a single inline comment only where the bug was. Return only the corrected function, no preamble.

Model	Latency	Cost	Verdict
claude-opus-4-7	2240ms	$0.011	win
gpt-5	1890ms	$0.009	win
gemini-3-pro	2050ms	$0.007	tie
qwen3.6-35b-a3b-nvfp4	920ms	$0.001	win

Show responses

claude-opus-4-7

function paginate(items, cursor, limit = 20) {
  const start = cursor ? items.findIndex(i => i.id === cursor) + 1 : 0; // bug fix: skip cursor itself
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}

gpt-5

function paginate(items, cursor, limit = 20) {
  // bug fix: cursor item itself was being returned; advance past it
  const start = cursor ? items.findIndex(i => i.id === cursor) + 1 : 0;
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}

gemini-3-pro

function paginate(items, cursor, limit = 20) {
  let start = 0;
  if (cursor) {
    const idx = items.findIndex(i => i.id === cursor);
    if (idx >= 0) start = idx + 1; // bug fix: cursor was inclusive
  }
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}

qwen3.6-35b-a3b-nvfp4

function paginate(items, cursor, limit = 20) {
  const start = cursor ? items.findIndex(i => i.id === cursor) + 1 : 0; // off-by-one fix
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}

Tests whether the model spots the off-by-one (start should be findIndex(...) + 1) without breaking the cursor=null branch. Latency is measured client-side. Cost is computed from public May 2026 pricing. All models received the identical prompt.
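A minimal sketch of how the two conditions in this note could be checked automatically. The fixture (`items` with ids "a" through "e") and the helper name `passesOffByOneCheck` are hypothetical and are not part of the benchmark harness.

```js
// The corrected handler, matching the one-line fixes in the responses.
function paginate(items, cursor, limit = 20) {
  const start = cursor ? items.findIndex(i => i.id === cursor) + 1 : 0;
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}

// Hypothetical fixture: five items with ids "a" through "e".
const items = ["a", "b", "c", "d", "e"].map(id => ({ id }));

// Returns true when a candidate implementation passes both checks.
function passesOffByOneCheck(fn) {
  const ids = page => page.data.map(i => i.id).join(",");
  // No cursor: the page must start at the first item.
  const noCursorOk = ids(fn(items, null, 2)) === "a,b";
  // Cursor "b": the page must start at the item after "b".
  const withCursorOk = ids(fn(items, "b", 2)) === "c,d";
  return noCursorOk && withCursorOk;
}

console.log(passesOffByOneCheck(paginate)); // true
```

The check covers only what the prompt asked for: the first item after a cursor and the no-cursor branch. It does not feed `next` back in as a cursor.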

2026-05-09 · benchmark

JSON extraction from messy meeting notes — schema fidelity test

Show prompt

Extract the following from these meeting notes into strict JSON matching this exact schema:

{
  "date": "YYYY-MM-DD",
  "attendees": ["name1", "name2"],
  "action_items": [{"owner": "name", "task": "...", "due": "YYYY-MM-DD or null"}]
}

Notes:
---
Met today (it's the 9th of May 2026, finally Friday) — me, Sarah Chen, Raj P, and Tomás joined late from his car. Sarah said she'd ship the auth migration by next Wed (so 13th?). Raj is still chasing the SSO config from the vendor — no firm date, said "sometime next week". Tomás owes a draft of the launch post but said no rush. I'll handle the customer interview write-up by EOD Monday.
---

Return ONLY the JSON. No preamble, no markdown fences, no trailing comments.

Model	Latency	Cost	Verdict
claude-opus-4-7	1620ms	$0.014	win
gpt-5	1340ms	$0.012	tie
gemini-3-pro	1560ms	$0.008	loss
qwen3.6-35b-a3b-nvfp4	720ms	$0.001	tie

Show responses

claude-opus-4-7

{"date":"2026-05-09","attendees":["Sarah Chen","Raj P","Tomás"],"action_items":[{"owner":"Sarah Chen","task":"Ship the auth migration","due":"2026-05-13"},{"owner":"Raj P","task":"Chase SSO config from vendor","due":null},{"owner":"Tomás","task":"Draft of the launch post","due":null},{"owner":"me","task":"Customer interview write-up","due":"2026-05-11"}]}

gpt-5

{"date":"2026-05-09","attendees":["Sarah Chen","Raj P","Tomás"],"action_items":[{"owner":"Sarah Chen","task":"Ship the auth migration","due":"2026-05-13"},{"owner":"Raj P","task":"Chase SSO config from vendor","due":null},{"owner":"Tomás","task":"Draft launch post","due":null},{"owner":"narrator","task":"Customer interview write-up","due":"2026-05-12"}]}

gemini-3-pro

```json
{"date":"2026-05-09","attendees":["Sarah Chen","Raj P","Tomás"],"action_items":[{"owner":"Sarah Chen","task":"ship auth migration","due":"2026-05-13"},{"owner":"Raj P","task":"SSO config from vendor","due":null}]}
```

qwen3.6-35b-a3b-nvfp4

{"date":"2026-05-09","attendees":["Sarah Chen","Raj P","Tomás"],"action_items":[{"owner":"Sarah Chen","task":"Ship auth migration","due":"2026-05-13"},{"owner":"Raj P","task":"Chase SSO config from vendor","due":null},{"owner":"Tomás","task":"Draft launch post","due":null},{"owner":"author","task":"Customer interview write-up","due":"2026-05-12"}]}

Tests date inference (today = 2026-05-09), name handling (names are left as written, so 'Raj P' is not expanded to a guessed full name such as 'Raj Patel'), null due dates ('sometime next week' and 'no rush' both map to null), and adherence to the no-prose constraint. Responses wrapped in markdown fences are rejected.
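The structural part of these checks could be automated along the following lines. This is a sketch under assumptions: `checkExtraction`, `isDate`, and the problem messages are hypothetical names, and the function checks only shape and fence rejection, not whether the extracted values are right.

```js
// Hypothetical grader; names and messages are illustrative only.
const FENCE = "`".repeat(3); // the markdown code-fence marker
const isDate = v => typeof v === "string" && /^\d{4}-\d{2}-\d{2}$/.test(v);

// Returns a list of problems; an empty list means the shape is acceptable.
function checkExtraction(raw) {
  const problems = [];
  if (raw.includes(FENCE)) problems.push("markdown fence");
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch {
    problems.push("not valid JSON");
    return problems;
  }
  if (parsed === null || typeof parsed !== "object") {
    problems.push("top level is not an object");
    return problems;
  }
  if (!isDate(parsed.date)) problems.push("bad date");
  const names = parsed.attendees;
  if (!Array.isArray(names) || !names.every(n => typeof n === "string")) {
    problems.push("bad attendees");
  }
  if (!Array.isArray(parsed.action_items)) {
    problems.push("bad action_items");
    return problems;
  }
  parsed.action_items.forEach((item, i) => {
    if (item === null || typeof item !== "object") {
      problems.push(`item ${i}: not an object`);
      return;
    }
    if (typeof item.owner !== "string" || typeof item.task !== "string") {
      problems.push(`item ${i}: bad owner or task`);
    }
    // due must be a YYYY-MM-DD string or null, as the schema states.
    if (item.due !== null && !isDate(item.due)) problems.push(`item ${i}: bad due`);
  });
  return problems;
}
```

A fenced response fails twice here, once for the fence and once because the fenced text no longer parses as JSON.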

2026-05-09 · benchmark

Needle in a haystack at 500K tokens — long-context recall

Show prompt

You will receive a 500,000-token document containing the full text of 12 NeurIPS papers concatenated. Buried at character offset 1,847,392 is a single sentence: 'The secret access code for the May 2026 builder-daily benchmark is QUARTZ-7392-DELTA.'

Question: What is the secret access code? Return only the code itself, nothing else.

Document follows below the marker.

--- DOCUMENT START ---
[~500,000 tokens of NeurIPS paper text]
--- DOCUMENT END ---

Model	Latency	Cost	Verdict
claude-opus-4-7	18420ms	$7.535	win
gpt-5	22180ms	$6.279	win
gemini-3-pro	14620ms	$0.628	win
qwen3.6-35b-a3b-nvfp4 (262K cap)	0ms	$0.000	error

Show responses

claude-opus-4-7

QUARTZ-7392-DELTA

gpt-5

QUARTZ-7392-DELTA

gemini-3-pro

QUARTZ-7392-DELTA

qwen3.6-35b-a3b-nvfp4 (262K cap)

(No response: the 500K-token input exceeds this model's 262K context cap.)

Pure recall test with the needle at the 70% depth point of a 500K-token input. Latency includes prompt processing, which dominates at this scale. Cost is based on the full 500K input tokens. A verdict of 'win' means an exact match on 'QUARTZ-7392-DELTA'. Models were tested at their vendor-claimed maximum context.
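The depth placement and exact-match verdict can be sketched as below. `insertNeedle` and `verdict` are hypothetical helpers. Depth is measured here in characters rather than tokens, and trimming surrounding whitespace before comparing, as well as the "loss" label for a miss, are assumptions not stated in the note.

```js
// Hypothetical helper: place a needle at a fractional depth of the haystack.
// Depth is measured in characters here, which is an assumption.
function insertNeedle(haystack, needle, depth) {
  const offset = Math.floor(haystack.length * depth);
  return haystack.slice(0, offset) + needle + haystack.slice(offset);
}

// Exact-match verdict. Trimming surrounding whitespace is an assumption.
function verdict(response, expected) {
  return response.trim() === expected ? "win" : "loss";
}

console.log(insertNeedle("x".repeat(10), "N", 0.7)); // xxxxxxxNxxx
console.log(verdict("QUARTZ-7392-DELTA\n", "QUARTZ-7392-DELTA")); // win
console.log(verdict("The code is QUARTZ-7392-DELTA", "QUARTZ-7392-DELTA")); // loss
```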

2026-05-09 · benchmark

Pick the right tool — multi-step task with 6 available functions

Show prompt

You have these tools available: `web_search(query)`, `fetch_url(url)`, `read_file(path)`, `run_shell(cmd)`, `query_database(sql)`, `send_email(to, subject, body)`.

Task: Find the latest published date of Anthropic's Claude 4.7 release announcement, fetch the full text of that page, count how many times the word 'tool' appears, and email the count to ops@example.com with subject 'Claude 4.7 tool count'.

Respond with ONLY a JSON array of tool calls in execution order: [{"tool": "...", "args": {...}}, ...]. No prose, no markdown, no other text.

Model	Latency	Cost	Verdict
claude-opus-4-7	1480ms	$0.013	win
gpt-5	1320ms	$0.011	tie
gemini-3-pro	1810ms	$0.009	loss
kimi-k2.6	980ms	$0.001	win

Show responses

claude-opus-4-7

[
  {"tool": "web_search", "args": {"query": "Anthropic Claude 4.7 release announcement date site:anthropic.com"}},
  {"tool": "fetch_url", "args": {"url": "<url-from-prior-result>"}},
  {"tool": "send_email", "args": {"to": "ops@example.com", "subject": "Claude 4.7 tool count", "body": "<count>"}}
]

gpt-5

[
  {"tool": "web_search", "args": {"query": "Claude 4.7 release announcement Anthropic"}},
  {"tool": "fetch_url", "args": {"url": "<top result url>"}},
  {"tool": "run_shell", "args": {"cmd": "echo $TEXT | grep -o tool | wc -l"}},
  {"tool": "send_email", "args": {"to": "ops@example.com", "subject": "Claude 4.7 tool count", "body": "<count>"}}
]

gemini-3-pro

I'll help you with that. Here's the plan: [
  {"tool": "web_search", "args": {"query": "Claude 4.7"}},
  {"tool": "web_search", "args": {"query": "Anthropic announcement"}},
  {"tool": "fetch_url", "args": {"url": "..."}},
  {"tool": "send_email", "args": {"to": "ops@example.com", "subject": "Result", "body": "..."}}
]

kimi-k2.6

[
  {"tool": "web_search", "args": {"query": "Anthropic Claude 4.7 release announcement"}},
  {"tool": "fetch_url", "args": {"url": "<from search>"}},
  {"tool": "send_email", "args": {"to": "ops@example.com", "subject": "Claude 4.7 tool count", "body": "<count>"}}
]

Tests whether the model picks the right 3 tools in the right order: web_search → fetch_url → (count locally, no tool) → send_email. Penalized: extra tools, wrong order, a missing send_email, or a prose preamble.
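The penalties listed in this note could be applied mechanically with something like the sketch below. `scoreToolPlan`, `EXPECTED`, and the penalty strings are hypothetical. How penalties map onto win, tie, or loss is not specified in the source, so the function only lists them.

```js
// Expected tool sequence, taken from the note above.
const EXPECTED = ["web_search", "fetch_url", "send_email"];

// Returns a list of penalties; an empty list means a clean plan.
function scoreToolPlan(raw) {
  let calls;
  try {
    calls = JSON.parse(raw);
  } catch {
    // Any prose preamble or markdown wrapper makes the response unparseable.
    return ["not bare JSON"];
  }
  if (!Array.isArray(calls)) return ["not a JSON array"];

  const penalties = [];
  const tools = calls.map(c => (c && typeof c === "object" ? c.tool : undefined));
  if (!tools.includes("send_email")) penalties.push("missing send_email");

  const extras = tools.filter(t => !EXPECTED.includes(t));
  if (extras.length > 0) penalties.push("extra tools: " + extras.join(", "));

  // Ignoring extras, the remaining calls must match the expected sequence exactly,
  // so a repeated or missing expected tool is also reported here.
  const core = tools.filter(t => EXPECTED.includes(t));
  if (core.join(",") !== EXPECTED.join(",")) penalties.push("wrong order or count");
  return penalties;
}
```

Run against the responses shown, the three-call plans come back clean, the plan with `run_shell` is flagged for an extra tool, and the response that opens with prose fails at the parse step.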

2026-05-02 · benchmark

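Compress a 1,000-character article into 3 bullets: frontier model showdown

Show prompt

Summarize this 1,000-character blog post about the Anthropic Claude 4.7 release in exactly 3 bullets. Each bullet must be 25 characters or fewer. Return only the bullets, with no preamble. [Translated from Japanese; the character limits are counted in Japanese characters.]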

Model	Latency	Cost	Verdict
claude-opus-4-7	1240ms	$0.022	win
gpt-5	980ms	$0.018	tie
gemini-3-pro	1120ms	$0.014	loss

Show responses

claude-opus-4-7

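• Anthropic announces Claude 4.7; the 1M-token context finally reaches the Opus tier.
• Highly parallel tool calling is more stable. The prompt cache TTL stays at 5 minutes.
• Early users report p95 latency down about 12% on Sonnet/Haiku. The Opus improvement is claimed but unverified.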

gpt-5

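• Anthropic announces Claude 4.7; the headline is 1M context reaching the Opus tier.
• Anthropic claims p95 latency improved about 12% across the whole model family.
• Tool-calling behavior is more stable in parallel and high-concurrency scenarios.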

gemini-3-pro

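• Anthropic announced Claude 4.7 today, and it includes several updates.
• The 1M context window is now available on the Opus tier in addition to Sonnet and Haiku, a major change for long-context agentic workflows.
• According to Anthropic, latency is down about 12% and tool-use behavior has also improved.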

Same prompt, same input. Latency is measured from the start of the API call to the final byte. Cost is calculated from official $/Mtok pricing (May 2026).
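The format constraints in this prompt (exactly 3 bullets, each within a character limit) lend themselves to a mechanical check. The sketch below is an assumption about how that could be done, not the harness used here: `checkBullets` is a hypothetical name, the "•" marker is taken from the responses shown, and characters are counted as Unicode code points.

```js
// Hypothetical format checker: exactly 3 bullets, each at most maxChars long.
function checkBullets(response, maxChars = 25) {
  const lines = response
    .split("\n")
    .map(line => line.trim())
    .filter(line => line.length > 0);
  const problems = [];
  if (lines.length !== 3) problems.push(`expected 3 bullets, got ${lines.length}`);
  lines.forEach((line, i) => {
    if (!line.startsWith("•")) {
      problems.push(`line ${i + 1}: not a bullet`);
      return;
    }
    // Strip the marker, then count code points rather than UTF-16 units.
    const text = line.slice(1).trim();
    const count = Array.from(text).length;
    if (count > maxChars) problems.push(`line ${i + 1}: ${count} characters`);
  });
  return problems;
}

console.log(checkBullets("• one\n• two\n• three")); // []
```

Counting code points keeps the limit meaningful for Japanese text, where a character outside the Basic Multilingual Plane would otherwise count twice.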