Benchmarks

2026-05-09 · benchmark

Fix off-by-one in cursor pagination — agentic code edit head-to-head

Prompt

Here is a Node.js handler for cursor pagination:

```js
function paginate(items, cursor, limit = 20) {
  const start = cursor ? items.findIndex(i => i.id === cursor) : 0;
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}
```

It has a subtle off-by-one bug: when called with a cursor, the first item returned is the cursor itself (it should be the item *after* the cursor). Fix the function. Preserve the same signature and return shape. Add a single inline comment only where the bug was. Return only the corrected function, no preamble.

Model	Latency	Cost	Verdict
claude-opus-4-7	2240ms	$0.011	win
gpt-5	1890ms	$0.009	win
gemini-3-pro	2050ms	$0.007	tie
qwen3.6-35b-a3b-nvfp4	920ms	$0.001	win

Responses

claude-opus-4-7

function paginate(items, cursor, limit = 20) {
  const start = cursor ? items.findIndex(i => i.id === cursor) + 1 : 0; // bug fix: skip cursor itself
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}

gpt-5

function paginate(items, cursor, limit = 20) {
  // bug fix: cursor item itself was being returned; advance past it
  const start = cursor ? items.findIndex(i => i.id === cursor) + 1 : 0;
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}

gemini-3-pro

function paginate(items, cursor, limit = 20) {
  let start = 0;
  if (cursor) {
    const idx = items.findIndex(i => i.id === cursor);
    if (idx >= 0) start = idx + 1; // bug fix: cursor was inclusive
  }
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}

qwen3.6-35b-a3b-nvfp4

function paginate(items, cursor, limit = 20) {
  const start = cursor ? items.findIndex(i => i.id === cursor) + 1 : 0; // off-by-one fix
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}

Tests whether the model spots the off-by-one (start should be findIndex(...) + 1) without breaking the cursor=null branch. Latency is measured client-side. Cost is computed from public May 2026 pricing. All models received the exact same prompt.
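The two properties this test checks can be sketched as a small harness. The `paginate` below is the patched form that three of the four responses share; the sample `items` array and the `checkFix` helper are hypothetical and not part of the benchmark itself.

```js
// Patched function, in the form returned by three of the four models above.
function paginate(items, cursor, limit = 20) {
  const start = cursor ? items.findIndex(i => i.id === cursor) + 1 : 0; // skip the cursor item
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}

// Hypothetical check: returns true when both tested properties hold.
function checkFix(fn) {
  const items = ["a", "b", "c", "d", "e"].map(id => ({ id }));
  const first = fn(items, null, 2); // cursor=null branch must still start at index 0
  const afterB = fn(items, "b", 2); // with a cursor, the page must start after it
  return first.data[0].id === "a" && afterB.data[0].id === "c";
}

console.log(checkFix(paginate)); // true
```

This sketch checks only the two properties named above. It does not walk successive pages, so it says nothing about whether `next` stays consistent with an exclusive cursor.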

2026-05-09 · benchmark

JSON extraction from messy meeting notes — schema fidelity test

Prompt

Extract the following from these meeting notes into strict JSON matching this exact schema:

{
  "date": "YYYY-MM-DD",
  "attendees": ["name1", "name2"],
  "action_items": [{"owner": "name", "task": "...", "due": "YYYY-MM-DD or null"}]
}

Notes:
---
Met today (it's the 9th of May 2026, finally Friday) — me, Sarah Chen, Raj P, and Tomás joined late from his car. Sarah said she'd ship the auth migration by next Wed (so 13th?). Raj is still chasing the SSO config from the vendor — no firm date, said "sometime next week". Tomás owes a draft of the launch post but said no rush. I'll handle the customer interview write-up by EOD Monday.
---

Return ONLY the JSON. No preamble, no markdown fences, no trailing comments.

Model	Latency	Cost	Verdict
claude-opus-4-7	1620ms	$0.014	win
gpt-5	1340ms	$0.012	tie
gemini-3-pro	1560ms	$0.008	loss
qwen3.6-35b-a3b-nvfp4	720ms	$0.001	tie

Responses

claude-opus-4-7

{"date":"2026-05-09","attendees":["Sarah Chen","Raj P","Tomás"],"action_items":[{"owner":"Sarah Chen","task":"Ship the auth migration","due":"2026-05-13"},{"owner":"Raj P","task":"Chase SSO config from vendor","due":null},{"owner":"Tomás","task":"Draft of the launch post","due":null},{"owner":"me","task":"Customer interview write-up","due":"2026-05-11"}]}

gpt-5

{"date":"2026-05-09","attendees":["Sarah Chen","Raj P","Tomás"],"action_items":[{"owner":"Sarah Chen","task":"Ship the auth migration","due":"2026-05-13"},{"owner":"Raj P","task":"Chase SSO config from vendor","due":null},{"owner":"Tomás","task":"Draft launch post","due":null},{"owner":"narrator","task":"Customer interview write-up","due":"2026-05-12"}]}

gemini-3-pro

```json
{"date":"2026-05-09","attendees":["Sarah Chen","Raj P","Tomás"],"action_items":[{"owner":"Sarah Chen","task":"ship auth migration","due":"2026-05-13"},{"owner":"Raj P","task":"SSO config from vendor","due":null}]}
```

qwen3.6-35b-a3b-nvfp4

{"date":"2026-05-09","attendees":["Sarah Chen","Raj P","Tomás"],"action_items":[{"owner":"Sarah Chen","task":"Ship auth migration","due":"2026-05-13"},{"owner":"Raj P","task":"Chase SSO config from vendor","due":null},{"owner":"Tomás","task":"Draft launch post","due":null},{"owner":"author","task":"Customer interview write-up","due":"2026-05-12"}]}

Tests date inference (today = 2026-05-09), name handling (keep 'Sarah Chen' and 'Raj P' as written; do not expand 'Raj P' to a guessed full name), null due dates ('sometime next week' and 'no rush' both map to null), and adherence to the no-prose constraint. Responses wrapped in markdown fences are rejected.
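A shape check for these criteria might look like the sketch below. `validateExtraction`, `DATE_RE`, and the error strings are hypothetical names chosen for illustration; this is not the grader used for the table above, and it checks structure only, not whether the extracted values are right or complete.

```js
// Hypothetical validator for the schema in the prompt; not the benchmark's grader.
const FENCE = "`".repeat(3); // three backticks, built this way to keep the sample readable
const DATE_RE = /^\d{4}-\d{2}-\d{2}$/;

function validateExtraction(raw) {
  const errors = [];
  if (raw.includes(FENCE)) errors.push("markdown fence present");
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch {
    errors.push("not valid JSON");
    return errors;
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    errors.push("top level must be an object");
    return errors;
  }
  if (typeof parsed.date !== "string" || !DATE_RE.test(parsed.date)) {
    errors.push("date must be YYYY-MM-DD");
  }
  if (!Array.isArray(parsed.attendees) || !parsed.attendees.every(a => typeof a === "string")) {
    errors.push("attendees must be an array of strings");
  }
  if (!Array.isArray(parsed.action_items)) {
    errors.push("action_items must be an array");
    return errors;
  }
  parsed.action_items.forEach((item, i) => {
    const dueOk = item?.due === null || (typeof item?.due === "string" && DATE_RE.test(item.due));
    const ok = item !== null && typeof item === "object" &&
      typeof item.owner === "string" && typeof item.task === "string" && dueOk;
    if (!ok) errors.push(`action_items[${i}] is malformed`);
  });
  return errors;
}

const sample = '{"date":"2026-05-09","attendees":["Sarah Chen"],' +
  '"action_items":[{"owner":"Sarah Chen","task":"Ship the auth migration","due":"2026-05-13"}]}';
console.log(validateExtraction(sample).length); // 0
```

An empty array means the response has the right shape. A fenced response is flagged twice by design, once for the fence and once because the fenced text no longer parses as JSON.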

2026-05-09 · benchmark

Needle in a haystack at 500K tokens — long-context recall

Prompt

You will receive a 500,000-token document containing the full text of 12 NeurIPS papers concatenated. Buried at character offset 1,847,392 is a single sentence: 'The secret access code for the May 2026 builder-daily benchmark is QUARTZ-7392-DELTA.'

Question: What is the secret access code? Return only the code itself, nothing else.

Document follows below the marker.

--- DOCUMENT START ---
[~500,000 tokens of NeurIPS paper text]
--- DOCUMENT END ---

Model	Latency	Cost	Verdict
claude-opus-4-7	18420ms	$7.535	win
gpt-5	22180ms	$6.279	win
gemini-3-pro	14620ms	$0.628	win
qwen3.6-35b-a3b-nvfp4 (262K cap)	0ms	$0.000	error

Responses

claude-opus-4-7

QUARTZ-7392-DELTA

gpt-5

QUARTZ-7392-DELTA

gemini-3-pro

QUARTZ-7392-DELTA

qwen3.6-35b-a3b-nvfp4 (262K cap)

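Pure recall test with the needle at the 70% depth point of a 500K-token input. Latency includes prompt processing, which dominates at this scale. Cost is based on the full 500K input tokens. A 'win' verdict requires an exact match with 'QUARTZ-7392-DELTA'. Models were tested at their vendor-claimed maximum context; qwen3.6-35b-a3b-nvfp4, capped at 262K tokens, is recorded as an error with no response.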
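As a rough illustration of how a needle is placed at a given depth and how an exact-match verdict is scored, here is a minimal sketch. `insertNeedle`, `scoreRecall`, and the synthetic "lorem ipsum" haystack are assumptions for the example, not the benchmark's harness; trimming whitespace before comparison is also an assumption.

```js
// Hypothetical helpers for a needle-in-a-haystack run; not the benchmark's own harness.
function insertNeedle(haystack, needle, depth) {
  // depth is a fraction in [0, 1]; 0.7 places the needle 70% of the way in.
  const offset = Math.round(haystack.length * depth);
  return {
    text: haystack.slice(0, offset) + needle + haystack.slice(offset),
    offset,
  };
}

function scoreRecall(response, expected) {
  // Exact match after trimming surrounding whitespace; anything else is a loss.
  return response.trim() === expected ? "win" : "loss";
}

const needle = "The secret access code for the May 2026 builder-daily benchmark is QUARTZ-7392-DELTA.";
const { text, offset } = insertNeedle("lorem ipsum ".repeat(1000), needle, 0.7);
console.log(offset, text.indexOf(needle)); // 8400 8400
console.log(scoreRecall("QUARTZ-7392-DELTA\n", "QUARTZ-7392-DELTA")); // win
```

The depth here is measured in characters of the synthetic haystack. A real run would more likely measure depth in tokens, which depends on each vendor's tokenizer.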

2026-05-09 · benchmark

Pick the right tool — multi-step task with 6 available functions

Prompt

You have these tools available: `web_search(query)`, `fetch_url(url)`, `read_file(path)`, `run_shell(cmd)`, `query_database(sql)`, `send_email(to, subject, body)`.

Task: Find the latest published date of Anthropic's Claude 4.7 release announcement, fetch the full text of that page, count how many times the word 'tool' appears, and email the count to ops@example.com with subject 'Claude 4.7 tool count'.

Respond with ONLY a JSON array of tool calls in execution order: [{"tool": "...", "args": {...}}, ...]. No prose, no markdown, no other text.

Model	Latency	Cost	Verdict
claude-opus-4-7	1480ms	$0.013	win
gpt-5	1320ms	$0.011	tie
gemini-3-pro	1810ms	$0.009	loss
kimi-k2.6	980ms	$0.001	win

Responses

claude-opus-4-7

[
  {"tool": "web_search", "args": {"query": "Anthropic Claude 4.7 release announcement date site:anthropic.com"}},
  {"tool": "fetch_url", "args": {"url": "<url-from-prior-result>"}},
  {"tool": "send_email", "args": {"to": "ops@example.com", "subject": "Claude 4.7 tool count", "body": "<count>"}}
]

gpt-5

[
  {"tool": "web_search", "args": {"query": "Claude 4.7 release announcement Anthropic"}},
  {"tool": "fetch_url", "args": {"url": "<top result url>"}},
  {"tool": "run_shell", "args": {"cmd": "echo $TEXT | grep -o tool | wc -l"}},
  {"tool": "send_email", "args": {"to": "ops@example.com", "subject": "Claude 4.7 tool count", "body": "<count>"}}
]

gemini-3-pro

I'll help you with that. Here's the plan: [
  {"tool": "web_search", "args": {"query": "Claude 4.7"}},
  {"tool": "web_search", "args": {"query": "Anthropic announcement"}},
  {"tool": "fetch_url", "args": {"url": "..."}},
  {"tool": "send_email", "args": {"to": "ops@example.com", "subject": "Result", "body": "..."}}
]

kimi-k2.6

[
  {"tool": "web_search", "args": {"query": "Anthropic Claude 4.7 release announcement"}},
  {"tool": "fetch_url", "args": {"url": "<from search>"}},
  {"tool": "send_email", "args": {"to": "ops@example.com", "subject": "Claude 4.7 tool count", "body": "<count>"}}
]

Tests whether the model picks the right 3 tools in the right order: web_search → fetch_url → (count locally, no tool) → send_email. Penalized: extra tools, wrong order, a missing send_email, or a prose preamble.
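The penalties listed above could be checked mechanically along these lines. `scorePlan`, `EXPECTED`, and the penalty labels are hypothetical; how penalties map to a win, tie, or loss verdict is not specified in the source and is left out.

```js
// Hypothetical scorer for the listed penalties; names and labels are illustrative.
const EXPECTED = ["web_search", "fetch_url", "send_email"];

function scorePlan(raw) {
  const penalties = [];
  const trimmed = raw.trim();
  if (!trimmed.startsWith("[")) penalties.push("prose preamble");

  // Parse the outermost [...] span so a preamble does not hide the other checks.
  const begin = trimmed.indexOf("[");
  const end = trimmed.lastIndexOf("]");
  let calls;
  try {
    calls = JSON.parse(trimmed.slice(begin, end + 1));
  } catch {
    return [...penalties, "invalid JSON"];
  }
  if (!Array.isArray(calls)) return [...penalties, "invalid JSON"];

  const tools = calls.map(c => c?.tool);
  if (!tools.includes("send_email")) penalties.push("missing send_email");
  if (tools.some(t => !EXPECTED.includes(t)) || tools.length > EXPECTED.length) {
    penalties.push("extra tools");
  }

  // Order check: expected tools, sorted by first use, must match the expected order.
  const present = EXPECTED.filter(t => tools.includes(t));
  const byFirstUse = [...present].sort((a, b) => tools.indexOf(a) - tools.indexOf(b));
  if (byFirstUse.join() !== present.join()) penalties.push("wrong order");
  return penalties;
}

const cleanPlan = '[{"tool":"web_search","args":{}},{"tool":"fetch_url","args":{}},{"tool":"send_email","args":{}}]';
console.log(scorePlan(cleanPlan).length); // 0
```

An empty result means no penalty applied. A plan that adds a `run_shell` step for counting would be flagged with "extra tools", since the note above expects the count to happen locally.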

2026-05-02 · benchmark

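Compress a 1,000-character post into 3 bullets: frontier model showdown

Prompt (translated from Korean)

Summarize a roughly 1,000-character blog post about the Anthropic Claude 4.7 release in exactly 3 bullets. Each bullet must be 25 characters or fewer. Reply with the bullets only, no preamble.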

Model	Latency	Cost	Verdict
claude-opus-4-7	1240ms	$0.022	win
gpt-5	980ms	$0.018	tie
gemini-3-pro	1120ms	$0.014	loss

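Responses (translated from Korean)

claude-opus-4-7

• Anthropic releases Claude 4.7; the 1M-token context finally reaches the Opus tier.
• Tool-call stability improves under high-concurrency tool calling; prompt cache TTL stays at 5 minutes.
• Early user reports show p95 latency for Sonnet/Haiku down about 12%; the Opus improvement is as announced and unverified.

gpt-5

• Anthropic releases Claude 4.7; the headline is 1M context reaching the Opus tier.
• Anthropic announced that p95 latency improved about 12% across the model family.
• Tool-call behavior is more stable under parallel and high-concurrency conditions.

gemini-3-pro

• Anthropic released Claude 4.7 today, with several updates included.
• The 1M context window is now supported on the Opus tier in addition to Sonnet and Haiku, a major change for long-context agent workflows.
• According to Anthropic, latency dropped about 12% and tool-use behavior also improved.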

Same prompt and same input for every model. Latency is measured from the start of the API call to the last byte. Cost uses official $/Mtok pricing (May 2026).
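For reference, a per-call cost under $/Mtok pricing is typically computed as in this sketch. The function name and the prices in the example call are hypothetical placeholders, not any vendor's actual rates.

```js
// Hypothetical cost helper; prices are in USD per million tokens.
function costUsd(tokensIn, tokensOut, priceInPerMtok, priceOutPerMtok) {
  return (tokensIn * priceInPerMtok + tokensOut * priceOutPerMtok) / 1e6;
}

// Placeholder rates: $5/Mtok in, $25/Mtok out.
console.log(costUsd(1000, 200, 5, 25).toFixed(3)); // 0.010
```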