Benchmarks

2026-05-09 · benchmark

Fixing an off-by-one bug in cursor pagination: an agentic code-editing showdown

Prompt

Here is a Node.js handler for cursor pagination:

```js
function paginate(items, cursor, limit = 20) {
  const start = cursor ? items.findIndex(i => i.id === cursor) : 0;
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}
```

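It has a subtle off-by-one bug: when a cursor is given, the first item returned is the cursor item itself (it should be the item *after* the cursor). Fix the function. Keep the same signature and return shape. Add a single-line inline comment only at the site of the bug. Return only the corrected function, with no preamble.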

Model	Latency	Cost	Verdict
claude-opus-4-7	2240ms	$0.011	Win
gpt-5	1890ms	$0.009	Win
gemini-3-pro	2050ms	$0.007	Tie
qwen3.6-35b-a3b-nvfp4	920ms	$0.001	Win

Responses

claude-opus-4-7

function paginate(items, cursor, limit = 20) {
  const start = cursor ? items.findIndex(i => i.id === cursor) + 1 : 0; // bug fix: skip cursor itself
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}

gpt-5

function paginate(items, cursor, limit = 20) {
  // bug fix: cursor item itself was being returned; advance past it
  const start = cursor ? items.findIndex(i => i.id === cursor) + 1 : 0;
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}

gemini-3-pro

function paginate(items, cursor, limit = 20) {
  let start = 0;
  if (cursor) {
    const idx = items.findIndex(i => i.id === cursor);
    if (idx >= 0) start = idx + 1; // bug fix: cursor was inclusive
  }
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}

qwen3.6-35b-a3b-nvfp4

function paginate(items, cursor, limit = 20) {
  const start = cursor ? items.findIndex(i => i.id === cursor) + 1 : 0; // off-by-one fix
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}

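This tests whether the model catches the off-by-one (start should be findIndex(...) + 1) without breaking the cursor=null branch. Latency was measured client-side. Cost is based on public pricing as of May 2026. All models received exactly the same prompt.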
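The two properties named above can be expressed as a small assertion script. This is a minimal sketch, not the harness used for this benchmark: the `items` fixture and the `checkPaginate` helper are hypothetical, and `paginate` is the one-line fix that several of the models returned.

```javascript
// One-line fix returned by several models: the cursor item is excluded.
function paginate(items, cursor, limit = 20) {
  const start = cursor ? items.findIndex(i => i.id === cursor) + 1 : 0;
  return {
    data: items.slice(start, start + limit),
    next: items[start + limit]?.id ?? null,
  };
}

// Hypothetical fixture: five items with ids "a" through "e".
const items = ["a", "b", "c", "d", "e"].map(id => ({ id }));

// Hypothetical helper: returns the list of failed checks (empty means pass).
function checkPaginate(fn) {
  const failures = [];
  const ids = page => page.data.map(i => i.id).join(",");

  // cursor = null must still start at the first item.
  if (ids(fn(items, null, 2)) !== "a,b") failures.push("null cursor must start at the first item");

  // A given cursor must be excluded from the page it produces.
  if (ids(fn(items, "b", 2)) !== "c,d") failures.push("cursor item must be excluded");

  return failures;
}

console.log(checkPaginate(paginate)); // []
```

The sketch covers only the two criteria stated above. It does not check what happens when the cursor id is not found, or whether following `next` from page to page visits every item; neither is part of the stated criteria.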

2026-05-09 · benchmark

Extracting JSON from messy meeting notes: a schema-strictness test

Prompt

Extract the data from the meeting notes below and produce JSON that strictly conforms to this schema:

{
  "date": "YYYY-MM-DD",
  "attendees": ["name1", "name2"],
  "action_items": [{"owner": "name", "task": "...", "due": "YYYY-MM-DD or null"}]
}

Notes:
---
Met today (it's the 9th of May 2026, finally Friday) — me, Sarah Chen, Raj P, and Tomás joined late from his car. Sarah said she'd ship the auth migration by next Wed (so 13th?). Raj is still chasing the SSO config from the vendor — no firm date, said "sometime next week". Tomás owes a draft of the launch post but said no rush. I'll handle the customer interview write-up by EOD Monday.
---

Return **only** JSON. No preamble, no markdown fences, no trailing comments.

Model	Latency	Cost	Verdict
claude-opus-4-7	1620ms	$0.014	Win
gpt-5	1340ms	$0.012	Tie
gemini-3-pro	1560ms	$0.008	Loss
qwen3.6-35b-a3b-nvfp4	720ms	$0.001	Tie

Responses

claude-opus-4-7

{"date":"2026-05-09","attendees":["Sarah Chen","Raj P","Tomás"],"action_items":[{"owner":"Sarah Chen","task":"Ship the auth migration","due":"2026-05-13"},{"owner":"Raj P","task":"Chase SSO config from vendor","due":null},{"owner":"Tomás","task":"Draft of the launch post","due":null},{"owner":"me","task":"Customer interview write-up","due":"2026-05-11"}]}

gpt-5

{"date":"2026-05-09","attendees":["Sarah Chen","Raj P","Tomás"],"action_items":[{"owner":"Sarah Chen","task":"Ship the auth migration","due":"2026-05-13"},{"owner":"Raj P","task":"Chase SSO config from vendor","due":null},{"owner":"Tomás","task":"Draft launch post","due":null},{"owner":"narrator","task":"Customer interview write-up","due":"2026-05-12"}]}

gemini-3-pro

```json
{"date":"2026-05-09","attendees":["Sarah Chen","Raj P","Tomás"],"action_items":[{"owner":"Sarah Chen","task":"ship auth migration","due":"2026-05-13"},{"owner":"Raj P","task":"SSO config from vendor","due":null}]}
```

qwen3.6-35b-a3b-nvfp4

{"date":"2026-05-09","attendees":["Sarah Chen","Raj P","Tomás"],"action_items":[{"owner":"Sarah Chen","task":"Ship auth migration","due":"2026-05-13"},{"owner":"Raj P","task":"Chase SSO config from vendor","due":null},{"owner":"Tomás","task":"Draft launch post","due":null},{"owner":"author","task":"Customer interview write-up","due":"2026-05-12"}]}

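Checks: date inference (today = 2026-05-09), name normalization (keep 'Sarah Chen' and 'Raj P' as written rather than guessing an expansion such as 'Raj Patel'), null due dates ('sometime next week' / 'no rush' = null), and adherence to the "no prose" constraint. Any response containing markdown fences counts as a failure.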
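The format-level checks above (bare JSON, no fences, `YYYY-MM-DD` or null dates) can be automated. The following is a rough sketch under assumptions: `checkExtraction`, `isDate`, and the failure labels are hypothetical names, and this is not the grader used for the verdicts above. It checks shape only; whether a due date is the *right* date still needs a human or a reference answer.

```javascript
// Hypothetical checker for the format criteria; not the benchmark's own grader.
const FENCE = "`".repeat(3); // a markdown code fence, built without typing it literally
const DATE_RE = /^\d{4}-\d{2}-\d{2}$/;

function isDate(value) {
  return typeof value === "string" && DATE_RE.test(value);
}

// Returns a list of failure reasons; an empty list means the reply passes.
function checkExtraction(raw) {
  const failures = [];
  if (raw.includes(FENCE)) failures.push("contains a markdown fence");

  let parsed;
  try {
    parsed = JSON.parse(raw.trim()); // throws on preamble, fences, or trailing text
  } catch {
    failures.push("not bare JSON");
    return failures;
  }
  if (parsed === null || typeof parsed !== "object") return ["top level is not an object"];

  if (!isDate(parsed.date)) failures.push("date is not YYYY-MM-DD");
  if (!Array.isArray(parsed.attendees)) failures.push("attendees is not an array");

  const actionItems = Array.isArray(parsed.action_items) ? parsed.action_items : null;
  if (!actionItems) failures.push("action_items is not an array");
  for (const item of actionItems ?? []) {
    const dueOk = item?.due === null || isDate(item?.due);
    if (typeof item?.owner !== "string" || typeof item?.task !== "string" || !dueOk) {
      failures.push("malformed action item");
    }
  }
  return failures;
}

const good = '{"date":"2026-05-09","attendees":["Sarah Chen"],"action_items":[{"owner":"Raj P","task":"Chase SSO config","due":null}]}';
console.log(checkExtraction(good)); // []
console.log(checkExtraction(FENCE + "json\n" + good + "\n" + FENCE).length); // 2
```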

2026-05-09 · benchmark

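Needle in a 500K-token haystack: long-context recall

Prompt

You will receive a 500,000-token document consisting of the full text of 12 NeurIPS papers concatenated together. At character offset 1,847,392, one sentence has been planted: 'The secret access code for the May 2026 builder-daily benchmark is QUARTZ-7392-DELTA.'

Question: what is the secret access code? Return only the code itself and nothing else.

The document follows after the marker below.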

--- DOCUMENT START ---
[~500,000 tokens of NeurIPS paper text]
--- DOCUMENT END ---

Model	Latency	Cost	Verdict
claude-opus-4-7	18420ms	$7.535	Win
gpt-5	22180ms	$6.279	Win
gemini-3-pro	14620ms	$0.628	Win
qwen3.6-35b-a3b-nvfp4 (262K cap)	0ms	$0.000	Error

Responses

claude-opus-4-7

QUARTZ-7392-DELTA

gpt-5

QUARTZ-7392-DELTA

gemini-3-pro

QUARTZ-7392-DELTA

qwen3.6-35b-a3b-nvfp4 (262K cap)

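A pure recall test, with the needle at the 70% depth point of a 500K-token input. Latency includes prompt processing, which dominates at this scale. Cost is computed on the full 500K-token input. A verdict of 'win' means an exact match on 'QUARTZ-7392-DELTA'. Each model was tested at its officially advertised maximum context; qwen3.6-35b-a3b-nvfp4 is capped at 262K tokens, so the 500K request returned an error and there is no response to show.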
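The setup described above (plant a sentence at a fixed depth, then score by exact match) can be sketched in a few lines. This is an illustration under assumptions, not the benchmark's harness: `insertNeedle` and `scoreRecall` are hypothetical helpers, depth is measured in characters rather than tokens, the filler text stands in for the paper text, and trimming whitespace before comparing is an assumed reading of "exact match".

```javascript
// Hypothetical helpers; the benchmark's real harness is not shown in this write-up.
function insertNeedle(haystack, needle, depth) {
  // depth is a fraction in [0, 1]; 0.7 places the needle 70% of the way in.
  const offset = Math.floor(haystack.length * depth);
  return haystack.slice(0, offset) + " " + needle + " " + haystack.slice(offset);
}

function scoreRecall(response, expected) {
  // Exact match after trimming surrounding whitespace (an assumption).
  return response.trim() === expected ? "win" : "loss";
}

const filler = "lorem ipsum ".repeat(1000); // stand-in for the concatenated papers
const needle = "The secret access code for the May 2026 builder-daily benchmark is QUARTZ-7392-DELTA.";
const doc = insertNeedle(filler, needle, 0.7);

console.log((doc.indexOf(needle) / doc.length).toFixed(2)); // close to 0.70
console.log(scoreRecall("QUARTZ-7392-DELTA\n", "QUARTZ-7392-DELTA")); // win
console.log(scoreRecall("The code is QUARTZ-7392-DELTA", "QUARTZ-7392-DELTA")); // loss
```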

2026-05-09 · benchmark

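Picking the right tool: a multi-step task with 6 functions to choose from

Prompt

You have the following tools available: `web_search(query)`, `fetch_url(url)`, `read_file(path)`, `run_shell(cmd)`, `query_database(sql)`, `send_email(to, subject, body)`.

Task: find the latest publication date of Anthropic's Claude 4.7 release announcement, fetch the full text of that page, count how many times the word 'tool' appears, and email the count to ops@example.com with the subject "Claude 4.7 tool count".

Respond with **only** a JSON array of tool calls in execution order: [{"tool": "...", "args": {...}}, ...]. No prose, no markdown, no other text.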

Model	Latency	Cost	Verdict
claude-opus-4-7	1480ms	$0.013	Win
gpt-5	1320ms	$0.011	Tie
gemini-3-pro	1810ms	$0.009	Loss
kimi-k2.6	980ms	$0.001	Win

Responses

claude-opus-4-7

[
  {"tool": "web_search", "args": {"query": "Anthropic Claude 4.7 release announcement date site:anthropic.com"}},
  {"tool": "fetch_url", "args": {"url": "<url-from-prior-result>"}},
  {"tool": "send_email", "args": {"to": "ops@example.com", "subject": "Claude 4.7 tool count", "body": "<count>"}}
]

gpt-5

[
  {"tool": "web_search", "args": {"query": "Claude 4.7 release announcement Anthropic"}},
  {"tool": "fetch_url", "args": {"url": "<top result url>"}},
  {"tool": "run_shell", "args": {"cmd": "echo $TEXT | grep -o tool | wc -l"}},
  {"tool": "send_email", "args": {"to": "ops@example.com", "subject": "Claude 4.7 tool count", "body": "<count>"}}
]

gemini-3-pro

I'll help you with that. Here's the plan: [
  {"tool": "web_search", "args": {"query": "Claude 4.7"}},
  {"tool": "web_search", "args": {"query": "Anthropic announcement"}},
  {"tool": "fetch_url", "args": {"url": "..."}},
  {"tool": "send_email", "args": {"to": "ops@example.com", "subject": "Result", "body": "..."}}
]

kimi-k2.6

[
  {"tool": "web_search", "args": {"query": "Anthropic Claude 4.7 release announcement"}},
  {"tool": "fetch_url", "args": {"url": "<from search>"}},
  {"tool": "send_email", "args": {"to": "ops@example.com", "subject": "Claude 4.7 tool count", "body": "<count>"}}
]

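This tests whether the model picks the right tools in the right order. The expected plan has four steps but only three tool calls: web_search → fetch_url → (count locally, no tool) → send_email. Penalties: extra tools, wrong order, a missing send_email, and any preamble.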
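The penalty list above maps naturally onto a small scoring function. What follows is a sketch under assumptions: `scoreToolPlan`, `plan`, and the penalty labels are hypothetical, and how penalties translate into a win, tie, or loss is not specified here, so the function only reports which penalties apply.

```javascript
// Hypothetical scorer for the criteria above; the penalty labels are illustrative.
const EXPECTED = ["web_search", "fetch_url", "send_email"];

// Returns a list of penalties; an empty list means a clean plan.
function scoreToolPlan(raw) {
  let calls;
  try {
    calls = JSON.parse(raw.trim()); // any preamble or trailing prose makes this throw
  } catch {
    return ["not a bare JSON array"];
  }
  if (!Array.isArray(calls)) return ["not a bare JSON array"];

  const tools = calls.map(call => call?.tool);
  const penalties = [];

  // Anything outside the expected set, or a repeated call, counts as extra.
  const hasExtra = tools.some((t, i) => !EXPECTED.includes(t) || tools.indexOf(t) !== i);
  if (hasExtra) penalties.push("extra tool call");

  // The expected tools that are present must first appear in the expected order.
  const positions = EXPECTED.map(t => tools.indexOf(t)).filter(i => i !== -1);
  if (positions.some((p, i) => i > 0 && p < positions[i - 1])) penalties.push("wrong order");

  if (!tools.includes("send_email")) penalties.push("missing send_email");
  return penalties;
}

// Builds a reply string from a list of tool names (arguments left empty).
const plan = names => JSON.stringify(names.map(tool => ({ tool, args: {} })));

console.log(scoreToolPlan(plan(["web_search", "fetch_url", "send_email"]))); // []
console.log(scoreToolPlan(plan(["web_search", "fetch_url", "run_shell", "send_email"]))); // [ 'extra tool call' ]
console.log(scoreToolPlan("Here's the plan: " + plan(["web_search", "send_email"]))); // [ 'not a bare JSON array' ]
```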

2026-05-02 · benchmark

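Compressing a 1,000-word post into 3 bullets: a frontier-model showdown

Prompt

Summarize this 1,000-word blog post about Anthropic's Claude 4.7 release in exactly 3 bullets, each ≤ 25 words. Return only the bullets, with no preamble.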

Model	Latency	Cost	Verdict
claude-opus-4-7	1240ms	$0.022	Win
gpt-5	980ms	$0.018	Tie
gemini-3-pro	1120ms	$0.014	Loss

Responses

claude-opus-4-7

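• Anthropic releases Claude 4.7; a 1M-token context finally reaches the Opus tier.
• Tool calling is more stable under high concurrency; prompt cache TTL stays at 5 minutes.
• Early users report p95 latency down about 12% on Sonnet/Haiku; the Opus improvement is claimed but unverified.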

gpt-5

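• Anthropic releases Claude 4.7; the headline: 1M context reaches the Opus tier.
• Anthropic claims p95 latency improved by about 12% across the whole model family.
• Tool-calling behavior is more stable in parallel and high-concurrency scenarios.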

gemini-3-pro

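• Anthropic released Claude 4.7 today, including a number of updates.
• The 1M context window is now supported on the Opus tier in addition to Sonnet and Haiku, which is a major change for long-context agentic workflows.
• According to Anthropic, latency dropped by roughly 12%, and tool-use behavior has also improved.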

Same prompt, same input. Latency is measured from the start of the API call to the last byte received. Cost is computed from official $/Mtok pricing (May 2026).
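For readers who want to reproduce the two columns, the measurement described above comes down to a wall-clock timer around the call and a per-million-token multiplication. The helpers below are a hypothetical sketch: `costUSD` and `timeCall` are illustrative names, the prices in the example are placeholders rather than any vendor's real rates, and `callModel` stands for whatever client function actually issues the request.

```javascript
// Hypothetical helper: cost in USD from token counts and $/Mtok prices.
function costUSD(inputTokens, outputTokens, inputPricePerMtok, outputPricePerMtok) {
  return (inputTokens / 1e6) * inputPricePerMtok + (outputTokens / 1e6) * outputPricePerMtok;
}

// Hypothetical helper: wall-clock time from the start of the call until
// the promise resolves. callModel must resolve only after the last byte is read.
async function timeCall(callModel) {
  const t0 = performance.now();
  const response = await callModel();
  return { response, latencyMs: Math.round(performance.now() - t0) };
}

// Placeholder prices ($10 in, $30 out per million tokens), not real rates.
console.log(costUSD(1500, 100, 10, 30).toFixed(4)); // 0.0180

// A fake model call that resolves after a short delay stands in for the API.
const fakeCall = () => new Promise(resolve => setTimeout(() => resolve("ok"), 50));
timeCall(fakeCall).then(({ response, latencyMs }) => {
  console.log(response, latencyMs + "ms");
});
```

With streaming APIs, the promise passed to `timeCall` should resolve only once the stream has been fully consumed; otherwise the figure is time to first byte, not time to last byte.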