2026-05-24 · 13 models

DGX Spark (GB10) local model throughput: prefill and decode tok/s for 13 model/quantization/engine combinations

Prompt

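Standardized single-stream (batch size 1) inference on a single DGX Spark (GB10, 128 GB LPDDR5X unified memory, about 273 GB/s bandwidth, about 1 PFLOP FP4): 2,048 input tokens and 128 output tokens (ISL/OSL 2048/128). Each row is one combination of model, quantization, and inference engine. We report prompt-processing throughput (prefill, 'pp') and token-generation throughput (decode, 'tg') in tokens per second. The latency shown is the modeled time, in milliseconds, to generate 128 tokens at the published decode rate (128 / tg × 1000).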
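The latency figure on each result row can be reproduced from the decode rate alone. The snippet below is a minimal sketch of that arithmetic; the function name `modeled_latency_ms` is hypothetical, and rounding to the nearest millisecond is an assumption about how the displayed values were produced.

```python
def modeled_latency_ms(tg_tok_per_s: float, output_tokens: int = 128) -> int:
    """Modeled time to emit `output_tokens` at a steady decode rate, in ms.

    Implements 128 / tg * 1000 from the prompt text. Rounding to the nearest
    millisecond is an assumption, not something the source states.
    """
    if tg_tok_per_s <= 0:
        raise ValueError("decode rate must be positive")
    return round(output_tokens / tg_tok_per_s * 1000)


# Decode rates taken from the result rows.
print(modeled_latency_ms(82.74))  # GPT-OSS-20B, MXFP4
print(modeled_latency_ms(13.1))   # Qwen3.6-27B, Q4_K_M, no MTP
```

Because the formula uses only the decode rate, the figure covers generation of the 128 output tokens and does not include the time spent on prefill.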

Notes

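Single DGX Spark GB10 (128 GB LPDDR5X, 273 GB/s). 'pp' = prompt processing / prefill tok/s; 'tg' = token generation / decode tok/s. Verdict tiers are based on single-stream decode: win = 30+ tok/s (smooth interactive use), tie = 10-30 (usable), loss = below 10 (impractical). Each row is labeled with a source tier: NVIDIA-official = developer.nvidia.com "How DGX Spark Performance Enables Intensive AI Tasks" (ISL/OSL 2048/128, BS=1); community = NVIDIA developer forums / llama.cpp issues / SGLang measurements.

Key points:
(1) Decode is memory-bandwidth bound: tg tok/s is roughly 273 GB/s ÷ the bytes of active parameters read per token, so MoE (A3B) and lower-bit quantization both raise it.
(2) Prefill is compute-bound on the Blackwell FP4 cores: it typically reaches thousands of tok/s regardless of model size.
(3) Quantization format matters: NVFP4/MXFP4 decode is roughly 2x FP8 (Llama 3.1 8B: 38.65 NVFP4 vs 20.5 FP8; note that the two rows also use different engines, TRT-LLM and SGLang).
(4) Speculative MTP roughly doubles single-stream decode (Qwen3.6-27B: 13.1 → 28.3) but regresses under concurrency.
(5) Dense 70B at FP8 barely fits in 128 GB and thrashes (about 2.7 tg); avoid it on a single unit.
(6) 235B requires two Sparks connected over ConnectX-7.

Compiled from published benchmarks; all figures are for a single unit unless marked DUAL.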
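The bandwidth-bound decode estimate and the verdict tiers from these notes can be sketched as two small functions. This is an illustrative sketch, not the method used to produce the table: the function names are hypothetical, the ceiling assumes every active weight is read exactly once per token and ignores KV-cache and activation traffic, and a value of exactly 30 tok/s is treated as a win, which the "10-30" / "30+" wording leaves ambiguous.

```python
BANDWIDTH_GB_S = 273.0  # memory bandwidth stated in the notes


def decode_ceiling_tok_s(active_params_billion: float,
                         bytes_per_param: float,
                         bandwidth_gb_s: float = BANDWIDTH_GB_S) -> float:
    """Rough upper bound on decode tok/s for a bandwidth-bound model.

    Assumes each active parameter is read once per generated token and
    ignores KV-cache and activation traffic (a simplifying assumption).
    """
    active_gb_per_token = active_params_billion * bytes_per_param
    if active_gb_per_token <= 0:
        raise ValueError("active bytes per token must be positive")
    return bandwidth_gb_s / active_gb_per_token


def verdict(tg_tok_s: float) -> str:
    """Map single-stream decode tok/s to the win / tie / loss tiers."""
    if tg_tok_s >= 30:  # assumption: exactly 30 counts as a win
        return "win"
    if tg_tok_s >= 10:
        return "tie"
    return "loss"


# Gemma 4 26B-A4B at F16, assuming "A4B" means about 4B active parameters
# at 2 bytes each, so about 8 GB read per token.
print(round(decode_ceiling_tok_s(4, 2), 1))  # 34.1
print(verdict(26.5))                         # tie
```

Under those assumptions the Gemma 4 26B-A4B F16 estimate lands near the "theoretical ~34" quoted on its result row, against a measured rate of about 26.5 tok/s. Measured rates sit below the ceiling because real decode also moves KV cache and activations through memory.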

Results — 13 models

GPT-OSS-20B · MXFP4 · llama.cpp WIN · 1547ms · in 2048 · out 128

3670.42 pp / 82.74 tg tok/s · llama.cpp · NVIDIA-official

Qwen3.5-35B-A3B · MXFP4 · llama.cpp WIN · 2207ms · out 128

prefill n/p / ~58 tg tok/s · llama.cpp · community (MoE A3B; theoretical ceiling ~91)

GPT-OSS-120B · MXFP4 · llama.cpp WIN · 2312ms · in 2048 · out 128

1725.47 pp / 55.37 tg tok/s · llama.cpp · NVIDIA-official (canonical official 120B decode; engine spread 35 llama.cpp deep-ctx → 41 Ollama → ~50 SGLang)

Qwen2.5-VL-7B · NVFP4 · TRT-LLM (vision) WIN · 3069ms · in 2048 · out 128

65831.77 pp / 41.71 tg tok/s · TRT-LLM · NVIDIA-official

Llama 3.1 8B · NVFP4 · TRT-LLM WIN · 3312ms · in 2048 · out 128

10256.9 pp / 38.65 tg tok/s · TRT-LLM · NVIDIA-official

Qwen3-Coder-30B-A3B · Q8_0 · llama.cpp WIN · 4129ms · out 128

1308 pp / 31 tg tok/s · llama.cpp · community (llama.cpp #16578; MoE A3B)

Qwen3.6-27B · Q4_K_M +MTP · llama.cpp TIE · 4523ms · out 128

719 pp / 28.3 tg tok/s · llama.cpp +MTP (5 draft) · community (2.16x decode vs no-MTP)

Gemma 4 26B-A4B · F16 · llama.cpp TIE · 4830ms · out 128

prefill n/p / ~26.5 tg tok/s · llama.cpp · community (MoE A4B; theoretical ~34)

Qwen3-14B · NVFP4 · TRT-LLM TIE · 5637ms · in 2048 · out 128

5928.95 pp / 22.71 tg tok/s · TRT-LLM · NVIDIA-official

Llama 3.1 8B · FP8 · SGLang TIE · 6244ms · out 128

7991 pp / 20.5 tg tok/s · SGLang · community (FP8 decode ~half of NVFP4 — same model)

Qwen3.6-27B · Q4_K_M · llama.cpp TIE · 9771ms · out 128

1084 pp / 13.1 tg tok/s · llama.cpp · community (single-stream, no spec-decode)

Llama 3.1 70B · FP8 · SGLang LOSS · 47407ms · out 128

~803 pp / ~2.7 tg tok/s · SGLang · community (barely fits 128 GB; KV+weights thrash — avoid dense 70B FP8 on one unit)

Qwen3-235B · NVFP4 · TRT-LLM (DUAL Spark) · 10912ms · in 2048 · out 128

23477.03 pp / 11.73 tg tok/s · TRT-LLM · NVIDIA-official · DUAL DGX Spark over ConnectX-7 (does not fit one unit at usable quant)