2026-05-24 · 13 models

DGX Spark (GB10) local-model throughput — prefill & decode tok/s across 13 model/quant/engine combos

Prompt

Standardized single-stream (batch size 1) inference on one DGX Spark (GB10, 128 GB LPDDR5X unified memory, ~273 GB/s bandwidth, ~1 PFLOP FP4) with a 2,048-token input and a 128-token output (ISL/OSL 2048/128). Each row is one combination of model, quantization, and inference engine. We report prompt-processing throughput (prefill, 'pp') and token-generation throughput (decode, 'tg') in tokens/sec. The latency shown is the modeled time in milliseconds to emit 128 tokens at the published decode rate (128 / tg × 1000); it excludes prefill time.
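The latency column can be reproduced from the decode rate alone. The following is a minimal Python sketch of the formula stated above; the function name `modeled_latency_ms` is illustrative and does not come from any benchmark tooling.

```python
def modeled_latency_ms(tg_tok_per_s: float, output_tokens: int = 128) -> float:
    """Modeled time in ms to emit `output_tokens` at a steady decode rate.

    Prefill time is ignored, matching the latency definition above.
    """
    if tg_tok_per_s <= 0:
        raise ValueError("decode rate must be positive")
    return output_tokens / tg_tok_per_s * 1000.0


# GPT-OSS-20B row: 82.74 tg tok/s
print(round(modeled_latency_ms(82.74)))  # 1547
```

Because the figure is derived from a steady-state rate, it is a modeled number and not a measured wall-clock latency.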

Notes

Single DGX Spark GB10 (128 GB LPDDR5X, 273 GB/s). 'pp' = prompt-processing (prefill) tok/s; 'tg' = token-generation (decode) tok/s; 'n/p' = not published.

Verdict tiers by single-stream decode: win = 30+ tok/s (snappy interactive), tie = 10 to under 30 tok/s (usable), loss = under 10 tok/s (impractical).

Source class is tagged per row: NVIDIA-official = developer.nvidia.com 'How DGX Spark Performance Enables Intensive AI Tasks' (ISL/OSL 2048/128, BS=1); community = NVIDIA dev forums / llama.cpp issues / SGLang hands-on.

Takeaways:

(1) Decode is memory-bandwidth-bound: the tg ceiling is roughly 273 GB/s divided by the bytes of active parameters read per token, so MoE models with few active parameters (A3B) and lower-bit quantization both raise it.

(2) Prefill is compute-bound on the Blackwell FP4 cores and runs far faster than decode: 9 of the 11 rows with a published figure exceed 1,000 tok/s. It still varies with model and format, from 719 tok/s (Qwen3.6-27B Q4_K_M with MTP) to 65,831.77 tok/s (Qwen2.5-VL-7B NVFP4).

(3) Quant format matters: 4-bit formats (NVFP4/MXFP4) decode at roughly twice the FP8 rate. On Llama 3.1 8B, NVFP4 reaches 38.65 tok/s vs 20.5 tok/s for FP8 (about 1.9x), though those two rows also differ in engine (TRT-LLM vs SGLang).

(4) Speculative decoding with MTP roughly doubles single-stream decode (Qwen3.6-27B: 13.1 to 28.3 tok/s, 2.16x) but regresses under concurrency.

(5) Dense 70B at FP8 barely fits in 128 GB and thrashes (~2.7 tg tok/s); avoid it on a single unit.

(6) The 235B model needs two Sparks linked over ConnectX-7.

Compiled from published benchmarks; all figures are single-unit unless flagged DUAL.
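The bandwidth-bound decode estimate and the verdict tiers in these notes can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: every active weight is read exactly once per token, and KV-cache traffic and kernel overhead are ignored. The names `decode_ceiling_tok_s` and `verdict` are hypothetical helpers, not part of any engine.

```python
SPARK_BANDWIDTH_GB_S = 273.0  # memory bandwidth quoted in the notes


def decode_ceiling_tok_s(active_params_billions: float, bytes_per_param: float) -> float:
    """Bandwidth-bound upper limit on decode tok/s.

    Assumes each active weight is read once per generated token and
    ignores KV-cache reads, activations, and kernel overhead.
    """
    gb_read_per_token = active_params_billions * bytes_per_param
    return SPARK_BANDWIDTH_GB_S / gb_read_per_token


def verdict(tg_tok_s: float) -> str:
    """Map single-stream decode tok/s onto the tiers defined in the notes."""
    if tg_tok_s >= 30:
        return "win"
    if tg_tok_s >= 10:
        return "tie"
    return "loss"


# Gemma 4 26B-A4B at F16: about 4B active parameters at 2 bytes each
print(round(decode_ceiling_tok_s(4, 2), 1))  # 34.1
print(verdict(26.5))  # tie
```

The 34.1 tok/s result lines up with the "theoretical ~34" quoted for the Gemma 4 26B-A4B row; measured rates fall below the ceiling because the estimate leaves out everything except weight reads.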

Results — 13 models

GPT-OSS-20B · MXFP4 · llama.cpp · WIN · 1547 ms · in 2048 · out 128

3670.42 pp / 82.74 tg tok/s · llama.cpp · NVIDIA-official

Qwen3.5-35B-A3B · MXFP4 · llama.cpp · WIN · 2207 ms · out 128

prefill n/p / ~58 tg tok/s · llama.cpp · community (MoE A3B; theoretical ceiling ~91)

GPT-OSS-120B · MXFP4 · llama.cpp · WIN · 2312 ms · in 2048 · out 128

1725.47 pp / 55.37 tg tok/s · llama.cpp · NVIDIA-official (canonical official 120B decode; engine spread 35 llama.cpp deep-ctx → 41 Ollama → ~50 SGLang)

Qwen2.5-VL-7B · NVFP4 · TRT-LLM (vision) · WIN · 3069 ms · in 2048 · out 128

65831.77 pp / 41.71 tg tok/s · TRT-LLM · NVIDIA-official

Llama 3.1 8B · NVFP4 · TRT-LLM · WIN · 3312 ms · in 2048 · out 128

10256.9 pp / 38.65 tg tok/s · TRT-LLM · NVIDIA-official

Qwen3-Coder-30B-A3B · Q8_0 · llama.cpp · WIN · 4129 ms · out 128

1308 pp / 31 tg tok/s · llama.cpp · community (llama.cpp #16578; MoE A3B)

Qwen3.6-27B · Q4_K_M +MTP · llama.cpp · TIE · 4523 ms · out 128

719 pp / 28.3 tg tok/s · llama.cpp +MTP (5 draft) · community (2.16x decode vs no-MTP)

Gemma 4 26B-A4B · F16 · llama.cpp · TIE · 4830 ms · out 128

prefill n/p / ~26.5 tg tok/s · llama.cpp · community (MoE A4B; theoretical ~34)

Qwen3-14B · NVFP4 · TRT-LLM · TIE · 5637 ms · in 2048 · out 128

5928.95 pp / 22.71 tg tok/s · TRT-LLM · NVIDIA-official

Llama 3.1 8B · FP8 · SGLang · TIE · 6244 ms · out 128

7991 pp / 20.5 tg tok/s · SGLang · community (FP8 decode ~half of NVFP4 — same model)

Qwen3.6-27B · Q4_K_M · llama.cpp · TIE · 9771 ms · out 128

1084 pp / 13.1 tg tok/s · llama.cpp · community (single-stream, no spec-decode)

Llama 3.1 70B · FP8 · SGLang · LOSS · 47407 ms · out 128

~803 pp / ~2.7 tg tok/s · SGLang · community (barely fits 128 GB; KV+weights thrash — avoid dense 70B FP8 on one unit)

Qwen3-235B · NVFP4 · TRT-LLM (DUAL Spark) · 10912 ms · in 2048 · out 128

23477.03 pp / 11.73 tg tok/s · TRT-LLM · NVIDIA-official · DUAL DGX Spark over ConnectX-7 (does not fit one unit at usable quant)
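The speedup ratios quoted in the takeaways can be rechecked directly from the decode rates in the rows above. A small Python sketch follows; the dictionary keys and the `speedup` helper are illustrative names chosen here, and the values are copied from the results.

```python
# Decode rates (tg tok/s) copied from the result rows above.
rows = {
    "llama31_8b_nvfp4": 38.65,  # TRT-LLM
    "llama31_8b_fp8": 20.5,     # SGLang
    "qwen36_27b_mtp": 28.3,     # llama.cpp with MTP
    "qwen36_27b_base": 13.1,    # llama.cpp without MTP
}


def speedup(fast: float, slow: float) -> float:
    """Ratio of two decode rates."""
    return fast / slow


quant_gain = speedup(rows["llama31_8b_nvfp4"], rows["llama31_8b_fp8"])
mtp_gain = speedup(rows["qwen36_27b_mtp"], rows["qwen36_27b_base"])

print(f"NVFP4 vs FP8: {quant_gain:.2f}x")   # 1.89x
print(f"MTP vs no MTP: {mtp_gain:.2f}x")    # 2.16x
```

The NVFP4 and FP8 rows for Llama 3.1 8B were measured on different engines, so the first ratio mixes engine and quantization effects; the MTP comparison uses the same engine and quantization.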