2026-05-24 · Query · 13 models

DGX Spark (GB10) local model throughput: prefill and decode tok/s for 13 model/quantization/engine combinations

Prompt

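Standardized single-stream (batch size 1) inference on a single DGX Spark (GB10, 128 GB LPDDR5X unified memory, about 273 GB/s bandwidth, about 1 PFLOP FP4): 2,048 input tokens and 128 output tokens (ISL/OSL 2048/128). Each row is one combination of model, quantization, and inference engine. Prompt-processing throughput (prefill, 'pp') and token-generation throughput (decode, 'tg') are reported in tokens per second. The latency shown is a modeled time in milliseconds to generate 128 tokens at the published decode rate (128 / tg × 1000).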
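The modeled latency can be reproduced from any row's decode rate. The following is a minimal sketch of that arithmetic; the function name `modeled_decode_latency_ms` is hypothetical and chosen only for illustration, and the model covers decode time only (it ignores prefill).

```python
def modeled_decode_latency_ms(tg_tok_per_s: float, output_tokens: int = 128) -> float:
    """Modeled time in ms to generate output_tokens at a steady decode rate."""
    if tg_tok_per_s <= 0:
        raise ValueError("decode rate must be positive")
    return output_tokens / tg_tok_per_s * 1000.0


# Decode rates taken from two rows of the results list.
for name, tg in [("GPT-OSS-20B MXFP4", 82.74), ("Llama 3.1 70B FP8", 2.7)]:
    print(f"{name}: {modeled_decode_latency_ms(tg):.0f} ms")
# GPT-OSS-20B MXFP4: 1547 ms
# Llama 3.1 70B FP8: 47407 ms
```

Because the figure is derived from the published rate rather than measured, it should be read as a comparison aid, not as an observed wall-clock time.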

Notes

Single DGX Spark GB10 (128 GB LPDDR5X, 273 GB/s). 'pp' = prompt processing / prefill tok/s; 'tg' = token generation / decode tok/s. Verdict tiers are based on single-stream decode: win = 30+ tok/s (comfortably interactive), tie = 10-30 (usable), loss = under 10 (impractical). Each row carries a source tier: NVIDIA-official = developer.nvidia.com 'How DGX Spark Performance Enables Intensive AI Tasks' (ISL/OSL 2048/128, BS=1); community = NVIDIA developer forums / llama.cpp issues / SGLang measurements.

Key points:
(1) Decode is memory-bandwidth-bound: tg tok/s is capped at roughly 273 GB/s ÷ active parameter bytes read per token, so MoE (A3B) and low-bit quantization raise it.
(2) Prefill is compute-bound on the Blackwell FP4 cores: usually thousands of tok/s regardless of model size.
(3) Quantization format matters: NVFP4/MXFP4 decode is nearly 2x that of FP8 (Llama 3.1 8B: 38.65 NVFP4 vs 20.5 FP8, about 1.9x).
(4) Speculative MTP roughly doubles single-stream decode (Qwen3.6-27B: 13.1 → 28.3), but the gain drops under concurrent load.
(5) Dense 70B at FP8 barely fits in 128 GB and thrashes (about 2.7 tg); avoid it on a single unit.
(6) The 235B model needs two Sparks linked over ConnectX-7.

Compiled from public benchmarks; all rows are single-unit unless marked DUAL.
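A minimal sketch of the two rules of thumb in these notes: the bandwidth-bound decode ceiling (memory bandwidth divided by the weight bytes read per token) and the win/tie/loss tiers. The function names are hypothetical, and the example assumes Gemma 4 26B-A4B activates about 4B parameters at 2 bytes each (F16). That assumption is inferred from the "A4B" label and is not stated in the source.

```python
BANDWIDTH_GB_S = 273.0  # GB10 memory bandwidth quoted in the notes


def decode_ceiling_tok_s(active_params_billion: float, bytes_per_param: float,
                         bandwidth_gb_s: float = BANDWIDTH_GB_S) -> float:
    """Upper bound on decode tok/s if every active weight is read once per token."""
    gb_per_token = active_params_billion * bytes_per_param  # billions of params * bytes = GB
    return bandwidth_gb_s / gb_per_token


def verdict(tg_tok_s: float) -> str:
    """Map a single-stream decode rate onto the tiers defined in the notes."""
    if tg_tok_s >= 30:
        return "WIN"
    if tg_tok_s >= 10:
        return "TIE"
    return "LOSS"


# Assumption: ~4B active parameters at 2 bytes each (F16) for Gemma 4 26B-A4B.
ceiling = decode_ceiling_tok_s(4.0, 2.0)
print(f"ceiling ~{ceiling:.0f} tok/s, measured 26.5 -> {verdict(26.5)}")
# ceiling ~34 tok/s, measured 26.5 -> TIE
```

The estimate ignores KV-cache reads, activations, and kernel overhead, so measured rates should be expected to land below it.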

Results — 13 models

GPT-OSS-20B · MXFP4 · llama.cpp WIN · 1547ms · in 2048 · out 128

3670.42 pp / 82.74 tg tok/s · llama.cpp · NVIDIA-official

Qwen3.5-35B-A3B · MXFP4 · llama.cpp WIN · 2207ms · out 128

prefill n/p / ~58 tg tok/s · llama.cpp · community (MoE A3B; theoretical ceiling ~91)

GPT-OSS-120B · MXFP4 · llama.cpp WIN · 2312ms · in 2048 · out 128

1725.47 pp / 55.37 tg tok/s · llama.cpp · NVIDIA-official (canonical official 120B decode; engine spread 35 llama.cpp deep-ctx → 41 Ollama → ~50 SGLang)

Qwen2.5-VL-7B · NVFP4 · TRT-LLM (vision) WIN · 3069ms · in 2048 · out 128

65831.77 pp / 41.71 tg tok/s · TRT-LLM · NVIDIA-official

Llama 3.1 8B · NVFP4 · TRT-LLM WIN · 3312ms · in 2048 · out 128

10256.9 pp / 38.65 tg tok/s · TRT-LLM · NVIDIA-official

Qwen3-Coder-30B-A3B · Q8_0 · llama.cpp WIN · 4129ms · out 128

1308 pp / 31 tg tok/s · llama.cpp · community (llama.cpp #16578; MoE A3B)

Qwen3.6-27B · Q4_K_M +MTP · llama.cpp TIE · 4523ms · out 128

719 pp / 28.3 tg tok/s · llama.cpp +MTP (5 draft) · community (2.16x decode vs no-MTP)

Gemma 4 26B-A4B · F16 · llama.cpp TIE · 4830ms · out 128

prefill n/p / ~26.5 tg tok/s · llama.cpp · community (MoE A4B; theoretical ~34)

Qwen3-14B · NVFP4 · TRT-LLM TIE · 5637ms · in 2048 · out 128

5928.95 pp / 22.71 tg tok/s · TRT-LLM · NVIDIA-official

Llama 3.1 8B · FP8 · SGLang TIE · 6244ms · out 128

7991 pp / 20.5 tg tok/s · SGLang · community (FP8 decode ~half of NVFP4 — same model)

Qwen3.6-27B · Q4_K_M · llama.cpp TIE · 9771ms · out 128

1084 pp / 13.1 tg tok/s · llama.cpp · community (single-stream, no spec-decode)

Llama 3.1 70B · FP8 · SGLang LOSS · 47407ms · out 128

~803 pp / ~2.7 tg tok/s · SGLang · community (barely fits 128 GB; KV+weights thrash — avoid dense 70B FP8 on one unit)

Qwen3-235B · NVFP4 · TRT-LLM (DUAL Spark) · 10912ms · in 2048 · out 128

23477.03 pp / 11.73 tg tok/s · TRT-LLM · NVIDIA-official · DUAL DGX Spark over ConnectX-7 (does not fit one unit at usable quant)
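The paired rows above (same model, one variable changed) can be compared directly. The following is a small sketch of that ratio arithmetic, using only decode rates listed in the results; the `speedup` helper and the `PAIRS` table are hypothetical names introduced for illustration.

```python
def speedup(faster_tg: float, baseline_tg: float) -> float:
    """Ratio of two decode rates in tok/s."""
    return faster_tg / baseline_tg


# (label, faster tg, baseline tg), all values copied from the results list.
PAIRS = [
    ("Qwen3.6-27B Q4_K_M: +MTP vs no MTP", 28.3, 13.1),
    ("Llama 3.1 8B: NVFP4 (TRT-LLM) vs FP8 (SGLang)", 38.65, 20.5),
]

for label, fast, base in PAIRS:
    print(f"{label}: {speedup(fast, base):.2f}x")
# Qwen3.6-27B Q4_K_M: +MTP vs no MTP: 2.16x
# Llama 3.1 8B: NVFP4 (TRT-LLM) vs FP8 (SGLang): 1.89x
```

The Llama 3.1 8B pair differs in both quantization format and engine, so its ratio mixes the two effects and should not be attributed to the format alone.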