2026-06-08

Why DGX Spark's GB10 reads as 23 tok/s decode but 1,884 tok/s prefill: a bandwidth-budget breakdown

A verified breakdown of vLLM's June 2026 DGX Spark deployment: a 120B NVFP4 MoE model decodes at ~23 tok/s but prefills at ~1,884 tok/s, and the GB10's 273 GB/s memory bandwidth explains the gap.

What shipped

On June 1, 2026 the vLLM project published a deep-dive on running large language models on a single NVIDIA DGX Spark, the GB10 Grace Blackwell desktop box, and it did something most “first impressions” posts skip: it reported the full latency-throughput shape of one realistic deployment instead of a single hero number. The model under test was Nemotron-3-Super-120B-A12B-NVFP4, a 120B-parameter mixture-of-experts (MoE) checkpoint quantized to NVFP4 with roughly 12B parameters active per token (the A12B in the name), served at a 131,072-token max context inside the box’s 128 GB unified CPU+GPU memory pool.

That single configuration is a clean teaching case for how local Blackwell inference actually behaves, because it separates the two phases of an inference request — prefill (reading your prompt) and decode (writing the answer one token at a time) — and shows they live on completely different performance curves.

The numbers, and which phase they belong to

Here are the figures the vLLM post reported for that deployment, plus the GB10 hardware ceiling they sit under (from The Register’s detailing of the chip). A community concurrency study, covered further down, fills in the batched picture.

Metric	Figure	Source
Decode throughput (single stream)	22.7–23.7 tok/s	vLLM, Nemotron-3-Super-120B-A12B-NVFP4
Prefill, short judge call (58-token prompt)	~140 tok/s, TTFT 0.42 s	vLLM
Prefill, medium prompt (1,834 tokens)	1,636 tok/s, TTFT 1.12 s	vLLM
Prefill, long prompt (7,234 tokens)	~1,884 tok/s, TTFT 3.85 s	vLLM
Max context tested	131,072 tokens	vLLM
KV-cache utilization under demo traffic	below 30%	vLLM
GB10 memory bandwidth	273–301 GB/s	The Register (GB10 detailing)
GB10 memory	128 GB LPDDR5x, 256-bit bus, 9,400 MT/s	The Register
GB10 peak FP4 compute	~1 petaFLOP	The Register

Read the table top to bottom and the story writes itself. Prefill runs at 1,600-1,900 tok/s on medium and long prompts, over 70x the decode rate. Decode crawls along at ~23 tok/s no matter how long the prompt was. The vLLM authors note decode “is still shaped by the active parameter count” and stays flat across prompt sizes. That flatness is the tell.

The mechanism: decode is bandwidth-bound, prefill is compute-bound

Autoregressive decode generates one token per forward pass. For each token, the engine must stream the model’s active weights out of memory and into the Tensor Cores. With ~12B active NVFP4 parameters at ~0.5 bytes each, that is roughly 6 GB of weight reads per token (plus a small KV-cache read). At ~273 GB/s, the low end of the GB10’s quoted bandwidth range, 6 GB takes about 22 ms, which caps you near 45 tok/s in theory and lands at ~23 tok/s once you add KV reads, MoE routing, and framework overhead. The ~1 PFLOP of FP4 compute is nearly idle during decode: the Tensor Cores spend most of their time waiting for memory. This is why decode speed barely moves whether your prompt was 58 tokens or 7,234: the per-token weight read is identical.
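The arithmetic above can be reproduced in a few lines. This is a back-of-envelope sketch, not a profiler: it assumes exactly 12B active parameters at 0.5 bytes each, ignores NVFP4 scale factors, KV-cache reads, and routing overhead, and the function name `decode_ceiling_tok_s` is hypothetical.

```python
def decode_ceiling_tok_s(active_params: float, bytes_per_param: float,
                         bandwidth_bytes_s: float) -> float:
    """Upper bound on single-stream decode rate if weight reads were the only cost."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_s / bytes_per_token


ACTIVE_PARAMS = 12e9      # assumption: ~12B active parameters per token
BYTES_PER_PARAM = 0.5     # assumption: 4-bit weights, scale factors ignored
BANDWIDTH = 273e9         # low end of the quoted GB10 range, in bytes/s
MEASURED_TOK_S = 23.0     # representative value inside the reported 22.7-23.7 tok/s

ceiling = decode_ceiling_tok_s(ACTIVE_PARAMS, BYTES_PER_PARAM, BANDWIDTH)
ms_per_token = 1000.0 / ceiling
efficiency = MEASURED_TOK_S / ceiling

print(f"weight bytes per token: {ACTIVE_PARAMS * BYTES_PER_PARAM / 1e9:.1f} GB")
print(f"time per token at the ceiling: {ms_per_token:.1f} ms")
print(f"ceiling: {ceiling:.1f} tok/s, measured/ceiling: {efficiency:.0%}")
```

Under these assumptions the measured rate lands at roughly half the theoretical ceiling, which is consistent with the extra costs listed above.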

Prefill is the opposite. It processes every prompt token in parallel as one big matrix multiply, so it saturates the FP4 Tensor Cores instead of the memory bus. That is why prefill hits four-figure tok/s and why TTFT scales with prompt length while decode does not.
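The prefill rates in the table follow from prompt length divided by TTFT, which gives a quick consistency check on the reported figures. The sketch below uses only numbers from the table; treating TTFT as pure prefill time is a simplifying assumption, and `prefill_rate` is a hypothetical helper name.

```python
# (prompt tokens, TTFT in seconds, reported prefill tok/s) from the table above
ROWS = [
    (58, 0.42, 140.0),
    (1834, 1.12, 1636.0),
    (7234, 3.85, 1884.0),
]
DECODE_TOK_S = 23.0  # assumption: representative value inside 22.7-23.7 tok/s


def prefill_rate(prompt_tokens: int, ttft_s: float) -> float:
    """Prompt tokens per second, treating TTFT as pure prefill time (an assumption)."""
    return prompt_tokens / ttft_s


for tokens, ttft, reported in ROWS:
    derived = prefill_rate(tokens, ttft)
    ratio = derived / DECODE_TOK_S
    print(f"{tokens:>5} tokens: derived {derived:7.0f} tok/s "
          f"(reported ~{reported:.0f}), {ratio:4.0f}x decode")
```

The derived rates agree with the reported ones to within about 2%, and the short 58-token call shows why tiny prompts never reach four-figure prefill: there is too little work to fill the Tensor Cores.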

The MoE design is what makes a 120B model usable here at all. A dense 120B model in NVFP4 would need ~60 GB of weight reads per token and decode in low single digits. By activating only ~12B parameters per token, the MoE cuts the per-token bandwidth bill roughly 10x (about 6 GB instead of 60 GB), trading abundant unified memory capacity (you store all 120B params) for scarce memory bandwidth (you only read 12B per step). That is the central design move of local Blackwell inference: capacity is cheap, bandwidth is the budget.
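The capacity-versus-bandwidth trade can be put side by side with the same kind of estimate. This is a rough sketch under stated assumptions: 0.5 bytes per parameter, 273 GB/s, weight reads as the only per-token cost, and hypothetical helper names.

```python
BANDWIDTH_GB_S = 273.0   # low end of the quoted GB10 range
BYTES_PER_PARAM = 0.5    # assumption: 4-bit weights, scale factors ignored
UNIFIED_MEMORY_GB = 128.0

TOTAL_PARAMS = 120e9     # parameters stored in memory
ACTIVE_PARAMS = 12e9     # assumption: parameters read per decoded token


def weights_gb(params: float) -> float:
    """Size of a parameter set in GB at the assumed bytes per parameter."""
    return params * BYTES_PER_PARAM / 1e9


def ceiling_tok_s(read_gb_per_token: float) -> float:
    """Bandwidth-only decode ceiling for a given per-token weight read."""
    return BANDWIDTH_GB_S / read_gb_per_token


stored_gb = weights_gb(TOTAL_PARAMS)      # what capacity has to hold
dense_read_gb = weights_gb(TOTAL_PARAMS)  # a dense model reads everything per token
moe_read_gb = weights_gb(ACTIVE_PARAMS)   # the MoE reads only the active experts

print(f"stored weights: {stored_gb:.0f} GB of {UNIFIED_MEMORY_GB:.0f} GB")
print(f"dense ceiling:  {ceiling_tok_s(dense_read_gb):.1f} tok/s")
print(f"MoE ceiling:    {ceiling_tok_s(moe_read_gb):.1f} tok/s")
print(f"bandwidth saving: {dense_read_gb / moe_read_gb:.0f}x")
```

Both variants occupy the same memory; only the per-token read changes, which is the whole point of the design.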

Batching changes the calculus

Single-stream decode looks slow, but a separate concurrency benchmark (Dendro Logic, April 22, 2026) shows what happens when you batch. On a single DGX Spark, Nemotron Super 49B v1.5 NVFP4 went from 5.79 tok/s aggregate at one stream to 161.90 tok/s at 32 streams and 695.11 tok/s at 256 streams — a 120x aggregate gain — while per-sequence speed fell from 5.79 to 2.85 tok/s. OpenAI’s gpt-oss 120B in MXFP4 showed the same shape: 33.53 tok/s single-stream rising to 862.84 tok/s aggregate at 256 streams. The post’s framing is exact: “Memory bandwidth is a budget you spend… when you run two streams simultaneously, you spend the same bandwidth budget reading the same weights, and both streams get the result.” You load the weights once and amortize that read across the whole batch, so aggregate throughput climbs until KV-cache memory or compute finally saturates.
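The scaling in those figures is easy to tabulate. The sketch below uses only the aggregate numbers quoted above; the per-stream column is a naive aggregate-divided-by-streams average, which need not match the study's own per-sequence measurement, and `scaling` is a hypothetical helper.

```python
# Reported aggregate throughput (tok/s) keyed by concurrent streams
NEMOTRON_49B = {1: 5.79, 32: 161.90, 256: 695.11}
GPT_OSS_120B = {1: 33.53, 256: 862.84}


def scaling(points: dict) -> list:
    """Per concurrency level: (streams, aggregate gain, naive per-stream rate, efficiency)."""
    base = points[1]  # single-stream throughput is the reference
    rows = []
    for streams in sorted(points):
        aggregate = points[streams]
        gain = aggregate / base
        rows.append((streams, gain, aggregate / streams, gain / streams))
    return rows


for name, points in [("Nemotron Super 49B", NEMOTRON_49B), ("gpt-oss 120B", GPT_OSS_120B)]:
    print(name)
    for streams, gain, per_stream, eff in scaling(points):
        print(f"  {streams:>3} streams: {gain:6.1f}x aggregate, "
              f"{per_stream:5.2f} tok/s per stream, {eff:.0%} of linear")
```

Scaling efficiency stays high at 32 streams and drops well below linear by 256, which is the point where something other than the weight read is becoming the limit.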

The software stack has been the other lever. NVIDIA reported on January 5, 2026 that NVFP4 plus speculative decoding delivered up to a 2.6x gain over FP8 on Qwen-235B, that NVFP4 cuts memory use ~40%, and that llama.cpp updates added an average 35% uplift on MoE models — all on the same silicon.

Practitioner note

If I were deploying an assistant or an agent loop on one of these boxes, I would stop quoting single-stream decode as “the speed” and instead design around the two curves. For an interactive single user, ~23 tok/s is fine for chat but painful for long generations, so I would lean on speculative decoding and keep outputs short. For anything serving more than one caller (a small team, a batch summarization job, an LLM-judge pipeline scoring many candidates) I would run concurrency 16-32 and treat the box as a throughput engine, because that is where the bandwidth budget actually pays off (about 162 tok/s aggregate at 32 streams in the data above, up from 5.79 at one). I would default to NVFP4 MoE checkpoints, size the model so active parameters fit comfortably in the bandwidth budget rather than maxing total parameters, and I would benchmark prefill and decode separately on my own prompt-length distribution before trusting anyone’s headline tok/s, including mine.
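Benchmarking the two phases separately mostly comes down to timestamping the first token apart from the rest. A minimal sketch of that split follows, assuming you already have a request start time and per-token arrival times from whatever streaming client you use; `PhaseStats` and `split_phases` are hypothetical names, and the timestamps here are synthetic rather than measured.

```python
from dataclasses import dataclass


@dataclass
class PhaseStats:
    ttft_s: float         # time to first token
    prefill_tok_s: float  # prompt tokens / TTFT
    decode_tok_s: float   # tokens after the first / time after the first


def split_phases(request_start: float, token_times: list,
                 prompt_tokens: int) -> PhaseStats:
    """Separate prefill from decode using per-token arrival timestamps."""
    if len(token_times) < 2:
        raise ValueError("need at least two output tokens to measure decode")
    ttft = token_times[0] - request_start
    decode_window = token_times[-1] - token_times[0]
    decode_rate = (len(token_times) - 1) / decode_window
    return PhaseStats(ttft, prompt_tokens / ttft, decode_rate)


# Synthetic trace shaped like the long-prompt row: 3.85 s to the first token,
# then one token every 1/23 s. These are illustrative values, not a measurement.
trace = [3.85 + i / 23 for i in range(47)]
stats = split_phases(0.0, trace, prompt_tokens=7234)
print(f"TTFT {stats.ttft_s:.2f} s, prefill {stats.prefill_tok_s:.0f} tok/s, "
      f"decode {stats.decode_tok_s:.1f} tok/s")
```

A single end-to-end tok/s number would blend the two rates in proportion to prompt and output length, which is why it transfers so poorly between workloads.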

Under-considered angle

Everyone benchmarks decode tok/s; almost nobody budgets prefill against agent loops. An agentic workflow that re-reads a growing 7K-token scratchpad on every step pays the ~3.85 s TTFT each turn, and across a 20-step loop that prefill cost can dwarf the actual generation. On a bandwidth-constrained box, prompt-caching and KV reuse are not a nice-to-have optimization — they are the difference between a usable local agent and one that spends most of its wall-clock re-reading its own context. The unified-memory design that makes 120B models loadable is the same design that makes wasted prefill expensive, because there is no spare bandwidth to hide it behind.
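A rough budget makes the point concrete. The sketch below assumes a 20-step loop starting from the 7,234-token prompt, 200 tokens appended to the scratchpad per step, 50 generated tokens per step, and a constant prefill rate equal to the long-prompt figure; all of those workload numbers are illustrative assumptions, and the constant-rate assumption flatters the cached case because short prefills run slower in the table above.

```python
PREFILL_TOK_S = 1884.0  # long-prompt prefill rate from the table
DECODE_TOK_S = 23.0     # representative single-stream decode rate


def prefill_seconds(steps: int, base_tokens: int, added_per_step: int,
                    reuse_kv: bool) -> float:
    """Total prefill time across an agent loop whose context grows every step."""
    total_tokens = 0
    for step in range(steps):
        context = base_tokens + step * added_per_step
        if reuse_kv and step > 0:
            total_tokens += added_per_step  # only the newly appended tokens
        else:
            total_tokens += context         # re-read the whole context
    return total_tokens / PREFILL_TOK_S


STEPS, BASE, ADDED, OUTPUT = 20, 7234, 200, 50  # assumptions for illustration

no_reuse = prefill_seconds(STEPS, BASE, ADDED, reuse_kv=False)
with_reuse = prefill_seconds(STEPS, BASE, ADDED, reuse_kv=True)
decode = STEPS * OUTPUT / DECODE_TOK_S

print(f"prefill without KV reuse: {no_reuse:6.1f} s")
print(f"prefill with KV reuse:    {with_reuse:6.1f} s")
print(f"decode for all steps:     {decode:6.1f} s")
```

With these inputs, re-reading the context costs about twice as much wall-clock time as generating the answers, while reusing the cache shrinks prefill to a few seconds.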