2026-06-07

vLLM's Official DGX Spark Guide: Why a 120B NVFP4 Model Decodes at ~23 tok/s, and What That Teaches About Bandwidth-Bound Local Inference

What shipped

On June 1, 2026 the vLLM project published an official walkthrough for running vLLM on NVIDIA’s DGX Spark, the desk-side GB10 Grace Blackwell box with 128 GB of unified LPDDR5X memory and the consumer-Blackwell sm_121 compute capability. Unlike most “I got it booting” forum posts, this one pairs a tested deployment recipe with a real local evaluation, so it doubles as a reference for how to think about single-box inference economics.

The headline workload is Nemotron-3-Super-120B-A12B-NVFP4: 120B total parameters, roughly 12B active per token in a mixture-of-experts (MoE) layout, quantized to NVFP4. That combination is the whole point of the exercise. A 120B dense model in BF16 would never fit, but an NVFP4 MoE with about 12B active parameters fits comfortably in 128 GB and only has to stream a fraction of its weights per token.
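The fit argument is simple arithmetic, sketched below. The 120B parameter count and 128 GB pool come from the post; the ~4.5 bits per weight for NVFP4 (4-bit values plus block scale factors) is an assumption for illustration, and a real checkpoint may keep some layers at higher precision.

```python
# Rough weight-footprint estimate. Ignores KV cache, activations, and runtime overhead.
GB = 1e9  # decimal gigabytes, matching the "128 GB" spec


def weight_gb(params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB for a parameter count and bit width."""
    return params * bits_per_param / 8 / GB


TOTAL_PARAMS = 120e9       # from the model name
UNIFIED_MEMORY_GB = 128    # DGX Spark unified memory

bf16_gb = weight_gb(TOTAL_PARAMS, 16)      # BF16 stores 16 bits per weight
nvfp4_gb = weight_gb(TOTAL_PARAMS, 4.5)    # ASSUMPTION: ~4.5 bits/weight incl. block scales

print(f"BF16 : {bf16_gb:.1f} GB, fits in {UNIFIED_MEMORY_GB} GB: {bf16_gb < UNIFIED_MEMORY_GB}")
print(f"NVFP4: {nvfp4_gb:.1f} GB, fits in {UNIFIED_MEMORY_GB} GB: {nvfp4_gb < UNIFIED_MEMORY_GB}")
```

Under these assumptions BF16 lands around 240 GB and NVFP4 around 68 GB, which is the gap between "does not load" and "loads with room for KV cache".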

The numbers, as reported

Served from the vllm/vllm-openai:cu130-nightly container (CUDA 13), with --gpu-memory-utilization 0.85, --max-model-len 131072, and --max-num-seqs 4, the post measured:

Metric	Reported range
Decode throughput	22.7-23.7 tok/s
Prefill throughput	~140 tok/s (58-tok prompt) to ~1,884 tok/s (7,234-tok prompt)
Time-to-first-token	0.42 s (short) to ~3.85 s (long)
KV-cache utilization	under 5% single-user; under 30% small-batch

Two things jump out. First, KV-cache occupancy is tiny, so at these loads the memory pressure comes from the weights, not the context. Second, prefill scales with prompt length while decode sits flat around 23 tok/s regardless. That flat decode line is the tell.
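As a sanity check on the table, the prefill figures are roughly what you get by dividing prompt length by time-to-first-token. The sketch below assumes the short prompt pairs with the 0.42 s TTFT and the long prompt with the ~3.85 s TTFT, and that TTFT is dominated by prefill; neither pairing is stated explicitly in the table.

```python
# (prompt tokens, TTFT in seconds, reported prefill tok/s) taken from the table.
# ASSUMPTION: short prompt <-> 0.42 s and long prompt <-> 3.85 s.
REPORTED = [
    (58, 0.42, 140.0),
    (7234, 3.85, 1884.0),
]


def implied_prefill_tps(prompt_tokens: int, ttft_s: float) -> float:
    """Prefill rate implied by TTFT, assuming TTFT is almost all prefill time."""
    return prompt_tokens / ttft_s


for tokens, ttft, reported in REPORTED:
    implied = implied_prefill_tps(tokens, ttft)
    print(f"{tokens:>5}-tok prompt: implied {implied:7.1f} tok/s, reported ~{reported:.0f} tok/s")
```

Both implied rates land within about 2% of the reported ones, so the table is internally consistent. The short-prompt rate is low because fixed per-request overhead dominates a 58-token prefill, not because the hardware is slower on short inputs.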

Why decode is flat: bandwidth, not compute

The structural lesson is in the memory bus. GB10 moves data at 273 GB/s over a 256-bit LPDDR5X interface (8,533 MT/s) — roughly 12x slower than an H100’s ~3.35 TB/s HBM3. Autoregressive decode reads the active weights once per token, so at low batch the tensor cores mostly sit idle waiting on memory. Independent teardowns put the same point bluntly: at 273 GB/s a 35 GB model tops out near 7.8 tok/s in theory no matter how much compute you throw at it.
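The ceiling in that argument is just bandwidth divided by bytes read per token. A minimal sketch follows. The 273 GB/s and 35 GB figures are from the text; the ~4.5 bits per weight and the idea that only the ~12B active parameters are read each token are simplifying assumptions that ignore KV-cache reads, routing, and any higher-precision layers.

```python
BANDWIDTH_GBPS = 273.0  # GB10 memory bandwidth in GB/s, as quoted above


def decode_ceiling_tps(gb_read_per_token: float, bandwidth_gbps: float = BANDWIDTH_GBPS) -> float:
    """Upper bound on single-stream decode rate if every token re-reads this many GB."""
    return bandwidth_gbps / gb_read_per_token


# The teardown example: a model that reads 35 GB of weights per token.
dense_35gb = decode_ceiling_tps(35.0)

# The MoE case. ASSUMPTION: ~12B active params at ~4.5 bits each are read per token.
ACTIVE_PARAMS = 12e9
BITS_PER_PARAM = 4.5
active_gb = ACTIVE_PARAMS * BITS_PER_PARAM / 8 / 1e9
moe_ceiling = decode_ceiling_tps(active_gb)

observed = 23.0  # approximate decode rate reported in the post
print(f"35 GB per token : ceiling {dense_35gb:.1f} tok/s")
print(f"{active_gb:.2f} GB per token: ceiling {moe_ceiling:.1f} tok/s, "
      f"observed is {observed / moe_ceiling:.0%} of that")
```

On these assumptions the MoE ceiling is about 40 tok/s, so the reported ~23 tok/s sits at a bit under 60% of the idealized bound, a plausible place for a real serving stack to land.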

This is also why NVFP4 matters more here than on a datacenter card. Cutting weights from 16 bits/param (BF16) to about 4.5 bits/param (4-bit values plus per-block scale factors) shrinks the bytes moved per token by roughly 3.5x, and on a bandwidth-bound machine decode throughput can rise by up to a similar factor. One comparison floating around the ecosystem shows the same Nemotron 3 Super at ~38 tok/s in an NVFP4 build versus ~19.5 tok/s under a Q4_K_M GGUF: same model, with the format and kernel path as the variable. (The vLLM post’s own ~23 tok/s reflects its specific 131K context config and container, so treat cross-source numbers as directional, not apples-to-apples.)

Practical framing

The post is refreshingly honest about the envelope: DGX Spark is “best viewed as a local single-user or small-batch inference target.” It deliberately runs --max-num-seqs 4; pushing concurrency higher costs per-request latency, because the machine is already starved on bandwidth, not compute. Multi-Spark setups over the ConnectX fabric are acknowledged but explicitly not evaluated in this writeup.
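For reference, the serving flags quoted earlier can be assembled into a launch command along these lines. This is a sketch, not the guide's exact recipe. The image tag and the three vLLM flags come from the post, while the model repo id is a placeholder, the Docker options are the usual ones for the vLLM OpenAI image, and the model-specific reasoning parser the guide mentions is left out because it is not named here.

```python
import shlex

IMAGE = "vllm/vllm-openai:cu130-nightly"  # image named in the post
# PLACEHOLDER: substitute the real Hugging Face repo id for the checkpoint.
MODEL_ID = "<org>/Nemotron-3-Super-120B-A12B-NVFP4"


def build_docker_cmd(model_id: str = MODEL_ID, port: int = 8000) -> list[str]:
    """Build a docker argv list; everything after the image goes to the vLLM server."""
    return [
        "docker", "run", "--rm",
        "--gpus", "all",          # expose the GPU to the container
        "--ipc=host",             # shared memory for the server's worker processes
        "-p", f"{port}:8000",     # the server listens on 8000 inside the container
        IMAGE,
        "--model", model_id,
        "--gpu-memory-utilization", "0.85",
        "--max-model-len", "131072",
        "--max-num-seqs", "4",
    ]


# Print a copy-pasteable shell line rather than launching anything.
print(shlex.join(build_docker_cmd()))
```

A real deployment would also mount a Hugging Face cache volume and pass credentials if the checkpoint is gated; those are omitted to keep the sketch to what the post specifies.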

The unified-memory architecture is the quiet enabler. Because CPU and GPU share one physical 128 GB pool, weights and KV cache stay resident without cudaMemcpy round-trips across PCIe — you lose raw bandwidth versus HBM but you skip the copy tax, which is what lets a 120B-class model live on a desktop at all.

Practitioner note: If you are speccing a single-box local server for a 100-130B NVFP4 MoE, size your expectations around 20-40 tok/s decode and a low max-num-seqs, not datacenter throughput. Validate the actual container/CUDA combo you will run (the official path here is a cu130-nightly image and a model-specific reasoning parser); decode rates published under different contexts, quant kernels, or serving stacks will not transfer cleanly, and initial weight loading alone can take 10-15 minutes.

Under-considered angle: Almost every DGX Spark benchmark headline is a single-stream decode number, which flatters MoE models — only ~10% of the parameters move per token. The metric that actually predicts whether the box is usable for a small team is decode-under-concurrency with a realistic KV footprint. At under 5% KV-cache use the system is barely exercising its weakest resource; the interesting (and largely unpublished) question is the throughput-vs-latency curve as you raise batch on a 273 GB/s bus, because that is where a desk-side unified-memory box either holds up for a few simultaneous users or collapses into a strictly one-person tool.
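One way to start filling in that curve yourself is a small load harness against the OpenAI-compatible endpoint. The sketch below is a hedged starting point, not a benchmark methodology. The URL assumes the default port, the model name is a placeholder, and latency is measured end to end (prefill plus decode), so it is a coarse proxy for the decode-under-concurrency number.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

URL = "http://localhost:8000/v1/completions"  # ASSUMPTION: server on the default port
MODEL = "<served-model-name>"                 # PLACEHOLDER: the name the server was started with


def send_completion(prompt: str, max_tokens: int = 128) -> int:
    """POST one completion request and return the number of generated tokens."""
    body = json.dumps({"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}).encode()
    req = urllib.request.Request(URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["usage"]["completion_tokens"]


def measure(concurrency: int,
            send: Callable[[str], int] = send_completion,
            prompt: str = "Explain memory bandwidth in one paragraph.") -> dict:
    """Fire `concurrency` simultaneous requests; report aggregate tok/s and mean latency."""
    def timed(_: int) -> tuple[int, float]:
        start = time.perf_counter()
        tokens = send(prompt)
        return tokens, time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed, range(concurrency)))
    wall = time.perf_counter() - wall_start

    return {
        "concurrency": concurrency,
        "aggregate_tps": sum(tokens for tokens, _ in results) / wall,
        "mean_latency_s": sum(elapsed for _, elapsed in results) / concurrency,
    }

# Against a live server, sweep a few levels, e.g.:
#   for n in (1, 2, 4): print(measure(n))
```

With --max-num-seqs 4, requests beyond four simply queue, so a sweep past that level measures scheduling delay rather than batching behavior. Passing a longer prompt is the simplest way to push KV-cache use above the single-digit percentages in the post.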