2026-05-29 — views

A Triton FP8 bypass squeezes 17% more NVFP4 speed out of the DGX Spark's GB10

A community patch reroutes NVFP4 weights through the GB10's FP8 tensor cores instead of a slow BF16 fallback, lifting Qwen3.6-35B-A3B from 40.8 to 47.6 tok/s on the DGX Spark.

NVIDIA markets the DGX Spark as a Blackwell box that natively speaks FP4 — but a community write-up this month exposes a quiet asterisk on that claim, and a clever 80-line workaround that recovers most of the throughput the marketing implies. The short version: the Spark’s GB10 chip cannot actually run FP4 math on its tensor cores, so NVFP4 models had been crawling through a slow software fallback. A Triton monkey-patch reroutes them through the chip’s FP8 cores instead, and Qwen3.6-35B-A3B jumps from 40.8 to 47.6 tok/s.

What the GB10 actually does with FP4

The NVIDIA DGX Spark’s GB10 Superchip (Grace Blackwell, SM121) lacks native FP4 hardware. That matters because vLLM, when handed an NVFP4-quantized model, had been routing the matrix multiplies through a slow BF16 dequantization fallback (the Marlin path) rather than touching the chip’s FP8 tensor cores at all. The 4-bit weights get unpacked back up to 16-bit before the GEMM runs — which throws away most of the point of quantizing in the first place. You save memory, but the compute path is no faster than running a fatter format.

The 80-line Triton bypass

A hands-on write-up (published April 22, updated May 6, 2026) details a roughly 80-line Triton monkey-patch that fixes this without forking vLLM. At model load time the patch converts the NVFP4 weights to FP8, and at inference time it runs the GEMM via torch._scaled_mm on the GB10’s FP8 tensor cores. Because it is a runtime monkey-patch, it requires no changes to vLLM source — you import it, it swaps the relevant kernel, and the existing NVFP4 checkpoint suddenly takes the fast path.

The author also flags a critical correction in the dequantization formula: you must divide by the global scale, not multiply by it, to avoid numerical overflow. It is the kind of one-character sign error that produces garbage or NaNs rather than a clean crash, so it is worth calling out for anyone reproducing the patch.

The numbers

On Qwen3.6-35B-A3B (NVFP4), single-stream generation went from 40.8 tok/s to 47.6 tok/s — a 17% speedup. For context, a native FP8 checkpoint on the same box tops out at 53.8 tok/s, so the patch recovers most — but not all — of the gap between the slow fallback and a purpose-built FP8 model.

Path	Qwen3.6-35B-A3B single-stream
NVFP4 via BF16 fallback (Marlin)	40.8 tok/s
NVFP4 via Triton FP8 bypass	47.6 tok/s (+17%)
Native FP8 checkpoint (ceiling)	53.8 tok/s

The work ran on vLLM 0.19.1 with NVIDIA driver 580.142.

Practitioner note

If you are self-hosting on a Spark and have been reaching for NVFP4 checkpoints because they fit in unified memory, this is worth ten minutes. The honest decision tree: if a native FP8 checkpoint of your model exists and fits, it is still the fastest path at 53.8 tok/s. But FP8 weights are roughly twice the size of FP4, and on a box where every gigabyte of unified memory is contested, that may not be an option for the larger models. In that case the Triton bypass gets you to within ~12% of the FP8 ceiling while keeping the smaller FP4 footprint — a genuinely good trade. Pin your stack to vLLM 0.19.1 and driver 580.142 to match the tested combination, and double-check the divide-not-multiply correction before you trust the outputs.

The under-considered angle

The deeper lesson here is about how to read Blackwell marketing on this hardware. “NVFP4-native Blackwell” is technically true at the format level — the chip understands the encoding — but the GB10’s tensor cores do not execute FP4 math directly, so the 4-bit format only pays off when software bridges it to a path the silicon can actually run fast, like FP8. On a $3K-class box where token/sec and unified-memory headroom are the whole game, an 80-line userland patch with no vLLM fork recovering most of the gap to native FP8 is a pointed reminder: on the Spark, the kernel and runtime layer — not the quant format on the label — is what decides real throughput. The format tells you how small the weights are; it does not tell you how fast they run.