2026-06-10

llama.cpp b9555 Ships Native NVFP4 Kernels for Blackwell SM121, Unlocking Full DGX Spark Performance

llama.cpp b9555 ships native NVFP4 GEMM kernels for Blackwell SM121/GB10 — the first build to bypass the FP16-compute fallback, with an estimated 30–40% single-user decode throughput gain on DGX Spark.

The release

Build b9555 of llama.cpp, released June 8, 2026 on the ggml-org GitHub repository, is the release DGX Spark users have been waiting for. For the first time, the CUDA backend ships with native NVFP4 matrix-multiply kernels compiled for Blackwell SM121, the compute architecture inside the GB10 chip that powers every DGX Spark. Until now, running NVFP4-quantised models on the Spark with native tensor core acceleration required either TensorRT-LLM (fast but operationally heavy) or vLLM (high-throughput but with single-user overhead). llama.cpp’s lightweight, single-binary deployment model now has the hardware acceleration to match.

Why NVFP4 matters on GB10

The Grace Blackwell GB10 SoC has two fundamental advantages over prior-generation hardware: a 900 GB/s bidirectional NVLink-C2C connection between the Grace CPU and the Blackwell GPU, and native tensor core support for sub-8-bit formats including NVFP4. For inference workloads, NVFP4 halves the memory footprint of the already-compact FP8 representation, which translates directly to model capacity per device.

A Qwen3-30B model in FP16 fills roughly 60 GB of the Spark’s 128 GB unified memory; in NVFP4 it lands around 15 GB, leaving ample room in that same pool for a 128K-token KV cache. For MoE architectures like Qwen3-235B, which require a dual-Spark setup, NVFP4 is not a nice-to-have: it is the difference between keeping the active expert layers resident in unified memory and constantly evicting them.
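The footprint figures above follow from simple arithmetic on bits per weight. The sketch below reproduces them; the helper name weights_gb is hypothetical, the parameter counts are taken as the nominal 30B and 235B, and the 4.5-bit figure assumes NVFP4 stores one 8-bit scale per 16-element block (an assumption about the format's overhead, not something stated in the release notes). KV cache and activations are ignored.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a given bit width."""
    return params_billion * bits_per_weight / 8.0


# Assumption: 4-bit elements plus one 8-bit scale shared by every 16 elements.
NVFP4_BITS_WITH_SCALES = 4.0 + 8.0 / 16.0  # 4.5 bits per weight

SPARK_MEMORY_GB = 128.0  # unified memory per DGX Spark, as quoted in the article

for name, params in [("Qwen3-30B", 30.0), ("Qwen3-235B", 235.0)]:
    fp16 = weights_gb(params, 16.0)
    raw = weights_gb(params, 4.0)                      # payload only, no scales
    scaled = weights_gb(params, NVFP4_BITS_WITH_SCALES)
    fits = scaled <= SPARK_MEMORY_GB
    print(f"{name}: FP16 {fp16:.1f} GB, NVFP4 {raw:.1f} to {scaled:.1f} GB, "
          f"weights fit on one Spark: {fits}")
```

Under these assumptions the 30B model needs about 15 to 17 GB for weights, while the 235B model needs about 118 to 132 GB before any KV cache is counted, which is consistent with the dual-Spark requirement.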

What b9555 actually changed

Prior to b9555, llama.cpp’s CUDA backend could load NVFP4-quantised GGUF files on Blackwell hardware, but the matrix-multiply operations fell back to a software dequantise-then-multiply path. The output was correct, but there was no tensor core utilisation: NVFP4 weights ran at FP16 compute speed, which negated the format’s bandwidth savings at the kernel level.

The PR merged in b9555 wires NVFP4 inputs directly into Blackwell’s blockscaled GEMM tensor core path, the same path exposed by NVIDIA CUTLASS 4.x and TensorRT-LLM’s SM121 kernels. The implementation handles NVFP4 tensor names and their corresponding scale-factor tensors across both dense and expert (MoE) layers — a detail that earlier experimental patches had not fully resolved for MoE models.
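To make the difference between the two paths concrete, here is a toy scalar model in Python. It is not llama.cpp's kernel, the GGUF tensor layout, or the CUTLASS API: the function names are hypothetical, the block size of 16 is an assumption, and activations are kept as ordinary floats for simplicity. The first function expands every weight to full precision before multiplying, as the fallback did; the second accumulates on the raw 4-bit values and applies each block's scale once, which is the idea behind a block-scaled GEMM.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (low 3 bits); bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # assumed number of weights sharing one scale factor


def decode_fp4(code: int) -> float:
    """Map a 4-bit code (0..15) to its unscaled E2M1 value."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def dot_dequant_then_multiply(codes, scales, x):
    """Fallback style: materialise every scaled weight, then do a plain dot product."""
    weights = [decode_fp4(c) * scales[i // BLOCK] for i, c in enumerate(codes)]
    return sum(w * xi for w, xi in zip(weights, x))


def dot_block_scaled(codes, scales, x):
    """Block-scaled style: accumulate per block on unscaled values, scale once per block."""
    total = 0.0
    for b, scale in enumerate(scales):
        lo = b * BLOCK
        partial = sum(decode_fp4(codes[lo + j]) * x[lo + j] for j in range(BLOCK))
        total += scale * partial
    return total


# Two blocks of 16 weights, with power-of-two scales so the arithmetic is exact.
codes = [2, 10, 5, 7] * 8          # 1.0, -1.0, 3.0, 6.0 repeated
scales = [0.5, 2.0]
x = [float(i % 4) for i in range(32)]

a = dot_dequant_then_multiply(codes, scales, x)
b = dot_block_scaled(codes, scales, x)
print(a, b, math.isclose(a, b))
```

Both functions return the same value; the second simply never holds a full-precision copy of the weights, which is where the bandwidth saving comes from in a real kernel.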

The SM121-specific code shares the bulk of its kernel logic with SM120 (the consumer RTX 5090 Blackwell variant) but adds the NVLink-aware memory layout optimisations specific to the GB10’s unified memory architecture. Benchmarks from the DGX Spark community using earlier experimental builds showed that SM120 kernels running on SM121 hardware left 15–20% of peak throughput on the table due to misaligned memory access patterns.

Expected performance impact

Using the prior fallback path, Llama-4-Scout-17B in NVFP4 was achieving roughly 45–50 tokens/s decode on a single DGX Spark in single-user mode, based on data from the Spark Arena leaderboard and NVIDIA Developer Forums benchmark threads. The SM121 native kernel path is expected to close the gap toward the TensorRT-LLM reference figure of around 65–70 tokens/s for the same model and quantisation — a 30–40% throughput gain without any change to the serving stack or model weights.
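The percentage can be checked against the endpoints of the two quoted ranges. The short sketch below does that arithmetic; gain_pct is a hypothetical helper, and the inputs are only the tokens/s figures quoted above, not new measurements.

```python
def gain_pct(before_tps: float, after_tps: float) -> float:
    """Relative throughput gain in percent when going from before_tps to after_tps."""
    return (after_tps / before_tps - 1.0) * 100.0


fallback_tps = (45.0, 50.0)    # quoted fallback-path range, tokens/s
reference_tps = (65.0, 70.0)   # quoted TensorRT-LLM reference range, tokens/s

# Smallest gain: best fallback figure against the lowest reference figure.
conservative = gain_pct(fallback_tps[1], reference_tps[0])
# Largest gain: worst fallback figure against the highest reference figure.
optimistic = gain_pct(fallback_tps[0], reference_tps[1])
# Gain between the midpoints of the two ranges.
midpoint = gain_pct(sum(fallback_tps) / 2, sum(reference_tps) / 2)

print(f"conservative {conservative:.1f}%, midpoint {midpoint:.1f}%, "
      f"optimistic {optimistic:.1f}%")
```

The endpoints bracket a gain of roughly 30% to 56%, with the midpoints at about 42%, so the quoted 30–40% sits toward the conservative end of what fully closing the gap would imply.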

For Qwen3-30B, a common local deployment target on the Spark, the bandwidth-bound decode path should see a similar uplift. Single-user NVFP4 decode on Blackwell is limited by memory bandwidth rather than compute, and the native kernel reduces the number of memory transactions per token by approximately 2x relative to the FP16 fallback compute path. At batch size 1, that puts a ceiling of roughly 2x on the sustainable decode rate, assuming the model and KV cache fit in the Spark’s unified memory; the 30–40% estimate above is the more conservative expectation, presumably because not all per-token work shrinks with the weight traffic.
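One way to relate a roughly 2x reduction in weight traffic to the 30–40% end-to-end estimate is an Amdahl-style model: only the share of per-token time spent reading weights gets faster. The sketch below is a hypothetical model for intuition, not a measurement; the function names and the idea of a single "weight share" parameter are assumptions introduced here.

```python
def decode_speedup(weight_share: float, traffic_reduction: float = 2.0) -> float:
    """Overall speedup when only the weight-read share of token time shrinks."""
    if not 0.0 <= weight_share <= 1.0:
        raise ValueError("weight_share must be in [0, 1]")
    return 1.0 / ((1.0 - weight_share) + weight_share / traffic_reduction)


def implied_weight_share(speedup: float, traffic_reduction: float = 2.0) -> float:
    """Invert decode_speedup: which weight-read share would produce this speedup?"""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / traffic_reduction)


# Forward direction: how much of the 2x survives for different weight shares.
for share in (0.5, 0.75, 1.0):
    print(f"weight share {share:.2f} -> {decode_speedup(share):.2f}x")

# Reverse direction: what the article's 30-40% estimate would imply.
for s in (1.3, 1.4):
    print(f"{s:.1f}x gain implies weight share of about {implied_weight_share(s):.2f}")
```

In this model a 2x traffic reduction yields a full 2x only if weight reads account for all of the per-token time, while a 30–40% gain corresponds to weight reads taking roughly half of it.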

Practical implications for DGX Spark deployments

For teams running local inference on a DGX Spark, b9555 makes llama.cpp a first-class option for NVFP4 models rather than a fallback. The framework choice calculus has historically been: llama.cpp for single-user interactive workloads (lower latency, simpler setup, no tokeniser server overhead) and vLLM for concurrent multi-user or batch workloads (continuous batching, multi-modal pipeline support). That division remains true after b9555, but the performance gap to vLLM’s NVFP4 path in single-user scenarios is now narrower than it has been at any point in 2026.

For users of llama-benchy, the community tool that provides llama-bench style numbers across vLLM, SGLang, and llama.cpp in a unified harness, b9555 is worth a re-run with the --backend llama.cpp --quant nvfp4 configuration.

Caveats

Two limitations are worth noting. First, b9555’s NVFP4 support is limited to inference; training and fine-tuning in NVFP4 remain out of scope for llama.cpp. Second, the SM121 path is currently validated against dense models and MoE architectures with up to 64 expert slots. Very large MoE configurations — such as the full Qwen3-235B routing tables — may require additional tuning passes.

The bottom line: if you are running NVFP4 GGUF models on a DGX Spark and have been pinned to llama.cpp for operational simplicity, update to b9555 and re-run your benchmarks.