2026-06-08
TensorRT-LLM rc17 lands an NVFP4 MoE backend and NVFP4 KV cache for SM121 (DGX Spark)
TensorRT-LLM v1.3.0rc17 (June 2) adds a FlashInfer NVFP4 MoE backend gated for SM120/SM121, enables NVFP4 KV cache in trtllm-gen attention, and fixes a qwen3 SM120/121 hang — consumer-Blackwell wins for DGX Spark's GB10.
What shipped
NVIDIA tagged TensorRT-LLM v1.3.0rc17 on June 2, 2026. Buried in a long changelog are two entries that matter specifically for people running large models on a single Grace Blackwell box rather than a datacenter rack.
The first is a new feature: “Add FlashInfer NVFP4 MoE backend (SM120/SM121) for Nemotron” (PR #13773). The second: “Enable NVFP4 KV cache support in trtllm-gen attention” (PR #12544). There is also a bug fix that quietly confirms who this is for — “Fix qwen3 hang on SM120/121” (PR #14424).
Why SM121 is the headline
SM120 and SM121 are the consumer-Blackwell compute capabilities. SM120 is the RTX 50-series; SM121 is the GB10 in DGX Spark. They are not the same as datacenter Blackwell (SM100): the tensor-core programming model on SM12x is closer to Ampere’s mma.sync than to datacenter Blackwell’s tcgen05, so kernels written for the datacenter part do not run on GB10 until they are ported and rebuilt for SM121.
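To make the SM100 / SM120 / SM121 split concrete, here is a minimal Python sketch that labels a (major, minor) compute-capability pair. The table holds only the three targets named in this article, and the function names are illustrative, not part of any NVIDIA tool.

```python
from typing import Tuple

# Mapping taken from the article text; not an exhaustive list of NVIDIA parts.
BLACKWELL_TARGETS = {
    (10, 0): "SM100: datacenter Blackwell (tcgen05 tensor-core path)",
    (12, 0): "SM120: RTX 50-series (consumer Blackwell)",
    (12, 1): "SM121: GB10 in DGX Spark (consumer Blackwell)",
}


def describe_target(capability: Tuple[int, int]) -> str:
    """Return a human-readable label for a (major, minor) compute capability."""
    major, minor = capability
    return BLACKWELL_TARGETS.get(capability, f"SM{major}{minor}: not covered here")


def is_consumer_blackwell(capability: Tuple[int, int]) -> bool:
    """True for the two targets the rc17 FlashInfer NVFP4 MoE backend is gated for."""
    return capability in {(12, 0), (12, 1)}


if __name__ == "__main__":
    for cap in [(10, 0), (12, 0), (12, 1)]:
        print(cap, describe_target(cap), is_consumer_blackwell(cap))
```

On a machine with PyTorch and a CUDA device, `torch.cuda.get_device_capability()` returns a tuple of this shape, so its result can be passed straight to `describe_target`.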
That gap is exactly why a backend “gated for SM120/SM121” is news. A FlashInfer NVFP4 mixture-of-experts path that is explicitly compiled for these targets means Nemotron-class MoE models can use the FP4 tensor cores on a DGX Spark instead of falling back to a slower generic path. Owners on the NVIDIA forums have spent weeks asking for an official SM121 software roadmap; rc17 fills in one piece of it.
The KV-cache half of the story
The NVFP4 KV cache entry is the other lever. Per NVIDIA’s own engineering write-up, an NVFP4 KV cache cuts cache memory footprint by up to 50% versus FP8, with under 1% accuracy loss across the benchmarks they published (for example, on Qwen3-480B-A35B: MMLU-PRO 77.4% vs 78.1% for FP8, Ruler 64K 94.6% vs 95.5%). Values are dequantized from NVFP4 to FP8 before the attention math runs.
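As a quick check on the "under 1%" claim, the two quoted score pairs can be turned into absolute and relative drops. This is a small sketch using only the numbers above; the helper names and the 1-point budget are illustrative choices, not anything NVIDIA defines.

```python
# Scores quoted above, in percent: (NVFP4 KV cache, FP8 KV cache).
PUBLISHED = {
    "MMLU-PRO": (77.4, 78.1),
    "Ruler 64K": (94.6, 95.5),
}


def accuracy_drop(nvfp4: float, fp8: float) -> tuple:
    """Return (absolute drop in points, relative drop in percent of the FP8 score)."""
    absolute = fp8 - nvfp4
    relative = 100.0 * absolute / fp8
    return absolute, relative


def within_budget(nvfp4: float, fp8: float, max_points: float = 1.0) -> bool:
    """True if the absolute drop stays under the chosen budget (hypothetical threshold)."""
    return accuracy_drop(nvfp4, fp8)[0] < max_points


for name, (nvfp4, fp8) in PUBLISHED.items():
    absolute, relative = accuracy_drop(nvfp4, fp8)
    print(f"{name}: -{absolute:.1f} points ({relative:.2f}% relative)")
```

Both pairs land at roughly 0.7 and 0.9 points, so the claim holds whether "1%" is read as percentage points or as a relative change. The same helper can be pointed at your own FP8 and NVFP4 eval scores.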
| Lever in rc17 | What it buys |
|---|---|
| FlashInfer NVFP4 MoE (SM120/SM121) | FP4 expert kernels that actually compile for GB10 / RTX 50 |
| NVFP4 KV cache (trtllm-gen attention) | About half the KV memory vs FP8; room to double context or batch |
| qwen3 SM120/121 hang fix | Removes a hard blocker for Qwen3 on consumer Blackwell |
On a 128 GB unified-memory part, halving KV-cache bytes is not a microbenchmark flex; it is the difference between a long-context session fitting and thrashing. NVIDIA reports that the same NVFP4 KV cache enables up to roughly double the context length and batch size, and up to 3x better time-to-first-token, in its large-scale numbers. Those headline figures come from datacenter Blackwell, not a measured GB10 run.
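A back-of-envelope sizing sketch shows where "about half" comes from. It assumes the standard NVFP4 layout (4-bit values plus one 8-bit scale per 16-element block, so 4.5 bits per element) and ignores per-tensor scales and allocator overhead. The model shape below is hypothetical, and the actual cache layout in TensorRT-LLM may differ.

```python
FP8_BYTES = 1.0
# 4-bit value + one 8-bit block scale shared by 16 values = 4.5 bits per element.
NVFP4_BYTES = (4 + 8 / 16) / 8


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_element: float) -> float:
    """Bytes for keys plus values across all layers for `tokens` cached tokens."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_element


if __name__ == "__main__":
    # Hypothetical GQA model shape, chosen only for illustration.
    shape = dict(layers=48, kv_heads=8, head_dim=128, tokens=131072)
    fp8 = kv_cache_bytes(**shape, bytes_per_element=FP8_BYTES)
    nvfp4 = kv_cache_bytes(**shape, bytes_per_element=NVFP4_BYTES)
    gib = 2 ** 30
    print(f"FP8:   {fp8 / gib:.2f} GiB")
    print(f"NVFP4: {nvfp4 / gib:.2f} GiB ({100 * (1 - nvfp4 / fp8):.2f}% smaller)")
```

Under these assumptions the saving is about 44%, slightly short of the "up to 50%" headline because of the block scales. On a unified-memory machine the freed bytes are shared with weights and activations, so the practical headroom depends on the rest of the workload.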
Practitioner note
This is a release candidate (rc17), and the release itself flags a known issue: DeepSeek V3.2 can crash with an illegal memory access during long aggregated/disaggregated (agg/disagg) perf tests. If you pull rc17 onto a DGX Spark to try the FP4 MoE path, treat it as evaluation, not production: pin the exact build, run your own accuracy spot-check before trusting the KV-cache quantization on your workload, and note that the bundled flashinfer-python is itself a release candidate (bumped to 0.6.12rc2). The NVFP4 KV cache also needs a model quantized with the right recipe (post-training or quantization-aware, via the Model Optimizer); it is not a runtime flag you flip on an arbitrary FP16 checkpoint.
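One way to enforce "pin the exact build" is a startup check that compares installed package versions against the ones named above. The sketch below uses only the standard library. The distribution names and exact version strings are assumptions about how the packages are published and should be adjusted to match your environment.

```python
from importlib import metadata

# Versions taken from the text above. Distribution names are assumptions.
EXPECTED = {
    "tensorrt_llm": "1.3.0rc17",
    "flashinfer-python": "0.6.12rc2",
}


def installed_version(dist_name: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(expected: dict, lookup=installed_version) -> dict:
    """Return {name: (wanted, found)} for every package that does not match its pin."""
    mismatches = {}
    for name, wanted in expected.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    problems = check_pins(EXPECTED)
    for name, (wanted, found) in problems.items():
        print(f"{name}: expected {wanted}, found {found}")
    if not problems:
        print("All pinned versions match.")
```

Running this before an evaluation session makes an unnoticed upgrade of either release candidate fail loudly, which keeps accuracy and performance numbers tied to a known build.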
Under-considered angle
The quiet story is that NVFP4 enablement is migrating down the stack from “runs on datacenter Blackwell” to “compiles for the chip in your office.” Most published NVFP4 numbers (the 50% KV savings, the 3x TTFT, the accuracy tables) were measured on SM100 datacenter parts, yet the SM12x instruction set is genuinely different. So the interesting open question for DGX Spark owners is not whether NVFP4 helps in principle, but how much of the datacenter benefit actually survives the recompile to SM121, where the tensor-core path looks more like Ampere. rc17 delivers the kernels; the honest GB10-measured deltas are still owed.