2026-05-09
TensorRT-LLM v1.3.0rc14 — Qwen3.5 NVFP4 weight-loading fix lands, Mamba-hybrid prefix caching enabled
TRT-LLM 1.3.0rc14 (May 7) lands the Qwen3.5 NVFP4 weight_scales fix, Mamba-hybrid prefix caching, NVFP4 weight-update, DFlash one-model spec-dec, and a Spark-named GEMM perf PR.
NVIDIA tagged TensorRT-LLM v1.3.0rc14 on May 7, 2026 at 08:55 UTC. It is the first release candidate that simultaneously fixes the long-standing Qwen3.5 NVFP4 weight-loading bug, enables prefix caching for Mamba-hybrid models, and lands NVFP4 weight-update support — three issues that have been blocking real Spark deployments since April.
What ships
| Area | PR | What it does |
|---|---|---|
| Qwen3.5 NVFP4 | #13716 | Preserves weight_scales during checkpoint load — fixes the silent 0-tensor regression in issue #12762 |
| MoE routing | #13433 | Extends customMoeRouting for Qwen3.5 |
| Mamba-hybrid | #12185 | Prefix caching for Qwen3.5 + Nemotron Super V3 hybrids |
| NVFP4 weights | #12320 | NVFP4 weight-update support |
| Spec-dec | #12794 | DFlash one-model speculative decoding |
| Spec-dec | #13453 | Mamba-2 rollback replay (lets spec-dec actually work on hybrids) |
| GEMM | #11589 | GEMM-to-allreduce with registered buffers |
| Dense GEMM | #12074 | CuteDSL bf16 dense GEMMs |
| Spark-specific | #13160 | “improve gemm perf for nemotron in spark” — directly named for GB10 |
| Eagle3 | #13565 | Acceptance threshold lowered for H20 (implies tighter measurement on small-GPU hosts) |
The release contains 75 contributor changes in total. PR #13160 is the headline for Spark operators — it’s the first PR in this release line where DGX Spark appears in the title, signaling that NVIDIA is profiling for the platform deliberately rather than improving it only incidentally.
What it unblocks
Qwen3.5 NVFP4 in production. Issue #12762 had been open since April with users reporting 0-tensor outputs after weight load. rc14 closes it. Anyone running Qwen3.5-30B-A3B on TRT-LLM should pull this container before deploying.
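A quick way to verify the fix took effect on your own checkpoints is to scan the converted shards for all-zero scale tensors (the symptom reported in #12762) before serving. A minimal sketch, assuming the NVFP4 checkpoint is stored as safetensors shards and that scale tensors carry weight_scale in their names; both the directory layout and the naming are assumptions to adapt to your converter’s output:

```python
# Scan an NVFP4 checkpoint for all-zero scale tensors (the #12762 symptom).
# Assumes safetensors shards and "weight_scale" in the tensor names -- adjust
# both to whatever your checkpoint converter actually emits.
from pathlib import Path

import torch
from safetensors import safe_open

def find_zero_scales(ckpt_dir: str) -> list[str]:
    suspect = []
    for shard in sorted(Path(ckpt_dir).glob("*.safetensors")):
        with safe_open(str(shard), framework="pt") as f:
            for name in f.keys():
                if "weight_scale" not in name:
                    continue
                # Cast to float32 so the check works for fp8/fp16 scale dtypes.
                t = f.get_tensor(name).float()
                if torch.count_nonzero(t) == 0:
                    suspect.append(f"{shard.name}:{name}")
    return suspect

if __name__ == "__main__":
    bad = find_zero_scales("./qwen3.5-30b-a3b-nvfp4")  # hypothetical path
    print("all scale tensors non-zero" if not bad else f"zeroed scales: {bad}")
```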
Mamba-hybrid agentic loops. Prefix caching on Nemotron Super V3 / Qwen3.5 hybrid SSMs means follow-up turns in agent sessions stop paying the full prefill cost. For multi-turn coding agents and long-conversation chatbots, the TTFT improvement on cached turns can be an order of magnitude.
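If you drive the model through the Python LLM API rather than a server flag, prefix caching maps to KV-cache block reuse. A minimal sketch, assuming the llmapi surface exposed in recent releases (KvCacheConfig with enable_block_reuse) and an illustrative local checkpoint path; verify the exact option names against the rc14 docs:

```python
# Enable prefix caching (KV-cache block reuse) via the LLM API.
# The checkpoint path is illustrative; option names assume the recent llmapi
# surface and should be checked against the rc14 documentation.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="./qwen3.5-30b-a3b-nvfp4",  # hypothetical local checkpoint
    kv_cache_config=KvCacheConfig(enable_block_reuse=True),
)

shared_prefix = "You are a coding agent. Repository context: ..."  # long, reused
params = SamplingParams(max_tokens=128)

# Turn 1 pays the full prefill; turn 2 shares the prefix and should reuse
# cached blocks, which is where the TTFT win shows up.
for turn in ["Summarize the failing test.", "Now propose a fix."]:
    out = llm.generate([shared_prefix + "\n" + turn], params)
    print(out[0].outputs[0].text)
```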
Spec-dec on hybrids. PR #13453’s Mamba-2 rollback replay is what makes speculative decoding actually safe on hybrid models — previously, rejected draft tokens could leave the SSM state inconsistent with the accepted sequence. With rollback replay, spec-dec on Qwen3.5 hybrids becomes a viable speedup lever.
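The idea behind rollback replay is simple even if the engine-side plumbing is not: checkpoint the recurrent state before stepping through a draft, and when tokens are rejected, restore the checkpoint and replay only the accepted prefix. A toy illustration of that invariant (not TRT-LLM’s implementation; the recurrence is a stand-in):

```python
# Toy illustration of rollback replay for speculative decoding on a recurrent
# (SSM-style) model: unlike a KV cache, the state can't be truncated per token,
# so rejected drafts require restoring a snapshot and replaying accepted tokens.
from copy import deepcopy

class ToySSM:
    def __init__(self) -> None:
        self.state = 0.0

    def step(self, token: int) -> None:
        # Stand-in recurrence: the state depends on every token ever consumed.
        self.state = 0.9 * self.state + token

def verify_draft(ssm: ToySSM, draft: list[int], accepted: int) -> None:
    snapshot = deepcopy(ssm.state)    # checkpoint before speculative steps
    for tok in draft:                 # optimistically advance through the draft
        ssm.step(tok)
    if accepted < len(draft):         # some draft tokens were rejected
        ssm.state = snapshot          # roll back ...
        for tok in draft[:accepted]:  # ... and replay only the accepted prefix
            ssm.step(tok)

ssm = ToySSM()
verify_draft(ssm, draft=[3, 1, 4], accepted=1)
print(ssm.state)  # reflects only the single accepted token (3.0)
```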
What to do
- Pull the rc14 container: docker pull nvcr.io/nvidia/tensorrt-llm:1.3.0rc14-py3
- Rebuild your Qwen3.5-30B-A3B NVFP4 checkpoint; old checkpoints loaded with the broken code path may have stale weight_scales baked in.
- Re-test Mamba-hybrid models with --enable_prefix_caching. Measure TTFT on cached vs cold first-turn requests (see the sketch after this list); the gap is the new latency budget for downstream agent steps.
- If you’re on Eagle3 spec-dec, re-tune your acceptance threshold per #13565’s H20 reference.
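One way to measure the cached-vs-cold gap is to stream two requests that share a long prefix against a local OpenAI-compatible endpoint (for example one exposed by trtllm-serve) and time the first streamed chunk. A minimal sketch; the URL, model name, and prompts are placeholders:

```python
# Time-to-first-token, cold vs. prefix-cached, against a local OpenAI-compatible
# endpoint. URL, model name, and prompts are placeholders for your deployment.
import time

import requests

URL = "http://localhost:8000/v1/completions"  # placeholder endpoint
MODEL = "qwen3.5-30b-a3b-nvfp4"               # placeholder served-model name
PREFIX = "Long shared agent context ... " * 200

def ttft(prompt: str) -> float:
    start = time.perf_counter()
    with requests.post(
        URL,
        json={"model": MODEL, "prompt": prompt, "max_tokens": 16, "stream": True},
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # first non-empty streamed chunk ~= TTFT
                return time.perf_counter() - start
    return float("nan")

cold = ttft(PREFIX + "Turn 1: summarize the repo.")
cached = ttft(PREFIX + "Turn 2: propose a refactor.")
print(f"cold TTFT  : {cold:.3f}s")
print(f"cached TTFT: {cached:.3f}s")
```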