2026-05-09
TensorRT-LLM v1.3.0rc14 — Qwen3.5 NVFP4 weight-loading fix lands, Mamba-hybrid prefix caching enabled
TRT-LLM 1.3.0rc14 (May 7) lands the Qwen3.5 NVFP4 weight_scales fix, Mamba-hybrid prefix caching, NVFP4 weight-update, DFlash one-model spec-dec, and a Spark-named GEMM perf PR.
NVIDIA tagged TensorRT-LLM v1.3.0rc14 on May 7, 2026 at 08:55 UTC. It is the first release candidate that simultaneously fixes the long-standing Qwen3.5 NVFP4 weight-loading bug, enables prefix caching for Mamba-hybrid models, and lands NVFP4 weight-update support — three issues that have been blocking real Spark deployments since April.
What ships
| Area | PR | What it does |
|---|---|---|
| Qwen3.5 NVFP4 | #13716 | Preserves weight_scales during checkpoint load — fixes the silent 0-tensor regression in issue #12762 |
| MoE routing | #13433 | Extends customMoeRouting for Qwen3.5 |
| Mamba-hybrid | #12185 | Prefix caching for Qwen3.5 + Nemotron Super V3 hybrids |
| NVFP4 weights | #12320 | NVFP4 weight-update support |
| Spec-dec | #12794 | DFlash one-model speculative decoding |
| Spec-dec | #13453 | Mamba-2 rollback replay (lets spec-dec actually work on hybrids) |
| GEMM | #11589 | GEMM-to-allreduce with registered buffers |
| Dense GEMM | #12074 | CuteDSL bf16 dense GEMMs |
| Spark-specific | #13160 | “improve gemm perf for nemotron in spark” — directly named for GB10 |
| Eagle3 | #13565 | Acceptance threshold lowered for H20 (implies tighter measurement on small-GPU hosts) |
The release contains 75 contributor changes in total. PR #13160 is the headline for Spark operators — it’s the first PR in this release line where DGX Spark appears in the title, signaling that NVIDIA is profiling for the platform deliberately rather than improving it only incidentally.
What it unblocks
Qwen3.5 NVFP4 in production. Issue #12762 had been open since April with users reporting 0-tensor outputs after weight load. rc14 closes it. Anyone running Qwen3.5-30B-A3B on TRT-LLM should pull this container before deploying.
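A quick way to verify the fix took effect on your own checkpoints is to scan the converted shards for all-zero scale tensors (the symptom reported in #12762) before serving. A minimal sketch, assuming the NVFP4 checkpoint is stored as safetensors shards and that scale tensors carry weight_scale in their names; both the directory layout and the naming are assumptions to adapt to your converter’s output:

```python
# Scan an NVFP4 checkpoint for all-zero scale tensors (the #12762 symptom).
# Assumes safetensors shards and "weight_scale" in the tensor names -- adjust
# both to whatever your checkpoint converter actually emits.
from pathlib import Path

import torch
from safetensors import safe_open

def find_zero_scales(ckpt_dir: str) -> list[str]:
    suspect = []
    for shard in sorted(Path(ckpt_dir).glob("*.safetensors")):
        with safe_open(str(shard), framework="pt") as f:
            for name in f.keys():
                if "weight_scale" not in name:
                    continue
                # Cast to float32 so the check works for fp8/fp16 scale dtypes.
                t = f.get_tensor(name).float()
                if torch.count_nonzero(t) == 0:
                    suspect.append(f"{shard.name}:{name}")
    return suspect

if __name__ == "__main__":
    bad = find_zero_scales("./qwen3.5-30b-a3b-nvfp4")  # hypothetical path
    print("all scale tensors non-zero" if not bad else f"zeroed scales: {bad}")
```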
Mamba-hybrid agentic loops. Prefix caching on Nemotron Super V3 / Qwen3.5 hybrid SSMs means follow-up turns in agent sessions stop paying the full prefill cost. For multi-turn coding agents and long-conversation chatbots, the TTFT improvement on cached turns can be an order of magnitude.
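If you drive the model through the Python LLM API rather than a server flag, prefix caching maps to KV-cache block reuse. A minimal sketch, assuming the llmapi surface exposed in recent releases (KvCacheConfig with enable_block_reuse) and an illustrative local checkpoint path; verify the exact option names against the rc14 docs:

```python
# Enable prefix caching (KV-cache block reuse) via the LLM API.
# The checkpoint path is illustrative; option names assume the recent llmapi
# surface and should be checked against the rc14 documentation.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="./qwen3.5-30b-a3b-nvfp4",  # hypothetical local checkpoint
    kv_cache_config=KvCacheConfig(enable_block_reuse=True),
)

shared_prefix = "You are a coding agent. Repository context: ..."  # long, reused
params = SamplingParams(max_tokens=128)

# Turn 1 pays the full prefill; turn 2 shares the prefix and should reuse
# cached blocks, which is where the TTFT win shows up.
for turn in ["Summarize the failing test.", "Now propose a fix."]:
    out = llm.generate([shared_prefix + "\n" + turn], params)
    print(out[0].outputs[0].text)
```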
Spec-dec on hybrids. PR #13453’s Mamba-2 rollback replay is what makes speculative decoding actually safe on hybrid models — previously, rejected draft tokens could leave the SSM state inconsistent with the accepted sequence. With rollback replay, spec-dec on Qwen3.5 hybrids becomes a viable speedup lever.
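The idea behind rollback replay is simple even if the engine-side plumbing is not: checkpoint the recurrent state before stepping through a draft, and when tokens are rejected, restore the checkpoint and replay only the accepted prefix. A toy illustration of that invariant (not TRT-LLM’s implementation; the recurrence is a stand-in):

```python
# Toy illustration of rollback replay for speculative decoding on a recurrent
# (SSM-style) model: unlike a KV cache, the state can't be truncated per token,
# so rejected drafts require restoring a snapshot and replaying accepted tokens.
from copy import deepcopy

class ToySSM:
    def __init__(self) -> None:
        self.state = 0.0

    def step(self, token: int) -> None:
        # Stand-in recurrence: the state depends on every token ever consumed.
        self.state = 0.9 * self.state + token

def verify_draft(ssm: ToySSM, draft: list[int], accepted: int) -> None:
    snapshot = deepcopy(ssm.state)    # checkpoint before speculative steps
    for tok in draft:                 # optimistically advance through the draft
        ssm.step(tok)
    if accepted < len(draft):         # some draft tokens were rejected
        ssm.state = snapshot          # roll back ...
        for tok in draft[:accepted]:  # ... and replay only the accepted prefix
            ssm.step(tok)

ssm = ToySSM()
verify_draft(ssm, draft=[3, 1, 4], accepted=1)
print(ssm.state)  # reflects only the single accepted token (3.0)
```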
What to do
- Pull the rc14 container: docker pull nvcr.io/nvidia/tensorrt-llm:1.3.0rc14-py3
- Rebuild your Qwen3.5-30B-A3B NVFP4 checkpoint; old checkpoints loaded with the broken code path may have stale weight_scales baked in.
- Re-test Mamba-hybrid models with --enable_prefix_caching. Measure TTFT on cached vs cold first-turn requests (see the sketch after this list); the gap is the new latency budget for downstream agent steps.
- If you’re on Eagle3 spec-dec, re-tune your acceptance threshold per #13565’s H20 reference.
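One way to measure the cached-vs-cold gap is to stream two requests that share a long prefix against a local OpenAI-compatible endpoint (for example one exposed by trtllm-serve) and time the first streamed chunk. A minimal sketch; the URL, model name, and prompts are placeholders:

```python
# Time-to-first-token, cold vs. prefix-cached, against a local OpenAI-compatible
# endpoint. URL, model name, and prompts are placeholders for your deployment.
import time

import requests

URL = "http://localhost:8000/v1/completions"  # placeholder endpoint
MODEL = "qwen3.5-30b-a3b-nvfp4"               # placeholder served-model name
PREFIX = "Long shared agent context ... " * 200

def ttft(prompt: str) -> float:
    start = time.perf_counter()
    with requests.post(
        URL,
        json={"model": MODEL, "prompt": prompt, "max_tokens": 16, "stream": True},
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # first non-empty streamed chunk ~= TTFT
                return time.perf_counter() - start
    return float("nan")

cold = ttft(PREFIX + "Turn 1: summarize the repo.")
cached = ttft(PREFIX + "Turn 2: propose a refactor.")
print(f"cold TTFT  : {cold:.3f}s")
print(f"cached TTFT: {cached:.3f}s")
```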