2026-05-24 — views

llama.cpp merges native MTP speculative decoding — ~2.16× single-stream decode on Qwen3.6 for DGX Spark

PR #22673 lands native multi-token-prediction speculative decoding in llama.cpp (build b9180+). On a GB10 DGX Spark, Qwen3.6-27B Q4_K_M jumps 13.1 to 28.3 tok/s single-request — but regresses under concurrency.

A feature that home and solo self-hosters have been waiting on landed upstream this month: native multi-token-prediction (MTP) speculative decoding merged into llama.cpp via PR #22673 (“llama + spec: MTP Support”, by am17an) on 2026-05-16, shipping in build b9180 and newer. On a GB10 DGX Spark it roughly doubles single-request decode throughput on Qwen3.6 — with one important caveat that flips the value proposition under load.

What MTP speculative decoding is

Speculative decoding speeds up token generation by drafting several candidate tokens cheaply, then verifying them in a single forward pass. The classic approach needs a separate small draft model running alongside the big one — extra memory, extra setup, and a quality-match headache.

MTP removes the second model. Qwen3.6 ships with native multi-token-prediction heads: the model already predicts several future tokens as part of its architecture. llama.cpp’s new --spec-type draft-mtp mode uses those built-in heads as the draft source, so the same weights both speculate and verify. No draft model to source, no mismatch risk, and the drafts are higher quality because they come from the target model itself.

Two tunables control the aggressiveness:

--spec-draft-n-max — how many tokens to draft per step (5 is the sweet spot in the benchmarks below)
--spec-draft-p-min — the minimum acceptance probability before a drafted token is kept

The numbers — GB10 DGX Spark

On the NVIDIA Developer Forums, a community benchmark (dated 2026-05-15) ran Qwen3.6-27B dense, Q4_K_M on a DGX Spark:

Scenario	Without MTP	With MTP (5 draft)	Change
Single request	13.1 tok/s	28.3 tok/s	+2.16×
4 concurrent requests	41.5 tok/s	29.9 tok/s	−28%

The single-stream win is real and large. But notice the second row: under four concurrent requests, MTP hurts aggregate throughput. That is not a bug — it is the fundamental trade-off of speculative decoding.

The gotcha: latency vs throughput

Speculative decoding trades spare compute for lower latency. When you serve one request at a time, the GB10’s tensor cores are idle most of the decode loop (decode is memory-bandwidth-bound on Spark’s 273 GB/s LPDDR5X), so drafting extra tokens is nearly free and you get the 2× speedup.

Under batching, the opposite is true: concurrent requests already saturate the compute, so the speculative drafts compete for cycles and the wasted work on rejected tokens drags aggregate throughput down. This makes MTP a killer feature for single-user, interactive self-hosting — and the wrong default for a multi-user serving box. If your DGX Spark is your personal coding/assistant endpoint, turn it on; if it is fronting several teammates, leave it off.

Reproduces across hardware

The effect is not Spark-specific. A cross-platform writeup on an RTX 3090 measured Qwen3.6-27B at 38 → 65 tok/s (1.71×) with no quality loss, and confirmed it on Qwen3.6-35B-A3B as well. MTP-enabled GGUFs are already on Hugging Face (for example froggeric/Qwen3.6-27B-MTP-GGUF), so you do not need to convert weights yourself — pull an MTP build, grab an MTP GGUF, and add the --spec-type draft-mtp flag.

Companion development: TensorRT-LLM v1.3.0rc15

For the production-inference side of the Spark ecosystem, NVIDIA shipped TensorRT-LLM v1.3.0rc15 on 2026-05-21 (the project is on a roughly weekly rc cadence — rc14 was 2026-05-07). Highlights relevant to GB10 (which is SM 12.1):

Gemma4 support with text, vision, audio, and chunked-prefill — a new multimodal family for Blackwell inference.
INT4-AWQ kernels for SM120/121, directly covering Spark-class hardware.
Broadened NVFP4 / MXFP4 MoE backends (MegaMoE DeepGEMM, CUTEDSL MoE for Nemotron-H, W4A8_MXFP4_FP8) plus FP4/FP8 decode-kernel indexing optimizations.

The two tracks complement each other: llama.cpp MTP is the path of least resistance for solo interactive use today, while TensorRT-LLM is where the quantized-MoE and multimodal serving performance lives as it matures on Blackwell.

Takeaway

If you run a DGX Spark as a personal LLM endpoint, the MTP merge is the single highest-leverage update this month: a build bump plus one flag for a ~2× interactive speedup on Qwen3.6, no draft model required. Just remember it is a single-stream optimization — benchmark your own concurrency level before enabling it on a shared box.