2026-06-06

Gemma 4 Multi-Token Prediction Lands in llama.cpp: Self-Speculative Decoding Goes Mainstream for Local Inference

llama.cpp merged native Multi-Token Prediction (MTP) speculative decoding in May 2026 (PR #22673), reporting roughly 2.4x faster single-stream generation on Qwen3.6-27B at about 72% draft acceptance, with the draft head loaded from the same GGUF file as the main model.

What shipped

llama.cpp now has native support for Multi-Token Prediction (MTP) speculative decoding. The core infrastructure landed in PR #22673 (“llama + spec: MTP Support”), opened May 4, 2026 and merged May 16, 2026. On June 6, 2026 the follow-on Gemma 4 MTP work (PR #23398) was marked ready for review, extending the same machinery to Google’s Gemma 4 family.

The headline for local users is single-stream latency. Traditional speculative decoding speeds up generation by running a small separate draft model to guess the next few tokens, which the big model then verifies in one pass. MTP folds that idea into the model artifact itself: a lightweight prediction head proposes several tokens per forward pass, so you get the speculative-decoding speedup without sourcing, sizing, and aligning a separate draft model.
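The draft-then-verify loop described above can be sketched in a few lines of Python. This is a toy illustration with greedy decoding, not llama.cpp's implementation: target_next and draft_next are hypothetical stand-ins for the full model and the draft head, and a real engine scores all drafted positions in one batched forward pass instead of the inner loop used here.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def target_next(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the big model: a fixed deterministic rule."""
    return (ctx[-1] * 2 + 1) % 11


def draft_next(ctx: List[Token]) -> Token:
    """Hypothetical draft head: agrees with the target except after token 5."""
    if ctx[-1] == 5:
        return 10  # deliberately wrong guess (the target would emit 0)
    return target_next(ctx)


def baseline_generate(prompt: List[Token], n_new: int, target: NextToken) -> List[Token]:
    """Plain autoregressive decoding: one target pass per token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_generate(
    prompt: List[Token], n_new: int, n_draft: int,
    target: NextToken, draft: NextToken,
) -> Tuple[List[Token], int]:
    """Return (tokens, number of verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_new:
        # 1) The draft proposes n_draft tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(n_draft):
            proposal.append(draft(out + proposal))

        # 2) The target verifies. Counted as ONE pass because a real engine
        #    evaluates every drafted position in a single batched forward.
        passes += 1
        accepted: List[Token] = []
        for i in range(n_draft + 1):
            t = target(out + accepted)
            accepted.append(t)  # the target's own token is always kept
            if i == n_draft or proposal[i] != t:
                break  # all drafts used up, or first mismatch corrected
        out.extend(accepted)
    return out[: len(prompt) + n_new], passes


if __name__ == "__main__":
    tokens, passes = speculative_generate([1], 20, 3, target_next, draft_next)
    assert tokens == baseline_generate([1], 20, target_next)
    print(f"20 tokens in {passes} verification passes")
```

The output matches plain greedy decoding token for token; the only thing that changes is how many target passes it takes to get there, which is where the latency saving comes from.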

The numbers, from the merge

On the merged Qwen3.6-27B path, PR #22673 reports about a 2.4x wall-clock speedup with 3 draft tokens (83.8 seconds versus a 201-second baseline) at a 72.18% acceptance rate, and roughly 2.2x with 2 draft tokens at 82.58% acceptance. It was also validated on the Qwen3.6-35B-A3B MoE variant. The design loads the MTP head from the same GGUF file as the main model, so nothing extra has to be distributed, and the head keeps its own context and KV cache.
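A quick back-of-envelope check shows how the reported figures fit together. The sketch below rests on assumptions that are not stated in the PR summary: that "acceptance rate" means accepted draft tokens divided by proposed draft tokens, that every verification pass proposes the full draft count, and that each pass also yields one token from the main model. The function names are illustrative only.

```python
def wall_clock_speedup(baseline_s: float, mtp_s: float) -> float:
    """Speedup as the ratio of baseline time to MTP time."""
    return baseline_s / mtp_s


def tokens_per_pass(n_draft: int, acceptance: float) -> float:
    """Average tokens emitted per verification pass.

    Assumes acceptance = accepted drafts / proposed drafts, and that each
    pass adds one token produced by the main model itself.
    """
    return 1 + n_draft * acceptance


def implied_pass_cost(n_draft: int, acceptance: float, speedup: float) -> float:
    """Cost of one draft-plus-verify pass, in units of a baseline decode step."""
    return tokens_per_pass(n_draft, acceptance) / speedup


if __name__ == "__main__":
    s3 = wall_clock_speedup(201.0, 83.8)
    print(round(s3, 2))                                 # 2.4
    print(round(tokens_per_pass(3, 0.7218), 2))         # 3.17
    print(round(implied_pass_cost(3, 0.7218, s3), 2))   # 1.32
    print(round(implied_pass_cost(2, 0.8258, 2.2), 2))  # 1.21
```

Read this way, the 3-draft run emits about 3.2 tokens per pass but each pass costs roughly 1.3 baseline steps, which is why the realized gain is 2.4x and not 3.2x. The gap is an estimate of drafting and verification overhead under the stated assumptions, not a measured figure.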

The Gemma 4 PR extends this to a different model family. It reports over 2x on the dense 31B model, with acceptance ranging from about 43% to 70% depending on configuration, and one example moving from roughly 40 tok/s to about 100 tok/s. Q8-quantized runs land around 1.74x to 1.97x. The PR covers the 31B and 26B-A4B variants and excludes the smaller E4B and E2B.

Model / PR	Draft tokens	Acceptance	Speedup
Qwen3.6-27B (PR #22673, merged)	3	~72%	~2.4x
Qwen3.6-27B (PR #22673, merged)	2	~83%	~2.2x
Gemma 4 dense 31B (PR #23398, in review)	varies	~43-70%	over 2x
Gemma 4 31B Q8 (PR #23398, in review)	varies	varies	~1.74-1.97x

One subtlety: Gemma does it differently

There is an architectural fork worth knowing. The Qwen-style path uses an MTP head packaged inside the same weights. Gemma 4, by contrast, ships separate Google-trained “assistant” / drafter models (the Gemma4AssistantForCausalLM class) aligned to Gemma 4’s own output distribution, plus new scaling tensors that needed custom mappings in the loader. Both approaches chase the same goal, a high acceptance rate so the verifier rarely rejects, but the plumbing and the files you fetch differ. A separate libllama MTP API (PR #18886) is still in draft, so the public C API for this is not finalized yet even though the server path is usable.

Why it matters for local rigs

Acceptance rate is the whole game, and it is workload-dependent. Predictable text such as code and structured output accepts at the high end of these ranges; free-form prose accepts less, and the realized speedup falls accordingly. Community reports outside the PRs cluster closer to 1.7x to 1.9x for short, varied generations, which is the honest expectation for interactive chat as opposed to batch code completion. The win is real, but it is a latency win for one stream at a time, not a throughput win for a busy multi-user server, where continuous batching already saturates the GPU.
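To see why lower acceptance cuts the speedup, and why the best draft count shifts with it, here is a simplified cost model. It is a sketch under loud assumptions: each drafted token is accepted independently with probability p, a rejection discards all later drafts in that pass, and one draft step costs a fixed fraction of a main-model step. The default of 0.1 for that fraction is a made-up illustrative value, not a llama.cpp measurement.

```python
def expected_tokens_per_pass(p: float, n_draft: int) -> float:
    """Expected tokens emitted per verification pass.

    Draft i survives only if drafts 0..i are all accepted (probability
    p**(i+1)), and the main model always contributes one token, so the
    total is the sum of p**i for i = 0..n_draft.
    """
    return sum(p ** i for i in range(n_draft + 1))


def estimated_speedup(p: float, n_draft: int, draft_cost: float = 0.1) -> float:
    """Tokens per pass divided by pass cost, relative to plain decoding.

    draft_cost is a hypothetical cost of one draft step, expressed as a
    fraction of one main-model step. Verification is counted as one step.
    """
    pass_cost = 1.0 + n_draft * draft_cost
    return expected_tokens_per_pass(p, n_draft) / pass_cost


def best_n_draft(p: float, max_draft: int = 8, draft_cost: float = 0.1) -> int:
    """Draft count in 1..max_draft with the highest estimated speedup."""
    return max(range(1, max_draft + 1),
               key=lambda k: estimated_speedup(p, k, draft_cost))


if __name__ == "__main__":
    for p in (0.8, 0.4):
        row = ", ".join(f"k={k}: {estimated_speedup(p, k):.2f}x" for k in (1, 2, 4, 8))
        print(f"p={p}: {row} (best k={best_n_draft(p)})")
```

In this toy model, high-acceptance workloads keep improving as drafts are added, while at low acceptance the curve peaks early and a long draft can drop below 1x, meaning slower than no speculation at all. Real numbers depend on hardware and on how the engine schedules drafting, so treat the shape of the curve, not the values, as the takeaway.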

Practitioner note

If you run a single-user coding assistant locally, this is the cheapest speedup available right now: pull a recent llama.cpp build, use an MTP-capable GGUF (Qwen3.6 today, Gemma 4 once #23398 lands), and start with 2-3 draft tokens. Watch the reported acceptance rate as your real benchmark, not the marketing multiplier, and tune draft-token count to your prompts; too many drafts on low-acceptance prose can erase the gain. Verify your build actually exposes the MTP server flags before assuming it is active, since the feature is recent and the C API is still in flux.
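One way to check what a given build exposes is to scan the server's help output for speculative-decoding options. The sketch below assumes the binary is named llama-server and is on PATH. The keyword list is a guess, since this article does not name the actual MTP flags, so treat any match as a pointer to read the help text and not as proof that MTP is active.

```python
import shutil
import subprocess
from typing import List, Optional

# Guessed keywords: the article does not give the real MTP flag names.
GUESSED_KEYWORDS = ["mtp", "draft", "spec"]


def matching_help_lines(help_text: str, keywords: List[str]) -> List[str]:
    """Return help lines that mention any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [
        line.strip()
        for line in help_text.splitlines()
        if any(k in line.lower() for k in lowered)
    ]


def read_server_help(binary: str = "llama-server") -> Optional[str]:
    """Run `<binary> --help` and return its output, or None if not found."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--help"], capture_output=True, text=True, timeout=30, check=False
    )
    # Some builds print help to stderr, so combine both streams.
    return result.stdout + result.stderr


if __name__ == "__main__":
    text = read_server_help()
    if text is None:
        print("llama-server not found on PATH")
    else:
        hits = matching_help_lines(text, GUESSED_KEYWORDS)
        print("\n".join(hits) if hits else "no speculative-decoding options found in --help")
```

A broad keyword such as "spec" will also match unrelated lines, so narrow the list once you know the real option names for your build.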

Under-considered angle

Everyone quotes the speedup, but the strategic shift is who owns the draft model. With separately trained drafters like Gemma 4’s assistant variants, the model vendor controls acceptance quality, which can quietly become a moat: a first-party drafter aligned to the exact output distribution should beat any community-bolted generic small model. That centralizes an optimization that used to be a community tinkering space, and it raises a quieter question for self-hosters, namely whether a vendor could ship a strong base model with a deliberately mediocre drafter and keep the fast path behind a different license or release cadence.