2026-05-09
llama.cpp lands Gemma 4 26B-A4B NVFP4 (b9080) and MiMo-V2.5 attention kernels (b9085)
llama.cpp b9080–b9085 add native Gemma 4 26B-A4B NVFP4 (52 tok/s on Spark, 82 GB free for KV) and MiMo-V2.5 flash-attention tiles for d_kq=192/d_v=128 GQA shapes.
Across a 48-hour window on May 8–9, 2026, llama.cpp shipped two releases highly relevant to DGX Spark operators: native NVFP4 support for Google’s Gemma 4 26B-A4B MoE, and flash-attention kernels for Xiaomi’s MiMo-V2.5 (310B sparse MoE).
b9080 — Gemma 4 26B-A4B NVFP4 native
PR #22804 (merged May 8, 18:42 UTC, signed-off by NVIDIA’s ynankani) adds native GGUF conversion for Gemma 4 26B-A4B NVFP4. Architecture: 25.2B total / 3.8B active per token MoE — frontier multimodal capability (image-in, text-out) at a sub-4B active-parameter compute budget. Until this PR, community NVFP4 conversions required hand-patching scale tensors, which broke on every model-weight refresh.
Measured on DGX Spark (community benchmark in the PR thread): 52 tok/s single-user, 16.5 GB used, 82 GB free for KV cache. That free-memory number is the headline. With 82 GB available, Gemma 4 26B-A4B can hold a context window beyond 1M tokens at NVFP4 — enough for whole-codebase agents on a single Spark.
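For a rough sanity check on that context claim, here is a back-of-envelope KV budget you can adapt. The layer/head/dim values below are placeholders, not figures from the PR thread; if Gemma 4 follows Gemma 3’s interleaved sliding-window/global attention, only the global layers’ KV grows with context, so read the real shape out of your GGUF metadata before trusting the result.
# KV-cache back-of-envelope (placeholder shape; substitute values from your model's GGUF metadata)
layers_global=8; kv_heads=4; d_k=256; d_v=256; bytes=2   # bytes=2 assumes an fp16 KV cache
kv_per_token=$(( layers_global * kv_heads * (d_k + d_v) * bytes ))
echo "KV bytes/token (global-attention layers): $kv_per_token"
echo "tokens that fit in 82 GB: $(( 82 * 1024 * 1024 * 1024 / kv_per_token ))"
With these placeholder numbers the budget works out to roughly 2.7M tokens, which is why the 1M+ figure is plausible; the real answer depends entirely on the shipped attention shape and the KV-cache precision you run.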
b9085 — MiMo-V2.5 attention kernels
PR #22812 (merged May 9, 03:28 UTC) adds flash-attention MMA tiles for d_kq=192 / d_v=128. These are the head dimensions used by Xiaomi’s MiMo-V2.5 (310B sparse MoE, 15B active, 1M context, omnimodal). MiMo-V2.5 itself is too large for a single Spark — but the attention kernel ships independently and is reusable for any model that uses GQA with the same head shape.
In practice, any future open-weights model beyond 100B parameters that uses GQA with the same head shape now gets a CUDA-optimized fast path on consumer Blackwell hardware, with no further upstream patches needed.
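A quick way to check whether a given GGUF hits this path is to read its attention metadata. The sketch below leans on the gguf-dump CLI that ships with the gguf Python package; the model path is a placeholder, and the key_length/value_length fields (prefixed with the architecture name) only appear when the converter wrote them, so fall back to head_count and embedding_length if they are absent.
# Check a model's attention head shape against the new d_kq=192 / d_v=128 tiles
pip install gguf   # provides the gguf-dump CLI
gguf-dump ~/models/some-model.gguf | grep -E 'attention\.(key_length|value_length|head_count)'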
Adjacent improvements in the same window
- b9075 — CUDA snake-activation fusion (reduces kernel launches in MoE routers)
- b9066 — out_prod batched cuBLAS GEMM (faster training-style outer products, useful for in-context fine-tuning)
- b9082 — L2_NORM Hexagon HTP kernel (relevant if you offload draft models to Qualcomm DSP rigs paired with Spark)
The cadence here is notable: NVIDIA staff are actively contributing to llama.cpp main, not just maintaining a downstream fork. NVFP4 fixes that took weeks to bubble up in early 2026 are now landing within days of being filed.
What to do
# Rebuild llama.cpp at b9080 or later
cd llama.cpp && git fetch && git checkout b9085
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121 \
-DLLAMA_CURL=OFF -DGGML_NATIVE=ON
cmake --build build --config Release -j
# Pull Gemma 4 26B-A4B NVFP4 from NVIDIA's HF org
huggingface-cli download nvidia/Gemma-4-26B-A4B-NVFP4 \
--local-dir ~/models/gemma4-26b-a4b-nvfp4
# Bench
./build/bin/llama-bench -m ~/models/gemma4-26b-a4b-nvfp4/Gemma-4-26B-A4B-NVFP4.gguf \
  --n-gpu-layers 99 --flash-attn 1 -p 512,2048 -n 128
Compare your numbers to the 52 tok/s reference. If yours come in lower, check that flash attention is actually engaged (look for FA: 1 in the log) and that you’re on b9080 or later — pre-b9080 builds fall back to the legacy NVFP4 path, which runs ~30–35% slower.
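Once the bench checks out, here is a minimal serving sketch to actually put that free KV budget to work. The context size, host, and port are arbitrary picks, not recommendations from the PR thread; scale -c to whatever your agent workload needs and watch the startup log for the KV-cache allocation it reports.
# Serve with a long context to use the spare KV budget (illustrative values)
./build/bin/llama-server \
  -m ~/models/gemma4-26b-a4b-nvfp4/Gemma-4-26B-A4B-NVFP4.gguf \
  -c 262144 -ngl 99 --host 0.0.0.0 --port 8080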