2026-05-02
llama.cpp NVFP4 and MXFP4 build guide for GB10 (SM121)
Step-by-step build flags for llama.cpp NVFP4/MXFP4 on DGX Spark GB10 (SM121). gpt-oss-120B MXFP4 hits pp2048=1,980 tok/s and tg32=35 tok/s after the PR #22196 merge.
NVFP4 support landed in llama.cpp with PR #22196 (merged late April 2026). Building for GB10 requires specific CUDA architecture flags — here is the complete guide.
Build flags
GB10’s SM121 (compute capability 12.1) is a consumer-Blackwell derivative. The standard CMAKE_CUDA_ARCHITECTURES value is 121 (stable NVFP4); 121a-real enables the experimental MXFP4 path:
# Standard NVFP4 build (stable, recommended)
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES=121 \
-DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
# Experimental MXFP4 build (higher throughput, less stable)
cmake -B build-mxfp4 \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES="121a-real" \
-DGGML_CUDA_MXFP4=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build-mxfp4 -j$(nproc)
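To confirm the kernels were actually compiled for SM121, you can list the cubins embedded in the CUDA backend with cuobjdump. A minimal sketch, assuming the default llama.cpp shared-library layout; adjust the paths if libggml-cuda.so lands elsewhere in your build tree:
# List embedded ELF images and look for the sm_121 / sm_121a cubins
cuobjdump --list-elf build/bin/libggml-cuda.so | grep sm_121
cuobjdump --list-elf build-mxfp4/bin/libggml-cuda.so | grep sm_121a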
ARM CPU optimization: GB10 pairs 10 Cortex-X925 and 10 Cortex-A725 Arm cores. GCC 15+ exposes the best NEON/SVE2 code paths:
export CFLAGS="-mcpu=gb10 -O3"
export CXXFLAGS="-mcpu=gb10 -O3"
If your distro ships GCC 14 or earlier, use -mcpu=neoverse-v2 as a fallback.
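If you are unsure which flag your toolchain accepts, a quick compile check settles it. This is a minimal sketch; it only asks the compiler to build an empty translation unit with the flag:
# Succeeds silently if -mcpu=gb10 is recognized by this GCC
gcc -mcpu=gb10 -x c -c /dev/null -o /dev/null 2>/dev/null \
  && echo "-mcpu=gb10 supported" \
  || echo "-mcpu=gb10 not supported; use the fallback above"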
Benchmark numbers
Testing with gpt-oss-120B (OpenAI’s open-weight MoE model) in MXFP4:
./build-mxfp4/bin/llama-bench \
-m gpt-oss-120b-mxfp4.gguf \
-p 2048 -n 32 -t 1 -ngl 99
| Precision | Prompt tok/s (pp2048) | Generate tok/s (tg32) |
|---|---|---|
| Q4_K_M | 680 | 22.1 |
| NVFP4 (SM121) | 1,420 | 29.8 |
| MXFP4 (SM121a-real) | 1,980 | 35.0 |
MXFP4 is a 40% throughput gain over NVFP4 for prompt processing. Generation gain is more modest (~17%) because the bottleneck shifts to memory bandwidth at low batch sizes.
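To reproduce the table, run the same llama-bench invocation once per row, each against its matching build and GGUF. A sketch; the Q4_K_M and NVFP4 file names below are placeholders, not files referenced elsewhere in this post:
# One run per table row (file names are illustrative)
./build/bin/llama-bench       -m gpt-oss-120b-q4_k_m.gguf -p 2048 -n 32 -t 1 -ngl 99
./build/bin/llama-bench       -m gpt-oss-120b-nvfp4.gguf  -p 2048 -n 32 -t 1 -ngl 99
./build-mxfp4/bin/llama-bench -m gpt-oss-120b-mxfp4.gguf  -p 2048 -n 32 -t 1 -ngl 99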
Batch benchmark
For multi-user scenarios, use llama-batched-bench:
./build-mxfp4/bin/llama-batched-bench \
-m gpt-oss-120b-mxfp4.gguf \
-ngl 99 -c 131072 \
--batch 512,1024,2048,4096 \
--ubatch 512
At batch=4096, the MXFP4 build sustains roughly 820 tok/s of total output, comparable to what vLLM delivers at a concurrency of 8 for the same model.
Quantization format: what to download
MXFP4 GGUFs are not yet widely available on Hugging Face — you may need to quantize locally:
python3 convert_hf_to_gguf.py \
/path/to/gpt-oss-120b-bf16 \
--outtype mxfp4 \
--outfile gpt-oss-120b-mxfp4.gguf
NVFP4 GGUFs (the *-NVFP4 suffix files) work with the standard 121 build and are available from RedHatAI and bartowski on Hugging Face.
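Whichever file you end up with, a short llama-cli run is a cheap smoke test before committing to long benchmarks. A sketch; the prompt and token count are arbitrary:
# Load the GGUF, offload all layers, and generate a handful of tokens
./build-mxfp4/bin/llama-cli \
  -m gpt-oss-120b-mxfp4.gguf \
  -ngl 99 -n 32 \
  -p "Explain the difference between NVFP4 and MXFP4 in one sentence."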
Known issues
- 121a-real is experimental. You may hit CUDA_ERROR_INVALID_DEVICE_FUNCTION if the kernel falls back to an unsupported path. The stable 121 build does not have this issue.
- Context > 64K with MXFP4 causes OOM on 128 GB unified memory for 120B models; cap --ctx-size at 65536 until the chunked-attention path is optimized.
- llama-server with MXFP4 is stable for single-user serving. Multi-user concurrency above 4 shows occasional KV-cache corruption (tracked in issue #22401).
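Staying inside those limits, a single-node llama-server launch looks like the sketch below; the flags are the standard llama-server ones, and the port and parallelism values are arbitrary:
# Cap context at 64K and concurrency at 4, per the issues above
./build-mxfp4/bin/llama-server \
  -m gpt-oss-120b-mxfp4.gguf \
  -ngl 99 -c 65536 \
  --parallel 4 \
  --host 0.0.0.0 --port 8080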