2026-05-02
llama.cpp NVFP4 and MXFP4 build guide for GB10 (SM121)
Step-by-step build flags for llama.cpp NVFP4/MXFP4 on DGX Spark GB10 (SM121). gpt-oss-120B MXFP4 hits pp2048=1,980 tok/s and tg32=35 tok/s after the PR #22196 merge.
NVFP4 support landed in llama.cpp with PR #22196 (merged late April 2026). Building for GB10 requires specific CUDA architecture flags — here is the complete guide.
Build flags
GB10’s SM121 (compute capability 12.1) is a consumer-Blackwell derivative. The standard CMAKE_CUDA_ARCHITECTURES value is 121 (stable NVFP4); 121a-real enables the experimental MXFP4 path:
# Standard NVFP4 build (stable, recommended)
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES=121 \
-DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
# Experimental MXFP4 build (higher throughput, less stable)
cmake -B build-mxfp4 \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES="121a-real" \
-DGGML_CUDA_MXFP4=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build-mxfp4 -j$(nproc)
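To confirm the kernels were actually compiled for SM121, you can list the cubins embedded in the CUDA backend with cuobjdump. A minimal sketch, assuming the default llama.cpp shared-library layout; adjust the paths if libggml-cuda.so lands elsewhere in your build tree:
# List embedded ELF images and look for the sm_121 / sm_121a cubins
cuobjdump --list-elf build/bin/libggml-cuda.so | grep sm_121
cuobjdump --list-elf build-mxfp4/bin/libggml-cuda.so | grep sm_121a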
ARM CPU optimization: GB10 pairs 10 Cortex-X925 and 10 Cortex-A725 Arm cores. GCC 15+ exposes the best NEON/SVE2 code paths:
export CFLAGS="-mcpu=gb10 -O3"
export CXXFLAGS="-mcpu=gb10 -O3"
If your distro ships GCC 14 or earlier, use -mcpu=neoverse-v2 as a fallback.
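If you are unsure which flag your toolchain accepts, a quick compile check settles it. This is a minimal sketch; it only asks the compiler to build an empty translation unit with the flag:
# Succeeds silently if -mcpu=gb10 is recognized by this GCC
gcc -mcpu=gb10 -x c -c /dev/null -o /dev/null 2>/dev/null \
  && echo "-mcpu=gb10 supported" \
  || echo "-mcpu=gb10 not supported; use the fallback above"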
Benchmark numbers
Testing with gpt-oss-120B (OpenAI’s open-weight MoE model) in MXFP4:
./build-mxfp4/bin/llama-bench \
-m gpt-oss-120b-mxfp4.gguf \
-p 2048 -n 32 -t 1 -ngl 99
| Precision | Prompt tok/s (pp2048) | Generate tok/s (tg32) |
|---|---|---|
| Q4_K_M | 680 | 22.1 |
| NVFP4 (SM121) | 1,420 | 29.8 |
| MXFP4 (SM121a-real) | 1,980 | 35.0 |
MXFP4 is a 40% throughput gain over NVFP4 for prompt processing. Generation gain is more modest (~17%) because the bottleneck shifts to memory bandwidth at low batch sizes.
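To reproduce the table, run the same llama-bench invocation once per row, each against its matching build and GGUF. A sketch; the Q4_K_M and NVFP4 file names below are placeholders, not files referenced elsewhere in this post:
# One run per table row (file names are illustrative)
./build/bin/llama-bench       -m gpt-oss-120b-q4_k_m.gguf -p 2048 -n 32 -t 1 -ngl 99
./build/bin/llama-bench       -m gpt-oss-120b-nvfp4.gguf  -p 2048 -n 32 -t 1 -ngl 99
./build-mxfp4/bin/llama-bench -m gpt-oss-120b-mxfp4.gguf  -p 2048 -n 32 -t 1 -ngl 99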
Batch benchmark
For multi-user scenarios, use llama-batched-bench:
./build-mxfp4/bin/llama-batched-bench \
-m gpt-oss-120b-mxfp4.gguf \
-ngl 99 -c 131072 \
--batch 512,1024,2048,4096 \
--ubatch 512
At batch=4096, the MXFP4 build sustains roughly 820 tok/s of total output, comparable to what vLLM delivers at a concurrency of 8 for the same model.
Quantization format: what to download
MXFP4 GGUFs are not yet widely available on Hugging Face — you may need to quantize locally:
python3 convert_hf_to_gguf.py \
/path/to/gpt-oss-120b-bf16 \
--outtype mxfp4 \
--outfile gpt-oss-120b-mxfp4.gguf
NVFP4 GGUFs (the *-NVFP4 suffix files) work with the standard 121 build and are available from RedHatAI and bartowski on Hugging Face.
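Whichever file you end up with, a short llama-cli run is a cheap smoke test before committing to long benchmarks. A sketch; the prompt and token count are arbitrary:
# Load the GGUF, offload all layers, and generate a handful of tokens
./build-mxfp4/bin/llama-cli \
  -m gpt-oss-120b-mxfp4.gguf \
  -ngl 99 -n 32 \
  -p "Explain the difference between NVFP4 and MXFP4 in one sentence."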
Known issues
- 121a-real is experimental. You may hit CUDA_ERROR_INVALID_DEVICE_FUNCTION if the kernel falls back to an unsupported path. The stable 121 build does not have this issue.
- Context > 64K with MXFP4 causes OOM on 128 GB unified memory for 120B models; cap --ctx-size at 65536 until the chunked-attention path is optimized.
- llama-server with MXFP4 is stable for single-user serving. Multi-user concurrency above 4 shows occasional KV-cache corruption (tracked in issue #22401).
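Staying inside those limits, a single-node llama-server launch looks like the sketch below; the flags are the standard llama-server ones, and the port and parallelism values are arbitrary:
# Cap context at 64K and concurrency at 4, per the issues above
./build-mxfp4/bin/llama-server \
  -m gpt-oss-120b-mxfp4.gguf \
  -ngl 99 -c 65536 \
  --parallel 4 \
  --host 0.0.0.0 --port 8080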