2026-05-01
vLLM vs llama.cpp vs Ollama on DGX Spark — which inference stack to use
Decision guide for inference stacks on GB10: vLLM wins for MoE+concurrency, llama.cpp for MXFP4 prompts and single-user, Ollama for zero-config dev. Includes NVFP4 tok/s comparison.
Three inference stacks dominate DGX Spark deployments: vLLM, llama.cpp, and Ollama. They have meaningfully different tradeoffs on GB10’s SM121 architecture.
At a glance
| Feature | vLLM | llama.cpp | Ollama |
|---|---|---|---|
| NVFP4 support | ✅ (cu130-nightly) | ✅ (PR #22196) | ⚠️ via llama.cpp backend |
| MoE models | ✅ best | ✅ good | ✅ good |
| Multi-user concurrency | ✅ excellent | ⚠️ limited | ⚠️ limited |
| MTP speculative decoding | ✅ | ❌ | ❌ |
| Setup complexity | High | Medium | Low |
| OpenAI-compatible API | ✅ | ✅ (llama-server) | ✅ |
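Because all three expose an OpenAI-compatible endpoint, the same client code works against each stack, which makes switching between them cheap during evaluation. A minimal sketch, assuming the usual default ports (vLLM on 8000, llama-server on 8080, Ollama's /v1 shim on 11434) and a placeholder model name:

```python
# Same client code targets any of the three stacks; only base_url changes.
# Ports are assumed defaults and the model name is a placeholder.
from openai import OpenAI

BACKENDS = {
    "vllm": "http://localhost:8000/v1",
    "llama.cpp": "http://localhost:8080/v1",
    "ollama": "http://localhost:11434/v1",
}

client = OpenAI(base_url=BACKENDS["vllm"], api_key="not-needed")  # local servers ignore the key
resp = client.chat.completions.create(
    model="qwen3",  # placeholder; use the name your server reports under /v1/models
    messages=[{"role": "user", "content": "One sentence on GB10 unified memory."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```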
Single-user throughput: Qwen3.6-35B-A3B
| Stack | Quantization | Single-user tok/s |
|---|---|---|
| vLLM (FP8, no MTP) | FP8 | 28–33 |
| vLLM (NVFP4, no MTP) | NVFP4 | ~42 |
| vLLM (NVFP4 + MTP-1) | NVFP4 | 55.9 |
| llama.cpp (NVFP4) | NVFP4 | ~38 |
| llama.cpp (MXFP4) | MXFP4 | ~43 |
| Ollama (default Q4) | Q4_K_M | ~24 |
vLLM with MTP-1 wins single-user throughput by a large margin, but that 55.9 tok/s requires the cu130-nightly container and the explicit --moe-backend=flashinfer_cutlass flag. Without those, vLLM falls below llama.cpp.
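To sanity-check numbers like these on your own box, a quick streaming probe against whichever server is running is enough. The sketch below is not the harness behind the table; it assumes the default vLLM port and a placeholder model name, and it approximates token count from streamed chunks.

```python
# Rough single-user decode-rate probe (not the harness behind the table above).
# Streams a completion and divides chunk count by the time from first to last
# chunk; chunk count is only an approximation of token count.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # assumed vLLM default port

stream = client.chat.completions.create(
    model="qwen3",  # placeholder model name
    messages=[{"role": "user", "content": "Write a 500-word story about a GPU."}],
    max_tokens=512,
    stream=True,
)

first = None
chunks = 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first is None:
            first = time.perf_counter()
        chunks += 1
last = time.perf_counter()

if chunks:
    print(f"~{chunks / max(last - first, 1e-6):.1f} tok/s decode (approximate)")
```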
Concurrency: where vLLM dominates
At c=32 concurrent users, vLLM’s continuous batching and paged KV-cache make the difference:
| Stack | c=32 total tok/s |
|---|---|
| vLLM (NVFP4 + MTP) | 433 |
| llama.cpp (llama-server, NVFP4) | ~95 |
| Ollama | ~60 |
llama.cpp’s sequential KV-cache means it can’t pipeline 32 users efficiently. For production serving with real concurrency, vLLM is the correct choice.
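One way to approximate a c=32 aggregate-throughput measurement is to fire the requests concurrently with asyncio and sum the completion tokens reported by the server. The sketch below assumes the default vLLM port, a placeholder model name, and a synthetic prompt mix, so treat it as a rough probe rather than a proper benchmark harness.

```python
# Sketch of a c=32 aggregate-throughput probe against an OpenAI-compatible
# server. Port, model name, and prompts are assumptions; a real benchmark
# should use a dedicated harness with controlled prompt/output lengths.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
CONCURRENCY = 32


async def one_request(i: int) -> int:
    resp = await client.chat.completions.create(
        model="qwen3",  # placeholder
        messages=[{"role": "user", "content": f"Explain topic #{i} in about 300 words."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens


async def main() -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request(i) for i in range(CONCURRENCY)))
    elapsed = time.perf_counter() - start
    print(f"{sum(tokens)} tokens in {elapsed:.1f}s = {sum(tokens) / elapsed:.0f} tok/s aggregate")


asyncio.run(main())
```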
When to use each stack
Use vLLM when:
- Running MoE models (Qwen3, Mixtral) where the flashinfer_cutlass backend gives an extra 30% over TRITON-only
- Serving multiple concurrent users (>4)
- You need speculative decoding (MTP) for latency-sensitive single-user workloads
- You want Prometheus metrics and an OpenAI-compatible API out of the box (see the metrics sketch below)
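On the metrics point: vLLM serves Prometheus text under /metrics on the API port, so a plain HTTP GET is enough to inspect it. The vllm: prefix below matches recent releases, but the exact metric names shift between versions.

```python
# Dump vLLM's Prometheus counters, served at /metrics on the API port.
# The "vllm:" prefix matches recent releases; exact metric names vary by version.
import requests

text = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in text.splitlines():
    if line.startswith("vllm:"):
        print(line)
```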
Use llama.cpp when:
- You need MXFP4 precision (highest prompt throughput, not yet in vLLM)
- Building a local dev setup without Docker
- Running a model that doesn’t have an NVFP4 HuggingFace upload yet (quantize locally with convert_hf_to_gguf.py)
- You want minimal dependencies — a single binary, no Python environment
Use Ollama when:
- Prototyping or developer-only workloads where zero-config beats raw performance
- You want GUI frontends (Open WebUI, Continue.dev) that target the Ollama API (see the sketch after this list)
- Running smaller models (≤14B) where the overhead doesn’t matter
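Those frontends speak Ollama's native API rather than the /v1 compatibility layer, and hitting it directly is a one-liner. A sketch, assuming the default port and a placeholder model tag:

```python
# Call Ollama's native chat API directly (distinct from its /v1 OpenAI shim).
# The model tag is a placeholder for whatever you have pulled with `ollama pull`.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3:14b",  # placeholder tag
        "messages": [{"role": "user", "content": "Hello from DGX Spark"}],
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["message"]["content"])
```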
The TRITON-only gotcha
On SM121, FP8 MoE in vLLM is TRITON-only — FLASHINFER, CUTLASS, and DEEPGEMM are not available for FP8. This is why untuned vLLM FP8 underperforms: it falls back to the slower backend. NVFP4 gets flashinfer_cutlass via the explicit flag, which is how 55.9 tok/s becomes achievable.
If you see vLLM FP8 performing worse than llama.cpp, this is why — set --moe-backend=flashinfer_cutlass and switch to NVFP4.