Builder Daily

2026-05-01

vLLM vs llama.cpp vs Ollama on DGX Spark — which inference stack to use

Decision guide for inference stacks on GB10: vLLM wins for MoE models plus concurrency, llama.cpp for MXFP4 prompt processing and single-user work, Ollama for zero-config development. Includes an NVFP4 tok/s comparison.

Three inference stacks dominate DGX Spark deployments: vLLM, llama.cpp, and Ollama. They have meaningfully different tradeoffs on GB10’s SM121 architecture.

At a glance

|  | vLLM | llama.cpp | Ollama |
| --- | --- | --- | --- |
| NVFP4 support | ✅ (cu130-nightly) | ✅ (PR #22196) | ⚠️ via llama.cpp backend |
| MoE models | ✅ best | ✅ good | ✅ good |
| Multi-user concurrency | ✅ excellent | ⚠️ limited | ⚠️ limited |
| MTP speculative decoding | ✅ | ❌ | ❌ |
| Setup complexity | High | Medium | Low |
| OpenAI-compatible API | ✅ | ✅ (llama-server) | ✅ |
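
All three stacks speak the same OpenAI-compatible protocol (llama.cpp via llama-server), so one client script can A/B them by swapping the base URL. A minimal sketch using the openai Python client; the ports are the usual defaults for each server, not guarantees, and the model name is a placeholder:

```python
from openai import OpenAI

# One client, three stacks: swap base_url to compare them.
# Typical default endpoints (verify for your setup):
#   vLLM:         http://localhost:8000/v1
#   llama-server: http://localhost:8080/v1
#   Ollama:       http://localhost:11434/v1
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="my-model",  # placeholder: use the name your server reports at /v1/models
    messages=[{"role": "user", "content": "One sentence on MoE routing."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```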

Single-user throughput: Qwen3.6-35B-A3B

| Stack | Quantization | Single-user tok/s |
| --- | --- | --- |
| vLLM (FP8, no MTP) | FP8 | 28–33 |
| vLLM (NVFP4, no MTP) | NVFP4 | ~42 |
| vLLM (NVFP4 + MTP-1) | NVFP4 | 55.9 |
| llama.cpp (NVFP4) | NVFP4 | ~38 |
| llama.cpp (MXFP4) | MXFP4 | ~43 |
| Ollama (default Q4) | Q4_K_M | ~24 |

vLLM with MTP-1 wins single-user throughput by a large margin, but that 55.9 tok/s requires the cu130-nightly container and the explicit --moe-backend=flashinfer_cutlass flag. Without both, vLLM falls below llama.cpp.
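
To sanity-check decode numbers like these on your own unit, it's enough to stream a completion and count chunks against wall-clock time. A rough probe, not the harness behind the table above; it assumes a local OpenAI-compatible server on port 8000 and treats one stream chunk as roughly one token:

```python
import time

from openai import OpenAI

# Rough single-user decode tok/s probe (one chunk ≈ one token).
# Assumes a local OpenAI-compatible server; model name is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Write 300 words about GPUs."}],
    max_tokens=512,
    stream=True,
)

tokens, t_first = 0, None
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if t_first is None:
            t_first = time.monotonic()  # start the clock at the first token
        tokens += 1

if t_first is None:
    raise RuntimeError("server streamed no content")
print(f"~{tokens / (time.monotonic() - t_first):.1f} tok/s (decode only, excludes prefill)")
```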

Concurrency: where vLLM dominates

At c=32 concurrent users, vLLM’s continuous batching and paged KV-cache make the difference:

| Stack | c=32 total tok/s |
| --- | --- |
| vLLM (NVFP4 + MTP) | 433 |
| llama.cpp (llama-server, NVFP4) | ~95 |
| Ollama | ~60 |

llama.cpp’s sequential KV-cache means it can’t pipeline 32 users efficiently. For production serving with real concurrency, vLLM is the correct choice.
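
A c=32 run is equally easy to approximate: fire 32 requests concurrently and divide total completion tokens by wall-clock time. A sketch with asyncio and the async openai client (endpoint, model name, and prompts are placeholders); a stack with continuous batching should report aggregate tok/s far above its single-user figure:

```python
import asyncio
import time

from openai import AsyncOpenAI

# Aggregate-throughput probe at c=32: total completion tokens / wall time.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

async def one_request(i: int) -> int:
    resp = await client.chat.completions.create(
        model="my-model",  # placeholder
        messages=[{"role": "user", "content": f"User {i}: summarize MoE routing."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def main() -> None:
    t0 = time.monotonic()
    counts = await asyncio.gather(*(one_request(i) for i in range(32)))
    elapsed = time.monotonic() - t0
    print(f"{sum(counts) / elapsed:.0f} total tok/s at c=32")

asyncio.run(main())
```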

When to use each stack

Use vLLM when:

- You serve multiple concurrent users: continuous batching and the paged KV-cache are what deliver 433 total tok/s at c=32
- You run MoE models and want MTP speculative decoding for peak single-user throughput
- You can absorb the setup cost: cu130-nightly container, NVFP4 weights, and the --moe-backend=flashinfer_cutlass flag

Use llama.cpp when:

- You are mostly single-user and want strong throughput with medium setup effort (MXFP4 at ~43 tok/s)
- You still want an OpenAI-compatible API via llama-server without vLLM's complexity

Use Ollama when:

- You want zero-config local development and can trade throughput for convenience

The TRITON-only gotcha

On SM121, FP8 MoE in vLLM is TRITON-only; FLASHINFER, CUTLASS, and DEEPGEMM are not available for FP8. This is why untuned vLLM FP8 underperforms: it falls back to the slower TRITON path. NVFP4, by contrast, can use flashinfer_cutlass via the explicit flag, which is how the 55.9 tok/s figure becomes achievable.

If you see vLLM FP8 performing worse than llama.cpp, this is why — set --moe-backend=flashinfer_cutlass and switch to NVFP4.

