2026-05-01
vLLM vs llama.cpp vs Ollama on DGX Spark — which inference stack to use
Decision guide for inference stacks on GB10: vLLM wins for MoE+concurrency, llama.cpp for MXFP4 prompts and single-user, Ollama for zero-config dev. Includes NVFP4 tok/s comparison.
Three inference stacks dominate DGX Spark deployments: vLLM, llama.cpp, and Ollama. They have meaningfully different tradeoffs on GB10’s SM121 architecture.
At a glance
| Feature | vLLM | llama.cpp | Ollama |
|---|---|---|---|
| NVFP4 support | ✅ (cu130-nightly) | ✅ (PR #22196) | ⚠️ via llama.cpp backend |
| MoE models | ✅ best | ✅ good | ✅ good |
| Multi-user concurrency | ✅ excellent | ⚠️ limited | ⚠️ limited |
| MTP speculative decoding | ✅ | ❌ | ❌ |
| Setup complexity | High | Medium | Low |
| OpenAI-compatible API | ✅ | ✅ (llama-server) | ✅ |
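Because all three expose an OpenAI-compatible endpoint, the same client code works against each stack, which makes switching between them cheap during evaluation. A minimal sketch, assuming the usual default ports (vLLM on 8000, llama-server on 8080, Ollama's /v1 shim on 11434) and a placeholder model name:

```python
# Same client code targets any of the three stacks; only base_url changes.
# Ports are assumed defaults and the model name is a placeholder.
from openai import OpenAI

BACKENDS = {
    "vllm": "http://localhost:8000/v1",
    "llama.cpp": "http://localhost:8080/v1",
    "ollama": "http://localhost:11434/v1",
}

client = OpenAI(base_url=BACKENDS["vllm"], api_key="not-needed")  # local servers ignore the key
resp = client.chat.completions.create(
    model="qwen3",  # placeholder; use the name your server reports under /v1/models
    messages=[{"role": "user", "content": "One sentence on GB10 unified memory."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```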
Single-user throughput: Qwen3.6-35B-A3B
| Stack | Quantization | Single-user tok/s |
|---|---|---|
| vLLM (FP8, no MTP) | FP8 | 28–33 |
| vLLM (NVFP4, no MTP) | NVFP4 | ~42 |
| vLLM (NVFP4 + MTP-1) | NVFP4 | 55.9 |
| llama.cpp (NVFP4) | NVFP4 | ~38 |
| llama.cpp (MXFP4) | MXFP4 | ~43 |
| Ollama (default Q4) | Q4_K_M | ~24 |
vLLM with MTP-1 wins single-user throughput by a large margin, but that 55.9 tok/s requires the cu130-nightly container and the explicit --moe-backend=flashinfer_cutlass flag. Without those, vLLM falls below llama.cpp.
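To sanity-check numbers like these on your own box, a quick streaming probe against whichever server is running is enough. The sketch below is not the harness behind the table; it assumes the default vLLM port and a placeholder model name, and it approximates token count from streamed chunks.

```python
# Rough single-user decode-rate probe (not the harness behind the table above).
# Streams a completion and divides chunk count by the time from first to last
# chunk; chunk count is only an approximation of token count.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # assumed vLLM default port

stream = client.chat.completions.create(
    model="qwen3",  # placeholder model name
    messages=[{"role": "user", "content": "Write a 500-word story about a GPU."}],
    max_tokens=512,
    stream=True,
)

first = None
chunks = 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first is None:
            first = time.perf_counter()
        chunks += 1
last = time.perf_counter()

if chunks:
    print(f"~{chunks / max(last - first, 1e-6):.1f} tok/s decode (approximate)")
```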
Concurrency: where vLLM dominates
At c=32 concurrent users, vLLM’s continuous batching and paged KV-cache make the difference:
| Stack | c=32 total tok/s |
|---|---|
| vLLM (NVFP4 + MTP) | 433 |
| llama.cpp (llama-server, NVFP4) | ~95 |
| Ollama | ~60 |
llama.cpp’s sequential KV-cache means it can’t pipeline 32 users efficiently. For production serving with real concurrency, vLLM is the correct choice.
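One way to approximate a c=32 aggregate-throughput measurement is to fire the requests concurrently with asyncio and sum the completion tokens reported by the server. The sketch below assumes the default vLLM port, a placeholder model name, and a synthetic prompt mix, so treat it as a rough probe rather than a proper benchmark harness.

```python
# Sketch of a c=32 aggregate-throughput probe against an OpenAI-compatible
# server. Port, model name, and prompts are assumptions; a real benchmark
# should use a dedicated harness with controlled prompt/output lengths.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
CONCURRENCY = 32


async def one_request(i: int) -> int:
    resp = await client.chat.completions.create(
        model="qwen3",  # placeholder
        messages=[{"role": "user", "content": f"Explain topic #{i} in about 300 words."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens


async def main() -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request(i) for i in range(CONCURRENCY)))
    elapsed = time.perf_counter() - start
    print(f"{sum(tokens)} tokens in {elapsed:.1f}s = {sum(tokens) / elapsed:.0f} tok/s aggregate")


asyncio.run(main())
```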
When to use each stack
Use vLLM when:
- Running MoE models (Qwen3, Mixtral) where the flashinfer_cutlass backend gives an extra 30% over TRITON-only
- Serving multiple concurrent users (>4)
- You need speculative decoding (MTP) for latency-sensitive single-user workloads
- You want Prometheus metrics and an OpenAI-compatible API out of the box (see the metrics sketch below)
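On the metrics point: vLLM serves Prometheus text under /metrics on the API port, so a plain HTTP GET is enough to inspect it. The vllm: prefix below matches recent releases, but the exact metric names shift between versions.

```python
# Dump vLLM's Prometheus counters, served at /metrics on the API port.
# The "vllm:" prefix matches recent releases; exact metric names vary by version.
import requests

text = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in text.splitlines():
    if line.startswith("vllm:"):
        print(line)
```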
Use llama.cpp when:
- You need MXFP4 precision (highest prompt throughput, not yet in vLLM)
- Building a local dev setup without Docker
- Running a model that doesn’t have an NVFP4 HuggingFace upload yet (quantize locally with convert_hf_to_gguf.py)
- You want minimal dependencies — a single binary, no Python environment
Use Ollama when:
- Prototyping or developer-only workloads where zero-config beats raw performance
- You want GUI frontends (Open WebUI, Continue.dev) that target the Ollama API (see the sketch after this list)
- Running smaller models (≤14B) where the overhead doesn’t matter
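Those frontends speak Ollama's native API rather than the /v1 compatibility layer, and hitting it directly is a one-liner. A sketch, assuming the default port and a placeholder model tag:

```python
# Call Ollama's native chat API directly (distinct from its /v1 OpenAI shim).
# The model tag is a placeholder for whatever you have pulled with `ollama pull`.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3:14b",  # placeholder tag
        "messages": [{"role": "user", "content": "Hello from DGX Spark"}],
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["message"]["content"])
```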
The TRITON-only gotcha
On SM121, FP8 MoE in vLLM is TRITON-only — FLASHINFER, CUTLASS, and DEEPGEMM are not available for FP8. This is why untuned vLLM FP8 underperforms: it falls back to the slower backend. NVFP4 gets flashinfer_cutlass via the explicit flag, which is how 55.9 tok/s becomes achievable.
If you see vLLM FP8 performing worse than llama.cpp, this is why — set --moe-backend=flashinfer_cutlass and switch to NVFP4.