DGX Spark deployment notes
Real-world findings from the NVIDIA DGX Spark / GB10 community on local LLM deployment.
2026-05-09
DGX Spark + Mac Studio disaggregated serving — 2.8× speedup on GPT-OSS-120B by splitting prefill from decode
A community pattern pairs a DGX Spark for prefill (~1,723 tok/s on GPT-OSS-120B) with a Mac Studio M3 Ultra for decode (819 GB/s memory bandwidth) to reach a 2.8× end-to-end speedup vs single-Spark FP8.
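Why splitting helps: prefill is compute-bound and decode is bandwidth-bound, so each phase goes to the box that is strong at it. A back-of-envelope latency model, as a minimal sketch — the prefill rate is from the post, but both decode rates below are assumed placeholders purely to illustrate the arithmetic:

```python
# Simple two-phase latency model:
# end-to-end time = prompt_tokens / prefill_rate + output_tokens / decode_rate.

def e2e_seconds(prompt_toks: int, out_toks: int,
                prefill_tps: float, decode_tps: float) -> float:
    """Total latency when prefill and decode run at the given token rates."""
    return prompt_toks / prefill_tps + out_toks / decode_tps

# 1,723 tok/s prefill is from the post; the decode rates are ASSUMED
# placeholders, not measured numbers.
spark_only = e2e_seconds(8192, 1024, prefill_tps=1723, decode_tps=20)  # assumed Spark decode
split      = e2e_seconds(8192, 1024, prefill_tps=1723, decode_tps=60)  # assumed Mac decode

print(f"single Spark: {spark_only:.1f}s, disaggregated: {split:.1f}s, "
      f"speedup: {spark_only / split:.2f}x")
```

With long prompts the prefill term shrinks relative to decode, so the achievable speedup tracks the ratio of decode rates.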
2026-05-09
Litespark ternary-CPU inference (arXiv 2605.06485) — 9.2× faster TTFT, 52× throughput, ships a pip package
Litespark replaces floating-point matmuls with integer add/sub SIMD over ternary {-1, 0, +1} weight networks: 9.2× faster TTFT, 52× higher throughput, 14× lower memory. Pip-installable and Hugging Face-integrated.
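To see why ternary weights eliminate multiplies entirely: each output element is just the sum of inputs at +1 positions minus the sum at -1 positions. A minimal NumPy sketch of the idea (not Litespark's actual kernel, which does this with integer SIMD):

```python
import numpy as np

def ternary_matvec(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    """y = W @ x where W holds only {-1, 0, +1}: add the inputs where the
    weight is +1, subtract where it is -1 — no multiplications needed."""
    pos = (W == 1)
    neg = (W == -1)
    # Masked adds/subtracts stand in for the integer add/sub SIMD lanes.
    return (np.where(pos, x, 0).sum(axis=1)
            - np.where(neg, x, 0).sum(axis=1))

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))   # ternary weight matrix
x = rng.integers(-8, 8, size=8)        # integer activations
assert np.array_equal(ternary_matvec(W, x), W @ x)
```

The 14× memory reduction follows from the same property: a ternary weight needs under 2 bits versus 16+ for FP weights.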
2026-05-09
llama.cpp lands Gemma 4 26B-A4B NVFP4 (b9080) and MiMo-V2.5 attention kernels (b9085)
llama.cpp b9080–b9085 add native Gemma 4 26B-A4B NVFP4 (52 tok/s on Spark, 82 GB free for KV) and MiMo-V2.5 flash-attention tiles for d_kq=192/d_v=128 GQA shapes.
2026-05-09
TensorRT-LLM v1.3.0rc14 — Qwen3.5 NVFP4 weight-loading fix lands, Mamba-hybrid prefix caching enabled
TRT-LLM 1.3.0rc14 (May 7) lands the Qwen3.5 NVFP4 weight_scales fix, Mamba-hybrid prefix caching, NVFP4 weight updating, DFlash one-model speculative decoding, and a Spark-targeted GEMM performance PR.
2026-05-04
Qwen3 MoE on DGX Spark — NVFP4 vs FP8 benchmarks and what actually works
Community-verified numbers for Qwen3.6-35B-A3B and Qwen3.5-122B-A10B on GB10: NVFP4+MTP reaches 55.9 tok/s single-user and 433 tok/s aggregate at concurrency c=32. Covers the TRITON-only MoE backend gotcha and the MTP + prefix-cache failure mode.
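A quick sanity check on what those two figures imply, using only the numbers quoted above:

```python
single_user_tps = 55.9   # NVFP4 + MTP, one request
aggregate_tps = 433.0    # NVFP4 + MTP, 32 concurrent requests
concurrency = 32

per_user = aggregate_tps / concurrency       # ~13.5 tok/s per user
scaling = aggregate_tps / single_user_tps    # ~7.7x aggregate gain
print(f"per-user at c=32: {per_user:.1f} tok/s "
      f"({scaling:.1f}x aggregate vs single-user)")
```

So batching buys roughly 7.7× aggregate throughput at the cost of each user dropping from ~56 to ~13.5 tok/s.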
2026-05-03
DGX Spark deployment notes — what the community is actually fighting (2026 Q2)
Six recurring DGX Spark / GB10 deployment pitfalls from the NVIDIA Developer Forums — most are software, not hardware — plus the MoE + NVFP4/MXFP4 consensus.
2026-05-02
llama.cpp NVFP4 and MXFP4 build guide for GB10 (SM121)
Step-by-step build flags for llama.cpp NVFP4/MXFP4 on DGX Spark GB10 (SM121). gpt-oss-120B MXFP4 hits pp2048 = 1,980 tok/s (prompt processing) and tg32 = 35 tok/s (token generation) after the PR #22196 merge.
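The pp2048/tg32 labels map directly onto llama-bench's -p and -n flags, so the numbers are straightforward to reproduce once the build is done. A minimal sketch of the invocation — the model path is a placeholder, and the binary path assumes a standard cmake build:

```python
import subprocess

# pp2048 = 2048-token prompt-processing benchmark, tg32 = 32-token
# generation benchmark, matching llama-bench's -p and -n flags.
cmd = [
    "./build/bin/llama-bench",                # standard cmake output path
    "-m", "models/gpt-oss-120b-mxfp4.gguf",   # placeholder model path
    "-p", "2048",                             # prompt-processing run
    "-n", "32",                               # token-generation run
]
print(subprocess.run(cmd, capture_output=True, text=True).stdout)
```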
2026-05-01
vLLM vs llama.cpp vs Ollama on DGX Spark — which inference stack to use
Decision guide for inference stacks on GB10: vLLM wins for MoE + concurrency, llama.cpp for MXFP4 prompt processing and single-user work, Ollama for zero-config development. Includes an NVFP4 tok/s comparison.
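One practical upside of the comparison: all three stacks expose an OpenAI-compatible /v1 endpoint, so switching between them is a one-line base-URL change in the client. A minimal sketch — host, ports, and model name are placeholders (the ports shown are each stack's defaults):

```python
from openai import OpenAI

# vLLM, llama.cpp's llama-server, and Ollama all speak the OpenAI /v1
# chat API, so only base_url differs between stacks.
BACKENDS = {
    "vllm":      "http://spark.local:8000/v1",   # vLLM default port
    "llama.cpp": "http://spark.local:8080/v1",   # llama-server default port
    "ollama":    "http://spark.local:11434/v1",  # Ollama default port
}

client = OpenAI(base_url=BACKENDS["vllm"], api_key="not-needed")
resp = client.chat.completions.create(
    model="qwen3",  # placeholder model name
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```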
2026-04-30
LiteLLM + Claude Code on DGX Spark — LAN serving setup and protocol translation
Route Claude Code API calls to a self-hosted Qwen3 model on DGX Spark through a LiteLLM proxy. Covers configuration, model alias mapping, multi-GPU offload, and the latency tradeoffs versus the cloud API.
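The core of the setup is protocol translation: LiteLLM rewrites requests in one provider's format into the backend's API. The proxy itself is configured in YAML, but the same routing can be sketched with LiteLLM's Python SDK — host and model name below are placeholders:

```python
from litellm import completion

# The "openai/" prefix tells LiteLLM to speak the OpenAI-compatible
# protocol to the server at api_base (e.g. vLLM on the Spark).
# Host and model name are placeholders, not from the post.
resp = completion(
    model="openai/qwen3",
    api_base="http://spark.local:8000/v1",
    api_key="not-needed",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```

In the proxy setup, the same translation runs server-side: Claude Code sends Anthropic-format requests to the LiteLLM endpoint, which maps the model alias to the self-hosted backend.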