DGX Spark deployment notes
Real-world findings from the NVIDIA DGX Spark / GB10 community on local LLM deployment.
2026-05-09
DGX Spark + Mac Studio disaggregated serving — 2.8× speedup on GPT-OSS-120B by splitting prefill from decode
A community pattern pairs a DGX Spark for prefill (~1,723 tok/s on GPT-OSS-120B) with a Mac Studio M3 Ultra for decode (819 GB/s memory bandwidth) to reach a 2.8× end-to-end speedup vs single-Spark FP8.
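Why splitting helps: prefill is compute-bound and decode is bandwidth-bound, so each phase goes to the box that is strong at it. A back-of-envelope latency model, as a minimal sketch — the prefill rate is from the post, but both decode rates below are assumed placeholders purely to illustrate the arithmetic:

```python
# Simple two-phase latency model:
# end-to-end time = prompt_tokens / prefill_rate + output_tokens / decode_rate.

def e2e_seconds(prompt_toks: int, out_toks: int,
                prefill_tps: float, decode_tps: float) -> float:
    """Total latency when prefill and decode run at the given token rates."""
    return prompt_toks / prefill_tps + out_toks / decode_tps

# 1,723 tok/s prefill is from the post; the decode rates are ASSUMED
# placeholders, not measured numbers.
spark_only = e2e_seconds(8192, 1024, prefill_tps=1723, decode_tps=20)  # assumed Spark decode
split      = e2e_seconds(8192, 1024, prefill_tps=1723, decode_tps=60)  # assumed Mac decode

print(f"single Spark: {spark_only:.1f}s, disaggregated: {split:.1f}s, "
      f"speedup: {spark_only / split:.2f}x")
```

With long prompts the prefill term shrinks relative to decode, so the achievable speedup tracks the ratio of decode rates.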
2026-05-09
Litespark ternary-CPU inference (arXiv 2605.06485) — 9.2× faster TTFT, 52× throughput, ships a pip package
Litespark replaces floating-point matmuls with integer add/sub SIMD over ternary {-1, 0, +1} weight networks: 9.2× faster TTFT, 52× higher throughput, 14× lower memory. Pip-installable and Hugging Face-integrated.
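To see why ternary weights eliminate multiplies entirely: each output element is just the sum of inputs at +1 positions minus the sum at -1 positions. A minimal NumPy sketch of the idea (not Litespark's actual kernel, which does this with integer SIMD):

```python
import numpy as np

def ternary_matvec(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    """y = W @ x where W holds only {-1, 0, +1}: add the inputs where the
    weight is +1, subtract where it is -1 — no multiplications needed."""
    pos = (W == 1)
    neg = (W == -1)
    # Masked adds/subtracts stand in for the integer add/sub SIMD lanes.
    return (np.where(pos, x, 0).sum(axis=1)
            - np.where(neg, x, 0).sum(axis=1))

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))   # ternary weight matrix
x = rng.integers(-8, 8, size=8)        # integer activations
assert np.array_equal(ternary_matvec(W, x), W @ x)
```

The 14× memory reduction follows from the same property: a ternary weight needs under 2 bits versus 16+ for FP weights.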
2026-05-09
llama.cpp lands Gemma 4 26B-A4B NVFP4 (b9080) and MiMo-V2.5 attention kernels (b9085)
llama.cpp b9080–b9085 add native Gemma 4 26B-A4B NVFP4 (52 tok/s on Spark, 82 GB free for KV) and MiMo-V2.5 flash-attention tiles for d_kq=192/d_v=128 GQA shapes.
2026-05-09
TensorRT-LLM v1.3.0rc14 — Qwen3.5 NVFP4 weight-loading fix lands, Mamba-hybrid prefix caching enabled
TRT-LLM 1.3.0rc14 (May 7) lands the Qwen3.5 NVFP4 weight_scales fix, Mamba-hybrid prefix caching, NVFP4 weight updating, DFlash one-model speculative decoding, and a Spark-targeted GEMM performance PR.
2026-05-04
Qwen3 MoE on DGX Spark — NVFP4 vs FP8 benchmarks and what actually works
Community-verified numbers for Qwen3.6-35B-A3B and Qwen3.5-122B-A10B on GB10: NVFP4+MTP reaches 55.9 tok/s single-user and 433 tok/s aggregate at concurrency c=32. Covers the TRITON-only MoE backend gotcha and the MTP + prefix-cache failure mode.
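A quick sanity check on what those two figures imply, using only the numbers quoted above:

```python
single_user_tps = 55.9   # NVFP4 + MTP, one request
aggregate_tps = 433.0    # NVFP4 + MTP, 32 concurrent requests
concurrency = 32

per_user = aggregate_tps / concurrency       # ~13.5 tok/s per user
scaling = aggregate_tps / single_user_tps    # ~7.7x aggregate gain
print(f"per-user at c=32: {per_user:.1f} tok/s "
      f"({scaling:.1f}x aggregate vs single-user)")
```

So batching buys roughly 7.7× aggregate throughput at the cost of each user dropping from ~56 to ~13.5 tok/s.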
2026-05-03
DGX Spark deployment notes — what the community is actually fighting (2026 Q2)
Six recurring DGX Spark / GB10 deployment pitfalls from the NVIDIA Developer Forums — most are software, not hardware — plus the MoE + NVFP4/MXFP4 consensus.
2026-05-02
llama.cpp NVFP4 and MXFP4 build guide for GB10 (SM121)
Step-by-step build flags for llama.cpp NVFP4/MXFP4 on DGX Spark GB10 (SM121). gpt-oss-120B MXFP4 hits pp2048 = 1,980 tok/s (prompt processing) and tg32 = 35 tok/s (token generation) after the PR #22196 merge.
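The pp2048/tg32 labels map directly onto llama-bench's -p and -n flags, so the numbers are straightforward to reproduce once the build is done. A minimal sketch of the invocation — the model path is a placeholder, and the binary path assumes a standard cmake build:

```python
import subprocess

# pp2048 = 2048-token prompt-processing benchmark, tg32 = 32-token
# generation benchmark, matching llama-bench's -p and -n flags.
cmd = [
    "./build/bin/llama-bench",                # standard cmake output path
    "-m", "models/gpt-oss-120b-mxfp4.gguf",   # placeholder model path
    "-p", "2048",                             # prompt-processing run
    "-n", "32",                               # token-generation run
]
print(subprocess.run(cmd, capture_output=True, text=True).stdout)
```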
2026-05-01
vLLM vs llama.cpp vs Ollama on DGX Spark — which inference stack to use
Decision guide for inference stacks on GB10: vLLM wins for MoE + concurrency, llama.cpp for MXFP4 prompt processing and single-user work, Ollama for zero-config development. Includes an NVFP4 tok/s comparison.
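One practical upside of the comparison: all three stacks expose an OpenAI-compatible /v1 endpoint, so switching between them is a one-line base-URL change in the client. A minimal sketch — host, ports, and model name are placeholders (the ports shown are each stack's defaults):

```python
from openai import OpenAI

# vLLM, llama.cpp's llama-server, and Ollama all speak the OpenAI /v1
# chat API, so only base_url differs between stacks.
BACKENDS = {
    "vllm":      "http://spark.local:8000/v1",   # vLLM default port
    "llama.cpp": "http://spark.local:8080/v1",   # llama-server default port
    "ollama":    "http://spark.local:11434/v1",  # Ollama default port
}

client = OpenAI(base_url=BACKENDS["vllm"], api_key="not-needed")
resp = client.chat.completions.create(
    model="qwen3",  # placeholder model name
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```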
2026-04-30
LiteLLM + Claude Code on DGX Spark — LAN serving setup and protocol translation
Route Claude Code API calls to a self-hosted Qwen3 model on DGX Spark through a LiteLLM proxy. Covers configuration, model alias mapping, multi-GPU offload, and the latency tradeoffs versus the cloud API.
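The core of the setup is protocol translation: LiteLLM rewrites requests in one provider's format into the backend's API. The proxy itself is configured in YAML, but the same routing can be sketched with LiteLLM's Python SDK — host and model name below are placeholders:

```python
from litellm import completion

# The "openai/" prefix tells LiteLLM to speak the OpenAI-compatible
# protocol to the server at api_base (e.g. vLLM on the Spark).
# Host and model name are placeholders, not from the post.
resp = completion(
    model="openai/qwen3",
    api_base="http://spark.local:8000/v1",
    api_key="not-needed",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```

In the proxy setup, the same translation runs server-side: Claude Code sends Anthropic-format requests to the LiteLLM endpoint, which maps the model alias to the self-hosted backend.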