2026-05-03
DGX Spark deployment notes — what the community is actually fighting (2026 Q2)
Six recurring DGX Spark / GB10 deployment pitfalls from the NVIDIA Developer Forums — most are software, not hardware — plus the MoE + NVFP4/MXFP4 consensus.
If you’re standing up an NVIDIA DGX Spark / GB10 box for local LLM serving, the NVIDIA Developer Forums “DGX Spark / GB10” category is the highest-signal place to read first. Here’s what the threads are documenting in early 2026, summarized for builders.
Six recurring failure modes (suspect software before hardware)
1. GPU stuck at ~5W / 0% utilization under load
Driver/CUDA mismatch. Known-good as of 2026-01: Driver 580.95.05 + CUDA 13.0. The older 550.54.15 + CUDA 12.4 combo is broken on Spark. Update both before assuming the GPU is dead.
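A quick way to confirm what is actually installed before digging further — nvidia-smi reports the driver, nvcc the CUDA toolkit (nvcc may live under /usr/local/cuda/bin if it isn’t on PATH):
nvidia-smi --query-gpu=driver_version,name --format=csv,noheader
nvcc --version | grep release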
2. “Thermal throttling” at 80–86°C
Usually a false alarm — those temps are within spec for Spark. The real cause is often the filesystem cache filling unified memory and confusing legacy CUDA tools into reporting stale state. Dropping the page cache clears it:
sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
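To confirm the cache really is the culprit, compare the buff/cache column before and after the drop:
free -h    # buff/cache should shrink sharply after dropping caches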
3. Dense 70B FP8 at 2–3 tok/s
Not a config bug — it’s the 273 GB/s LPDDR5X memory-bandwidth ceiling hitting dense models at this size. The community consensus: switch to an MoE model that activates fewer parameters per token (gpt-oss-120b activates ~5B; Qwen3-MoE and GLM variants are also cited), or use speculative decoding with a draft model.
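Back-of-envelope on why (assuming decode streams the full weight set once per token): a dense 70B model at FP8 is roughly 70 GB of weights, so 273 GB/s ÷ 70 GB ≈ 3.9 tok/s is the theoretical ceiling before any other overhead — the reported 2–3 tok/s is already near that bound. An MoE that activates ~5B params per token only has to stream ~5 GB, which is where the throughput multiple comes from.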
4. Multi-node NCCL silently slow
NCCL over ConnectX-7 silently falls back from RoCE to TCP sockets — with no error — if pods aren’t privileged or the VF NetworkAttachmentDefinitions are missing. The delta is huge: 2.12 → 9.78 GB/s (4.6×) when RoCE actually engages. Always verify the transport before assuming the model code is the bottleneck.
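One way to check which transport NCCL picked is to run a small collective with debug logging and grep the output — a sketch, assuming the nccl-tests binaries are available inside the pod and launched however you normally launch multi-node jobs:
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET ./all_reduce_perf -b 8 -e 256M -f 2 2>&1 | grep -E "NET/(IB|Socket)"
# "NET/IB" with an mlx5 device means RoCE is engaged; "NET/Socket" means the silent TCP fallback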
5. System crashes near 126.5 GB unified memory
Don’t assume the full 128 GB is safe headroom. Llama-swap orchestration needs adaptive memory caps below the physical ceiling.
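A minimal sketch of deriving such a cap — reserving a fixed headroom margin below MemTotal rather than trusting the 128 GB figure; the 8 GiB margin here is an assumption to tune, not a forum-documented number:
total_kib=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
margin_kib=$((8 * 1024 * 1024))                      # assumed 8 GiB of headroom
budget_gib=$(( (total_kib - margin_kib) / 1024 / 1024 ))
echo "cap model loading at ${budget_gib} GiB, not the full 128 GB"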
6. ASUS Ascent GX10 stuck at 30W “Safety Mode”
This one is hardware — a USB-PD firmware negotiation failure. It affects the ASUS-branded variant; the community has documented the symptom.
Quick-triage tool
A community-built spark-doctor CLI checks all six of the above. Run it before opening a forum thread to skip the “did you check…” round-trip.
Quantization consensus for local LLM perf
As of 2026 Q1–Q2, forum consensus runs MoE models at NVFP4 / MXFP4 quantization on Spark — gpt-oss-120b and Qwen3.5-35B-A3B are the two most-cited choices. Native NVFP4 in llama.cpp landed in build b8967 (2026-04-29).
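For reference, a minimal llama-server invocation sketch: the GGUF filename is a placeholder, and -m / -ngl / -c / --port are standard llama.cpp flags; any NVFP4-specific switches introduced in b8967 aren’t shown here, so check that build’s notes.
llama-server -m ./gpt-oss-120b-nvfp4.gguf -ngl 999 -c 8192 --port 8080
# -ngl 999 offloads all layers to the GPU; the model path is a placeholder, not a real artifact name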
Practitioner note (mine)
Three takeaways for someone bringing up a Spark from scratch in 2026 Q2:
- Pin Driver 580.95.05 + CUDA 13.0 at the start. Most performance complaints in forum threads trace to the older driver still being installed.
- Don’t try to run dense 70B+ if you care about throughput. Pick an MoE model with a small active-parameter count and you’ll get 5-10× the tok/s for the same memory.
- If you go multi-node, verify RoCE actually engages. The silent fallback to TCP is the single most expensive footgun in the threads.
The hardware is fast; most of the complaints in 2026 Q1–Q2 are software state and configuration.
Sources
- NVIDIA Forums — GPU stuck at 5W (driver/CUDA mismatch) ↗
- NVIDIA Forums — Thermal throttling false alarms ↗
- NVIDIA Forums — Dense 70B 2-3 tok/s memory bandwidth ceiling ↗
- NVIDIA Forums — Multi-node NCCL silently slow without RoCE ↗
- NVIDIA Forums — System crashes near 126.5 GB unified memory ↗
- NVIDIA Forums — ASUS Ascent GX10 stuck in 30W Safety Mode ↗
- NVIDIA Forums — Community spark-doctor triage CLI ↗