Builder Daily

2026-05-04

Qwen3 MoE on DGX Spark — NVFP4 vs FP8 benchmarks and what actually works

Community-verified numbers for Qwen3.6-35B-A3B and Qwen3.5-122B-A10B on GB10: NVFP4+MTP reaches 55.9 tok/s single-user, 433 tok/s at c=32. Covers the TRITON-only MoE backend gotcha and the MTP+prefix-cache failure mode.

The community benchmark picture for Qwen3 MoE on DGX Spark (GB10, 128 GB unified LPDDR5X) has stabilized in May 2026. Here are the verified numbers and the configs that produce them.

Qwen3.6-35B-A3B — the daily-driver config

Model: RedHatAI/Qwen3.6-35B-A3B-NVFP4
Container: vllm/vllm-openai:cu130-nightly (NVFP4 requires this image; other vLLM images fail with quantization format errors)
Backend: --moe-backend=flashinfer_cutlass
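A small prep sketch before first launch (assumes huggingface-cli is installed on the host; the download lands in the default ~/.cache/huggingface directory, which the quick-start command below mounts into the container):

# Pre-pull the image and pre-fetch the weights so the first docker run doesn't stall on both at once
docker pull vllm/vllm-openai:cu130-nightly
huggingface-cli download RedHatAI/Qwen3.6-35B-A3B-NVFP4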

Scenario                             | Output tok/s | Mean TTFT | MTP accept %
Single user (512 in / 512 out)       | 55.9         | 166 ms    | 85.4%
Concurrency 32 (1024 in / 512 out)   | 433.4        | 2,317 ms  | 85.2%
Long output (4096 tokens)            | 158.0        | 251 ms    | 92.8%

MTP-1 (speculative decoding with 1 draft token) is responsible for the jump from ~32 tok/s (FP8 baseline) to 55.9 tok/s single-user. The 85–93% acceptance rate means the draft token is accepted most of the time — this model is predictable enough that speculative decoding pays off strongly.
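A back-of-envelope check of that jump, using the numbers above and treating the whole gain as MTP, per the paragraph above (with one draft token per step, the ideal output rate is roughly baseline × (1 + acceptance rate)):

# Ideal MTP-1 throughput from the 32 tok/s baseline and 85.4% acceptance
python3 -c "p = 0.854; base = 32; print(f'{base * (1 + p):.1f} tok/s ideal vs 55.9 tok/s measured')"

The measured figure sits just under the ~59 tok/s ideal, which is consistent with a small draft/verify overhead per step.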

Without MTP, FP8 plateaus at 28–33 tok/s single-user and 155.6 tok/s at c=32. The memory-bandwidth ceiling of 273 GB/s LPDDR5X is the hard wall for dense token generation.

Qwen3.5-122B-A10B — the optimization ladder

This larger model requires vLLM 0.19+ and the Marlin kernel fix (April 2026). The optimization ladder from the NVIDIA forums:

Config                                     | Tok/s | Gain vs. baseline
Baseline INT4 (Intel AutoRound)            | 28.3  | -
+ Hybrid INT4+FP8 on shared expert layers  | 30.8  | +8.8%
+ MTP-1 speculative decoding (FlashInfer)  | 38.4  | +35.8%

38.4 tok/s is the verified hardware ceiling for this model on a single Spark. Task variance is real: 36.3 tok/s on short math outputs, 39.9 tok/s on long code generation (2048+ tokens).

Gotchas that cost hours

1. MoE backend is TRITON only on SM121. FLASHINFER, CUTLASS, and DEEPGEMM are not available for FP8 MoE on consumer Blackwell. NVFP4 gets flashinfer_cutlass via the explicit flag — don’t leave this unset.

2. MTP + prefix caching = failures. Running both speculative decoding and prefix caching caused 17/32 request failures at c=8 (FP8 model). Run one or the other until this is fixed upstream; a launch-flag sketch of both variants follows this list.

3. NVFP4 was slower than INT4 until April 2026. The Marlin kernel fix changed this. If you’re on a pre-April container and NVFP4 shows 16.6 tok/s (less than your INT4 baseline of 28.3), update the container.

4. CUDA graph compilation takes 5–8 minutes. The server appears hung on first start; it is compiling. Set your health check / readiness probe timeout to 600 seconds minimum (example docker run health-check flags after this list).

5. Qwen3-27B unquantized ceiling. Bandwidth math puts the theoretical maximum around 10 tok/s at FP8: 27B params × 1 byte = 27 GB read per generated token, ÷ 273 GB/s ≈ 10 tok/s (BF16 doubles the bytes and halves that). MTP-3 reaches 15.2 tok/s best case. For interactive use at this model size, NVFP4 is the right choice. (The estimate is worked as a one-liner after this list.)
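For gotcha 2, the two mutually exclusive launch variants as flag fragments (a sketch: the speculative flags are the ones given at the end of this post, and --no-enable-prefix-caching is assumed to be the standard vLLM boolean negation of --enable-prefix-caching; verify against your vLLM version):

# Variant A: prefix caching on, speculation off (as in the quick-start below)
  --enable-prefix-caching

# Variant B: MTP-1 speculation on, prefix caching off
  --speculative-model-type=ngram --num-speculative-tokens=1 \
  --no-enable-prefix-caching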
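For gotcha 4, one way to bake the long warm-up into Docker's own health check instead of an external probe (standard docker run health-check flags; vLLM serves a /health endpoint once ready, and the sketch assumes curl is available inside the image):

docker run -d --gpus all --ipc host \
  --health-cmd "curl -fs http://localhost:8000/health || exit 1" \
  --health-interval 30s --health-retries 3 \
  --health-start-period 600s \
  ...   # then the image, model, and flags from the quick-start below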
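And the gotcha-5 ceiling estimate as a one-liner (tokens per second ≈ memory bandwidth ÷ bytes of weights read per generated token):

python3 -c "print(273 / 27)"   # ≈ 10.1 tok/s for 27B params at 1 byte each (FP8)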

Quick-start command

docker run -d --gpus all --ipc host --shm-size 64gb \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:cu130-nightly \
  RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
  --served-model-name qwen3.6-35b \
  --host 0.0.0.0 --port 8000 \
  --dtype bfloat16 --gpu-memory-utilization 0.9 \
  --max-model-len 262144 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --moe-backend=flashinfer_cutlass

For MTP-1 speculative decoding, add: --speculative-model-type=ngram --num-speculative-tokens=1
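Once the server reports ready, a quick smoke test against the OpenAI-compatible endpoint (the model field must match --served-model-name):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3.6-35b", "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32}'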

