2026-05-09
DGX Spark + Mac Studio disaggregated serving — 2.8× speedup on GPT-OSS-120B by splitting prefill from decode
A community pattern pairs DGX Spark for prefill (~1,723 tok/s on GPT-OSS-120B) with Mac Studio M3 Ultra for decode (819 GB/s) to hit 2.8× end-to-end vs single-Spark FP8.
A community writeup published May 5 has been trending through DGX Spark forums all week: pairing a DGX Spark with a Mac Studio M3 Ultra via disaggregated serving yields a measured 2.8× end-to-end speedup on GPT-OSS-120B vs a single-Spark FP8 baseline.
The bandwidth math behind the pattern
DGX Spark and Mac Studio M3 Ultra have asymmetric strengths that map cleanly onto LLM workload phases:
| Phase | Bottleneck | DGX Spark (GB10) | Mac Studio M3 Ultra |
|---|---|---|---|
| Prefill (process input) | Compute (TFLOPS) | Strong — Blackwell tensor cores | Weaker — general-purpose GPU ALUs, no tensor cores |
| Decode (generate tokens) | Memory bandwidth | 273 GB/s LPDDR5X | 819 GB/s unified memory |
GPT-OSS-120B prefill on DGX Spark measured ~1,723 tok/s: compute-bound, so Spark wins. Decode on the same hardware is capped by memory bandwidth at ~36 tok/s FP8. The Mac Studio's 3× bandwidth advantage makes it roughly 2× faster at decode in practice. Splitting the workload, with Spark prefilling and the Mac Studio decoding, combines each device's strength.
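The roofline arithmetic behind that claim is worth making explicit. Treating decode as purely bandwidth-bound, where every generated token streams the active weights and KV cache from memory, the figures above pin down the implied per-token traffic and the Mac Studio's ceiling. A back-of-envelope sketch using only numbers quoted in this post:

```python
# Back-of-envelope decode roofline, using only figures quoted above.
# Model: decode throughput ~= memory bandwidth / bytes moved per token.

SPARK_BW_GBS = 273   # DGX Spark LPDDR5X, GB/s
MAC_BW_GBS = 819     # Mac Studio M3 Ultra unified memory, GB/s
SPARK_FP8_TOKS = 36  # measured single-Spark FP8 decode, tok/s

# Implied traffic per generated token on Spark (weights + KV reads).
gb_per_token = SPARK_BW_GBS / SPARK_FP8_TOKS  # ~7.6 GB/token

# If the Mac Studio moves the same bytes per token, its ceiling is:
mac_ceiling = MAC_BW_GBS / gb_per_token       # ~108 tok/s

print(f"implied traffic: {gb_per_token:.1f} GB/token")
print(f"Mac Studio decode ceiling: {mac_ceiling:.0f} tok/s")
# ~108 tok/s is a ceiling, not a measurement; the ~100+ tok/s end-to-end
# figure below (prefill handoff included) sits plausibly just under it,
# which is why the realized decode gain is ~2x rather than the full 3x.
```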
The measured results
| Config | GPT-OSS-120B tok/s | Notes |
|---|---|---|
| Single Spark FP8 | 36 | bandwidth-bound on decode |
| Single Spark NVFP4 (post-CES) | 49.7 | NVFP4 packing helps |
| Spark + Mac Studio disaggregated | ~100+ end-to-end | 2.8× vs Spark FP8 baseline |
The disaggregation is plumbed via NIXL or a similar prefill/decode controller. Both vLLM 0.20+ and TensorRT-LLM 1.3.0rc14+ support the pattern natively — TRT-LLM PR #13198 (“KV-aware ADP routing”) landed in this week’s release and is the cleanest path on the NVIDIA stack.
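For orientation, the vLLM side of such a split runs as two cooperating instances: a KV producer (prefill) and a KV consumer (decode). The sketch below follows vLLM's experimental disaggregated_prefill example; field names vary by release, connector rendezvous settings are omitted, and the writeup's actual decode box is a Mac running MLX or llama.cpp rather than a second vLLM instance, so treat this as illustrative rather than a recipe:

```python
# Sketch of a two-process vLLM disaggregated-prefill setup: a KV producer
# (prefill host) and a KV consumer (decode host). Field names follow
# vLLM's experimental disaggregated_prefill example and may differ by
# release; rendezvous settings (host/port) are omitted for brevity.
import json
import sys

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

MODEL = "openai/gpt-oss-120b"  # illustrative model id
PROMPT = "Explain disaggregated serving in one paragraph."

def make_llm(role: str, rank: int) -> LLM:
    # Both processes must agree on the connector and world size; the
    # producer ships KV blocks, the consumer receives them.
    cfg = KVTransferConfig.from_cli(json.dumps({
        "kv_connector": "PyNcclConnector",
        "kv_role": role,
        "kv_rank": rank,
        "kv_parallel_size": 2,
    }))
    return LLM(model=MODEL, kv_transfer_config=cfg)

if __name__ == "__main__":
    side = sys.argv[1]  # "prefill" on one host, "decode" on the other
    if side == "prefill":
        llm = make_llm("kv_producer", rank=0)
        # max_tokens=1 forces a full prefill pass whose KV is exported.
        llm.generate(PROMPT, SamplingParams(temperature=0, max_tokens=1))
    else:
        llm = make_llm("kv_consumer", rank=1)
        # Same prompt; the KV arrives over the connector, so this side
        # skips prefill and generates at decode speed.
        out = llm.generate(PROMPT, SamplingParams(temperature=0,
                                                  max_tokens=256))
        print(out[0].outputs[0].text)
```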
When this matters and when it doesn’t
Strong candidates for disaggregation:
- Dense models 70B–150B at FP8 / NVFP4 (decode is bandwidth-pinned)
- Long-context workloads where prefill cost dominates per-request
- Multi-turn agent loops where TTFT determines UX (quick arithmetic below)
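To see why the long-context and TTFT cases benefit, note that prefill cost grows linearly with context. A quick sanity check using only the measured ~1,723 tok/s Spark prefill rate (handoff and network overhead ignored):

```python
# Time-to-first-token from prefill alone, at the measured Spark prefill
# rate of ~1,723 tok/s on GPT-OSS-120B (handoff overhead ignored).
PREFILL_TOKS = 1723  # tok/s, measured figure from the writeup

for context in (2_048, 8_192, 32_768):
    ttft_s = context / PREFILL_TOKS
    print(f"{context:>6}-token context -> ~{ttft_s:4.1f} s TTFT")

# ~1.2 s at 2k, ~4.8 s at 8k, ~19 s at 32k: prefill dominates per-request
# cost at long contexts, so keeping it on the compute-strong box is the
# whole point of the split.
```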
Weaker candidates:
- A3B-class MoE (Qwen3.6-35B-A3B etc.) — the active-parameter math reduces decode bandwidth pressure; single-Spark already serves competitively at 55+ tok/s
- Anything under 30B — overhead of disaggregation eats the gain
What to do
If you have a Mac Studio M3 Ultra (or even M2 Ultra) sitting idle, this is the highest-ROI weekend project for any DGX Spark operator running 70B+ dense models. The setup, roughly:
- Install TRT-LLM 1.3.0rc14 on Spark with `--enable-disaggregated`
- Install MLX or llama.cpp (Metal) on Mac Studio for the decode role
- Connect via NIXL or the TRT-LLM-native KV transfer (PR #13198)
- Route via vLLM's prefill/decode router or a thin custom layer; a minimal sketch of such a router follows this list
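A thin custom routing layer can be as simple as an HTTP proxy that sends each request to the prefill endpoint first, then hands the resulting KV handle to the decode endpoint. A minimal sketch, assuming hypothetical `/prefill` and `/decode` endpoints that exchange a `kv_handle`; the real interface depends on your connector:

```python
# Minimal prefill/decode router sketch. The /prefill and /decode endpoints
# and the kv_handle field are HYPOTHETICAL placeholders: the real
# interface depends on the KV connector (NIXL, TRT-LLM native, etc.).
import requests

PREFILL_URL = "http://spark.local:8000/prefill"    # DGX Spark (assumed host)
DECODE_URL = "http://macstudio.local:8001/decode"  # Mac Studio (assumed host)

def generate(prompt: str, max_tokens: int = 256) -> str:
    # 1) Prefill on the compute-strong box; returns a handle to the KV
    #    cache that the decode box can fetch over the transfer fabric.
    pre = requests.post(PREFILL_URL, json={"prompt": prompt}, timeout=120)
    pre.raise_for_status()
    kv_handle = pre.json()["kv_handle"]

    # 2) Decode on the bandwidth-strong box, resuming from the shipped KV.
    dec = requests.post(
        DECODE_URL,
        json={"kv_handle": kv_handle, "max_tokens": max_tokens},
        timeout=600,
    )
    dec.raise_for_status()
    return dec.json()["text"]

if __name__ == "__main__":
    print(generate("Summarize the disaggregated-serving pattern."))
```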
For Spark-only operators, the practical takeaway is the inverse: prefer A3B-class MoE over dense 30B+ for sustained single-Spark serving. The bandwidth math is unkind to dense models above 27B even at NVFP4.
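That unkind math is easy to check yourself: at decode, a dense model streams all of its weights for every token, while an A3B-class MoE streams only its ~3B active parameters. A rough calculator, assuming weight traffic dominates and ignoring KV-cache reads:

```python
# Rough single-Spark decode ceilings from weight traffic alone
# (KV-cache reads ignored, so real numbers land somewhat lower).
SPARK_BW_GBS = 273  # DGX Spark memory bandwidth, GB/s

def decode_ceiling(active_params_b: float, bits_per_weight: float) -> float:
    """Upper bound on tok/s when every token streams the active weights."""
    gb_per_token = active_params_b * bits_per_weight / 8  # GB per token
    return SPARK_BW_GBS / gb_per_token

# Dense 30B at NVFP4 (~4 bits/weight): 15 GB/token -> ~18 tok/s ceiling.
print(f"dense 30B @ NVFP4: {decode_ceiling(30, 4):.0f} tok/s")

# A3B-class MoE (~3B active) at NVFP4: 1.5 GB/token -> ~180 tok/s ceiling,
# which is why a single Spark already serves these competitively
# (55+ tok/s measured, per the list above).
print(f"MoE 3B-active @ NVFP4: {decode_ceiling(3, 4):.0f} tok/s")
```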