Builder Daily

2026-05-09

DGX Spark + Mac Studio disaggregated serving — 2.8× speedup on GPT-OSS-120B by splitting prefill from decode

A community pattern pairs DGX Spark for prefill (~1,723 tok/s on GPT-OSS-120B) with Mac Studio M3 Ultra for decode (819 GB/s) to hit 2.8× end-to-end vs single-Spark FP8.

A community writeup published May 5 has been trending through DGX Spark forums all week: pairing a DGX Spark with a Mac Studio M3 Ultra via disaggregated serving yields a measured 2.8× end-to-end speedup on GPT-OSS-120B vs a single-Spark FP8 baseline.

The bandwidth math behind the pattern

DGX Spark and Mac Studio M3 Ultra have asymmetric strengths that map cleanly onto LLM workload phases:

| Phase | Bottleneck | DGX Spark (GB10) | Mac Studio M3 Ultra |
| --- | --- | --- | --- |
| Prefill (process input) | Compute (TFLOPS) | Strong: Blackwell tensor cores | Weaker: unified-memory ALUs |
| Decode (generate tokens) | Memory bandwidth | 273 GB/s LPDDR5X | 819 GB/s unified memory |

GPT-OSS-120B prefill on DGX Spark measured ~1,723 tok/s (compute-bound, so Spark wins that phase). Decode on the same hardware is capped by memory bandwidth at ~36 tok/s in FP8. The Mac Studio has 3× the memory bandwidth, which makes it roughly 2× faster at decode in practice. Splitting the workload, with Spark prefilling and Mac Studio decoding, combines each device's strength.
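To see where numbers like 36 tok/s come from, a back-of-envelope roofline helps: decode must stream every active weight through memory once per token, so bandwidth divided by bytes-per-token gives a hard ceiling. The sketch below is illustrative; the ~5.1B active-parameter figure for GPT-OSS-120B (it is an MoE) and the 1 byte/param FP8 weight size are assumptions, not measured values.

```python
# Back-of-envelope decode roofline: every generated token streams the active
# weights through memory once, so tok/s <= bandwidth / bytes-per-token.
def decode_ceiling(bandwidth_gb_s: float, active_params_b: float, bytes_per_param: float) -> float:
    """Upper bound on decode tok/s from memory bandwidth alone."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Assumed: ~5.1B active params per token for GPT-OSS-120B, FP8 = 1 byte/param.
for name, bw in [("DGX Spark, 273 GB/s", 273), ("M3 Ultra, 819 GB/s", 819)]:
    print(f"{name}: ceiling ~{decode_ceiling(bw, 5.1, 1.0):.0f} tok/s")
```

This prints ceilings of roughly 54 tok/s for the Spark and 161 tok/s for the M3 Ultra. Measured rates land well under both because KV-cache reads, activations, and kernel overheads also consume bandwidth, but the 3× gap between the ceilings is exactly the asymmetry the pattern exploits.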

The measured results

| Config | GPT-OSS-120B tok/s | Notes |
| --- | --- | --- |
| Single Spark FP8 | 36 | Bandwidth-bound on decode |
| Single Spark NVFP4 (post-CES) | 49.7 | NVFP4 packing helps |
| Spark + Mac Studio disaggregated | ~100+ end-to-end | 2.8× vs Spark FP8 baseline |
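A toy steady-state model shows why the end-to-end number tracks the decode stage once the phases run on separate boxes. It ignores batching, KV-transfer time, and queueing, so it will not reproduce the exact 2.8× figure; the request shape and the ~72 tok/s Mac decode rate below are illustrative assumptions.

```python
# Toy model: a single box serializes prefill + decode per request, while a
# disaggregated pipeline overlaps them, so throughput tracks the slower stage.
PROMPT, OUTPUT = 2048, 512                   # illustrative request shape (assumed)
PREFILL, SPARK_DEC, MAC_DEC = 1723, 36, 72   # tok/s; Mac rate ~2x Spark (assumed)

single = OUTPUT / (PROMPT / PREFILL + OUTPUT / SPARK_DEC)    # Spark does both phases
pipeline = OUTPUT / max(PROMPT / PREFILL, OUTPUT / MAC_DEC)  # stages overlap
print(f"single Spark: {single:.0f} tok/s, disaggregated: {pipeline:.0f} tok/s")
```

That works out to roughly 33 vs 72 tok/s, about 2.2× in this toy model; the rest of the measured 2.8× presumably comes from batching and scheduling effects the model omits.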

The disaggregation is plumbed via NIXL or a similar prefill/decode controller. Both vLLM 0.20+ and TensorRT-LLM 1.3.0rc14+ support the pattern natively — TRT-LLM PR #13198 (“KV-aware ADP routing”) landed in this week’s release and is the cleanest path on the NVIDIA stack.
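On the vLLM side, the general shape follows vLLM's disaggregated-prefill example: one instance launches as KV producer, another as KV consumer. The sketch below shows only the producer side, since in the Spark + Mac setup the consumer role is played by the Mac-side stack over NIXL rather than a second vLLM instance; the config field names follow vLLM's published example and may differ across versions.

```python
# Sketch of the prefill (KV-producer) side, after vLLM's disaggregated-prefill
# example. Field names follow that example and may vary by vLLM version.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

llm = LLM(
    model="openai/gpt-oss-120b",  # assumed model id
    kv_transfer_config=KVTransferConfig(
        kv_connector="PyNcclConnector",  # NIXL-backed connectors vary by build
        kv_role="kv_producer",
        kv_rank=0,
        kv_parallel_size=2,
    ),
)

# The producer runs prefill only (max_tokens=1); the KV cache it builds is
# what gets shipped to the decode node.
llm.generate(["Disaggregated serving splits prefill from decode because"],
             SamplingParams(max_tokens=1))
```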

When this matters and when it doesn’t

Strong candidates for disaggregation:

  - Dense 70B+ models, where single-Spark decode is bandwidth-bound at the ~36 tok/s floor
  - Long-generation workloads, where decode dominates end-to-end time

Weaker candidates:

  - A3B-class MoE models, which already decode well on a single Spark
  - Dense models under ~27B, where a single Spark's bandwidth still holds up

What to do

If you have a Mac Studio M3 Ultra (or even an M2 Ultra) sitting idle, this is the highest-ROI weekend project for any DGX Spark operator running 70B+ dense models. The setup, roughly:

  1. Install TRT-LLM 1.3.0rc14 on Spark with --enable-disaggregated
  2. Install MLX or llama.cpp Metal on Mac Studio for the decode role
  3. Connect via NIXL or the TRT-LLM-native KV transfer (PR #13198)
  4. Route via vLLM's prefill/decode router or a thin custom layer (a sketch of the latter follows below)
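For step 4, the "thin custom layer" can be very thin indeed. The sketch below is purely illustrative: the `/prefill` and `/decode` endpoints, URLs, and JSON fields are hypothetical, and it assumes the actual KV-cache handoff happens out of band via NIXL or the TRT-LLM KV transfer, with the router only passing an opaque handle.

```python
# Minimal illustrative prefill/decode router. Endpoints and payload fields are
# hypothetical; the KV cache itself moves out of band (NIXL / TRT-LLM transfer).
import requests

PREFILL_URL = "http://spark.local:8000/prefill"    # assumed Spark-side endpoint
DECODE_URL = "http://macstudio.local:8001/decode"  # assumed Mac-side endpoint

def generate(prompt: str, max_tokens: int = 256) -> str:
    # 1. Spark runs the compute-bound prefill and registers the KV cache,
    #    returning an opaque handle the decode node can fetch it by.
    kv = requests.post(PREFILL_URL, json={"prompt": prompt}, timeout=60).json()
    # 2. Mac Studio pulls the KV cache and runs the bandwidth-bound decode.
    out = requests.post(
        DECODE_URL,
        json={"kv_handle": kv["kv_handle"], "max_tokens": max_tokens},
        timeout=600,
    ).json()
    return out["text"]

if __name__ == "__main__":
    print(generate("Explain disaggregated serving in two sentences."))
```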

For Spark-only operators, the practical takeaway is the inverse: prefer A3B-class MoE over dense 30B+ for sustained single-Spark serving. The bandwidth math is unkind to dense models above 27B even at NVFP4.
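To put numbers on that, reuse the roofline from earlier. The figures below are assumptions for illustration: ~0.5 byte/param for NVFP4 (ignoring scale-factor overhead) and ~3B active parameters for an A3B-class MoE.

```python
# Spark-only decode ceilings from the same bandwidth roofline as above.
BW = 273e9  # DGX Spark LPDDR5X, bytes/s
for name, params, bytes_per in [
    ("dense 30B @ NVFP4", 30e9, 0.5),                # ~0.5 byte/param at 4-bit (assumed)
    ("A3B-class MoE, ~3B active @ NVFP4", 3e9, 0.5), # assumed active-param count
]:
    print(f"{name}: ceiling ~{BW / (params * bytes_per):.0f} tok/s")
```

A ~18 tok/s ceiling for the dense 30B, before any KV or kernel overhead, against ~182 tok/s for the MoE, is the whole argument in two lines.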

