2026-05-09
DGX Spark + Mac Studio disaggregated serving — 2.8× speedup on GPT-OSS-120B by splitting prefill from decode
A community pattern pairs DGX Spark for prefill (~1,723 tok/s on GPT-OSS-120B) with Mac Studio M3 Ultra for decode (819 GB/s) to hit 2.8× end-to-end vs single-Spark FP8.
A community writeup published May 5 has been trending through DGX Spark forums all week: pairing a DGX Spark with a Mac Studio M3 Ultra via disaggregated serving yields a measured 2.8× end-to-end speedup on GPT-OSS-120B vs a single-Spark FP8 baseline.
The bandwidth math behind the pattern
DGX Spark and Mac Studio M3 Ultra have asymmetric strengths that map cleanly onto LLM workload phases:
| Phase | Bottleneck | DGX Spark (GB10) | Mac Studio M3 Ultra |
|---|---|---|---|
| Prefill (process input) | Compute (TFLOPS) | Strong — Blackwell tensor cores | Weaker — general-purpose GPU ALUs, no tensor cores |
| Decode (generate tokens) | Memory bandwidth | 273 GB/s LPDDR5X | 819 GB/s unified memory |
GPT-OSS-120B prefill on DGX Spark measured ~1,723 tok/s: compute-bound, so Spark wins. Decode on the same hardware is capped by memory bandwidth at ~36 tok/s FP8. The Mac Studio's 3× bandwidth advantage makes it roughly 2× faster at decode in practice. Splitting the workload, with Spark prefilling and the Mac Studio decoding, combines each device's strength.
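The roofline arithmetic behind that claim is worth making explicit. Treating decode as purely bandwidth-bound, where every generated token streams the active weights and KV cache from memory, the figures above pin down the implied per-token traffic and the Mac Studio's ceiling. A back-of-envelope sketch using only numbers quoted in this post:

```python
# Back-of-envelope decode roofline, using only figures quoted above.
# Model: decode throughput ~= memory bandwidth / bytes moved per token.

SPARK_BW_GBS = 273   # DGX Spark LPDDR5X, GB/s
MAC_BW_GBS = 819     # Mac Studio M3 Ultra unified memory, GB/s
SPARK_FP8_TOKS = 36  # measured single-Spark FP8 decode, tok/s

# Implied traffic per generated token on Spark (weights + KV reads).
gb_per_token = SPARK_BW_GBS / SPARK_FP8_TOKS  # ~7.6 GB/token

# If the Mac Studio moves the same bytes per token, its ceiling is:
mac_ceiling = MAC_BW_GBS / gb_per_token       # ~108 tok/s

print(f"implied traffic: {gb_per_token:.1f} GB/token")
print(f"Mac Studio decode ceiling: {mac_ceiling:.0f} tok/s")
# ~108 tok/s is a ceiling, not a measurement; the ~100+ tok/s end-to-end
# figure below (prefill handoff included) sits plausibly just under it,
# which is why the realized decode gain is ~2x rather than the full 3x.
```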
The measured results
| Config | GPT-OSS-120B tok/s | Notes |
|---|---|---|
| Single Spark FP8 | 36 | bandwidth-bound on decode |
| Single Spark NVFP4 (post-CES) | 49.7 | NVFP4 packing helps |
| Spark + Mac Studio disaggregated | ~100+ end-to-end | 2.8× vs Spark FP8 baseline |
The disaggregation is plumbed via NIXL or a similar prefill/decode controller. Both vLLM 0.20+ and TensorRT-LLM 1.3.0rc14+ support the pattern natively — TRT-LLM PR #13198 (“KV-aware ADP routing”) landed in this week’s release and is the cleanest path on the NVIDIA stack.
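For orientation, the vLLM side of such a split runs as two cooperating instances: a KV producer (prefill) and a KV consumer (decode). The sketch below follows vLLM's experimental disaggregated_prefill example; field names vary by release, connector rendezvous settings are omitted, and the writeup's actual decode box is a Mac running MLX or llama.cpp rather than a second vLLM instance, so treat this as illustrative rather than a recipe:

```python
# Sketch of a two-process vLLM disaggregated-prefill setup: a KV producer
# (prefill host) and a KV consumer (decode host). Field names follow
# vLLM's experimental disaggregated_prefill example and may differ by
# release; rendezvous settings (host/port) are omitted for brevity.
import json
import sys

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

MODEL = "openai/gpt-oss-120b"  # illustrative model id
PROMPT = "Explain disaggregated serving in one paragraph."

def make_llm(role: str, rank: int) -> LLM:
    # Both processes must agree on the connector and world size; the
    # producer ships KV blocks, the consumer receives them.
    cfg = KVTransferConfig.from_cli(json.dumps({
        "kv_connector": "PyNcclConnector",
        "kv_role": role,
        "kv_rank": rank,
        "kv_parallel_size": 2,
    }))
    return LLM(model=MODEL, kv_transfer_config=cfg)

if __name__ == "__main__":
    side = sys.argv[1]  # "prefill" on one host, "decode" on the other
    if side == "prefill":
        llm = make_llm("kv_producer", rank=0)
        # max_tokens=1 forces a full prefill pass whose KV is exported.
        llm.generate(PROMPT, SamplingParams(temperature=0, max_tokens=1))
    else:
        llm = make_llm("kv_consumer", rank=1)
        # Same prompt; the KV arrives over the connector, so this side
        # skips prefill and generates at decode speed.
        out = llm.generate(PROMPT, SamplingParams(temperature=0,
                                                  max_tokens=256))
        print(out[0].outputs[0].text)
```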
When this matters and when it doesn’t
Strong candidates for disaggregation:
- Dense models 70B–150B at FP8 / NVFP4 (decode is bandwidth-pinned)
- Long-context workloads where prefill cost dominates per-request
- Multi-turn agent loops where TTFT determines UX (quick arithmetic below)
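To see why the long-context and TTFT cases benefit, note that prefill cost grows linearly with context. A quick sanity check using only the measured ~1,723 tok/s Spark prefill rate (handoff and network overhead ignored):

```python
# Time-to-first-token from prefill alone, at the measured Spark prefill
# rate of ~1,723 tok/s on GPT-OSS-120B (handoff overhead ignored).
PREFILL_TOKS = 1723  # tok/s, measured figure from the writeup

for context in (2_048, 8_192, 32_768):
    ttft_s = context / PREFILL_TOKS
    print(f"{context:>6}-token context -> ~{ttft_s:4.1f} s TTFT")

# ~1.2 s at 2k, ~4.8 s at 8k, ~19 s at 32k: prefill dominates per-request
# cost at long contexts, so keeping it on the compute-strong box is the
# whole point of the split.
```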
Weaker candidates:
- A3B-class MoE (Qwen3.6-35B-A3B etc.) — the active-parameter math reduces decode bandwidth pressure; single-Spark already serves competitively at 55+ tok/s
- Anything under 30B — overhead of disaggregation eats the gain
What to do
If you have a Mac Studio M3 Ultra (or even M2 Ultra) sitting idle, this is the highest-ROI weekend project for any DGX Spark operator running 70B+ dense models. The setup, roughly:
- Install TRT-LLM 1.3.0rc14 on Spark with `--enable-disaggregated`
- Install MLX or llama.cpp (Metal) on Mac Studio for the decode role
- Connect via NIXL or the TRT-LLM-native KV transfer (PR #13198)
- Route via vLLM's prefill/decode router or a thin custom layer; a minimal sketch of such a router follows this list
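A thin custom routing layer can be as simple as an HTTP proxy that sends each request to the prefill endpoint first, then hands the resulting KV handle to the decode endpoint. A minimal sketch, assuming hypothetical `/prefill` and `/decode` endpoints that exchange a `kv_handle`; the real interface depends on your connector:

```python
# Minimal prefill/decode router sketch. The /prefill and /decode endpoints
# and the kv_handle field are HYPOTHETICAL placeholders: the real
# interface depends on the KV connector (NIXL, TRT-LLM native, etc.).
import requests

PREFILL_URL = "http://spark.local:8000/prefill"    # DGX Spark (assumed host)
DECODE_URL = "http://macstudio.local:8001/decode"  # Mac Studio (assumed host)

def generate(prompt: str, max_tokens: int = 256) -> str:
    # 1) Prefill on the compute-strong box; returns a handle to the KV
    #    cache that the decode box can fetch over the transfer fabric.
    pre = requests.post(PREFILL_URL, json={"prompt": prompt}, timeout=120)
    pre.raise_for_status()
    kv_handle = pre.json()["kv_handle"]

    # 2) Decode on the bandwidth-strong box, resuming from the shipped KV.
    dec = requests.post(
        DECODE_URL,
        json={"kv_handle": kv_handle, "max_tokens": max_tokens},
        timeout=600,
    )
    dec.raise_for_status()
    return dec.json()["text"]

if __name__ == "__main__":
    print(generate("Summarize the disaggregated-serving pattern."))
```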
For Spark-only operators, the practical takeaway is the inverse: prefer A3B-class MoE over dense 30B+ for sustained single-Spark serving. The bandwidth math is unkind to dense models above 27B even at NVFP4.
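That unkind math is easy to check yourself: at decode, a dense model streams all of its weights for every token, while an A3B-class MoE streams only its ~3B active parameters. A rough calculator, assuming weight traffic dominates and ignoring KV-cache reads:

```python
# Rough single-Spark decode ceilings from weight traffic alone
# (KV-cache reads ignored, so real numbers land somewhat lower).
SPARK_BW_GBS = 273  # DGX Spark memory bandwidth, GB/s

def decode_ceiling(active_params_b: float, bits_per_weight: float) -> float:
    """Upper bound on tok/s when every token streams the active weights."""
    gb_per_token = active_params_b * bits_per_weight / 8  # GB per token
    return SPARK_BW_GBS / gb_per_token

# Dense 30B at NVFP4 (~4 bits/weight): 15 GB/token -> ~18 tok/s ceiling.
print(f"dense 30B @ NVFP4: {decode_ceiling(30, 4):.0f} tok/s")

# A3B-class MoE (~3B active) at NVFP4: 1.5 GB/token -> ~180 tok/s ceiling,
# which is why a single Spark already serves these competitively
# (55+ tok/s measured, per the list above).
print(f"MoE 3B-active @ NVFP4: {decode_ceiling(3, 4):.0f} tok/s")
```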