2026-05-10
Fine-tuning vision-language-action models on a single DGX Spark — what works in May 2026
Pi-0.5, OpenVLA-2, and RT-2-Edge can all be LoRA-fine-tuned on a single DGX Spark (128 GB unified memory) with 100-300 demos. An overnight run of under four hours yields a deployable robot policy.
For roboticists running a DGX Spark (or a comparable 128 GB unified-memory consumer-tier rig), the May 2026 question of "can I fine-tune a vision-language-action model overnight on this thing" gets a clear yes for three model families. Here's what actually works.
The three options
| Model | Params | Peak memory (LoRA) | Training time (200 demos) | Inference on Spark |
|---|---|---|---|---|
| OpenVLA-2-7B | 7B | 24 GB | 3.5 h | 8 Hz |
| Pi-0.5 | 3B (visual + action heads) | 14 GB | 1.8 h | 22 Hz |
| RT-2-Edge | 5B | 19 GB | 2.7 h | 14 Hz |
All three fit comfortably in DGX Spark’s 128 GB unified memory with room for the full optimizer state. None require gradient checkpointing.
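The "fits comfortably" claim survives rough arithmetic. A minimal sketch, assuming bf16 frozen weights, rank-32 adapters on four attention projections per layer, and fp32 AdamW state for the adapters only (layer count and hidden size are ballpark guesses, not OpenVLA-2 specs):

```python
# Rough LoRA memory budget for a 7B VLA. Layer count, hidden size, and the
# set of adapted projections are ballpark assumptions, not OpenVLA-2 specs.
def lora_budget_gb(params_b, layers=32, hidden=4096, rank=32, targets=4):
    frozen = params_b * 2.0                      # bf16 base weights, in GB
    # Each adapted projection adds A (hidden x rank) and B (rank x hidden).
    adapter_params = layers * targets * 2 * hidden * rank
    adapters = adapter_params * 2 / 1e9          # bf16 adapter weights (grads add a similar sliver)
    optimizer = adapter_params * 12 / 1e9        # fp32 master copy + Adam m and v
    return frozen + adapters + optimizer

print(f"7B LoRA: ~{lora_budget_gb(7):.1f} GB before activations")  # ~14.5 GB
```

The gap to the table's 24 GB peak is activations and image batches; the point is that the adapter optimizer state is a rounding error next to the frozen base, which is why none of the three needs gradient checkpointing.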
What you need before starting
- 100-300 demonstrations for the target task class (more helps, but LoRA gains plateau past ~300)
- Demonstrations in Loomy or LeRobot format (most public datasets ship in one of these); a quick pre-flight check is sketched after this list
- A robot or sim environment matching your evaluation deployment
- 8 hours of uninterrupted wall power for the Spark: fine-tuning doesn't crash, but it holds the GB10 at 95-98 °C for the entire run
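Before launching a run, a quick sanity pass over the demo directory catches the usual misalignment bugs. A minimal sketch; the episode-per-directory layout with `frames.npy` and `actions.npy` is an illustrative assumption, not the Loomy or LeRobot on-disk spec:

```python
# Hypothetical pre-flight check for a demo directory. The episode layout
# (episode_*/frames.npy + actions.npy) is an illustrative assumption, not
# the Loomy or LeRobot on-disk format.
from pathlib import Path

import numpy as np

def check_demos(root: str, lo: int = 100, hi: int = 300) -> None:
    episodes = sorted(Path(root).glob("episode_*"))
    if not lo <= len(episodes) <= hi:
        print(f"note: {len(episodes)} episodes; LoRA gains plateau past ~{hi}")
    for ep in episodes:
        frames = np.load(ep / "frames.npy", mmap_mode="r")
        actions = np.load(ep / "actions.npy", mmap_mode="r")
        # Every frame needs exactly one action label.
        assert len(frames) == len(actions), f"{ep.name}: frame/action mismatch"
    print(f"{len(episodes)} episodes pass basic checks")

check_demos("/path/to/your-task-demos")
```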
Recommended starter config (Pi-0.5)
Pi-0.5 is the easiest entry point: fastest training, smallest memory footprint, and a public LoRA recipe.
```bash
git clone https://github.com/physical-intelligence/pi-zero-five
cd pi-zero-five
pip install -e ".[lora]"   # quote the extras so zsh doesn't glob the brackets
python tools/finetune_lora.py \
  --model pi-zero-five-base \
  --demos /path/to/your-task-demos \
  --output ./checkpoints/your-task \
  --rank 32 --steps 8000 --lr 1e-4 \
  --eval_every 1000 --eval_demos 20
```
Expect ~1.8 hours for 200 demonstrations over 8000 steps. Checkpoints save every 1000 steps; keep the eval-best one (a selection sketch follows).
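Picking the eval-best checkpoint is easy to script. A minimal sketch, assuming the trainer drops one metrics file per eval; the `step_*/eval.json` pattern and the `success_rate` key are assumptions, not part of the pi-zero-five tooling:

```python
# Pick the checkpoint with the highest eval success rate.
# Assumes metrics files like checkpoints/your-task/step_1000/eval.json
# containing {"success_rate": 0.85}; both conventions are hypothetical.
import json
from pathlib import Path

def best_checkpoint(run_dir: str) -> Path:
    scored = []
    for eval_file in Path(run_dir).glob("step_*/eval.json"):
        metrics = json.loads(eval_file.read_text())
        scored.append((metrics["success_rate"], eval_file.parent))
    if not scored:
        raise FileNotFoundError(f"no eval.json files under {run_dir}")
    score, ckpt = max(scored)  # highest success rate wins
    print(f"best: {ckpt.name} (success_rate={score:.2f})")
    return ckpt

best_checkpoint("./checkpoints/your-task")
```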
What doesn’t work yet on Spark
- Full fine-tuning of any VLA larger than 3B: the optimizer state blows past 128 GB (back-of-envelope math after this list)
- Multi-GPU training: Spark is single-GPU, so full FT of 7B+ still needs an H100 cluster or a 2-Spark NIXL setup (see the 2026-05-09 disaggregated serving article)
- Real-time inference above 30 Hz: decoding is bandwidth-bound, so OpenVLA-2 caps at 8 Hz, RT-2-Edge at 14 Hz, and Pi-0.5 at 22 Hz
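Both limits fall out of simple arithmetic. A minimal sketch, assuming standard bf16-plus-fp32-AdamW byte counts for full fine-tuning and weight-streaming-bound decoding at the Spark's roughly 273 GB/s memory bandwidth (the decode-steps-per-action figures are guesses):

```python
# Back-of-envelope numbers behind both limits. Byte counts per training state
# are standard for bf16 + fp32 AdamW; the decode-steps-per-action figures and
# the ~273 GB/s Spark bandwidth are rough assumptions.
def full_ft_memory_gb(params_b: float) -> float:
    weights = params_b * 2   # bf16 weights
    grads = params_b * 2     # bf16 gradients
    master = params_b * 4    # fp32 master weights
    adam = params_b * 8      # fp32 Adam m and v
    return weights + grads + master + adam  # activations come on top

def hz_ceiling(params_b: float, decode_steps: int, bw_gbs: float = 273.0) -> float:
    # Every decode step streams all bf16 weights through the GPU once.
    return bw_gbs / (params_b * 2 * decode_steps)

print(f"7B full FT: ~{full_ft_memory_gb(7):.0f} GB before activations")  # ~112 GB
print(f"7B, 2 decode steps/action: ~{hz_ceiling(7, 2):.0f} Hz ceiling")  # ~10 Hz
print(f"3B, 1 decode step/action:  ~{hz_ceiling(3, 1):.0f} Hz ceiling")  # ~46 Hz
```

112 GB of training state before a single activation explains why 7B full FT is out, and the quoted 8/14/22 Hz numbers sit plausibly under the streaming ceilings.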
Practitioner note
For robotics builders evaluating Spark as a fine-tune rig: it works, with caveats. LoRA on 3-7B VLAs is comfortable; full fine-tuning is not. If your project requires full FT of a 7B+ VLA, plan for cloud GPU time on a separate budget line. If LoRA is sufficient (most policy-adaptation tasks are), one Spark gets you from prototype to pilot for ~$3500 of capex. Start with Pi-0.5: it has the fastest iteration loop, and its policy quality is competitive with OpenVLA-2 on contact-rich tasks.