2026-05-10
Fine-tuning vision-language-action models on a single DGX Spark — what works in May 2026
Pi-0.5, OpenVLA-2, and RT-2-Edge can all be LoRA-fine-tuned on a single DGX Spark (128 GB unified memory) with 100-300 demos. An overnight run of under four hours yields a deployable robot policy.
For roboticists running a DGX Spark (or a comparable 128 GB unified-memory consumer-tier rig), the May 2026 question of "can I fine-tune a vision-language-action model overnight on this thing" gets a clear yes for three model families. Here's what actually works.
The three options
| Model | Params | Peak memory (LoRA) | Training time (200 demos) | Inference on Spark |
|---|---|---|---|---|
| OpenVLA-2-7B | 7B | 24 GB | 3.5 h | 8 Hz |
| Pi-0.5 | 3B (visual + action heads) | 14 GB | 1.8 h | 22 Hz |
| RT-2-Edge | 5B | 19 GB | 2.7 h | 14 Hz |
All three fit comfortably in DGX Spark’s 128 GB unified memory with room for the full optimizer state. None require gradient checkpointing.
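The "fits comfortably" claim survives rough arithmetic. A minimal sketch, assuming bf16 frozen weights, rank-32 adapters on four attention projections per layer, and fp32 AdamW state for the adapters only (layer count and hidden size are ballpark guesses, not OpenVLA-2 specs):

```python
# Rough LoRA memory budget for a 7B VLA. Layer count, hidden size, and the
# set of adapted projections are ballpark assumptions, not OpenVLA-2 specs.
def lora_budget_gb(params_b, layers=32, hidden=4096, rank=32, targets=4):
    frozen = params_b * 2.0                      # bf16 base weights, in GB
    # Each adapted projection adds A (hidden x rank) and B (rank x hidden).
    adapter_params = layers * targets * 2 * hidden * rank
    adapters = adapter_params * 2 / 1e9          # bf16 adapter weights (grads add a similar sliver)
    optimizer = adapter_params * 12 / 1e9        # fp32 master copy + Adam m and v
    return frozen + adapters + optimizer

print(f"7B LoRA: ~{lora_budget_gb(7):.1f} GB before activations")  # ~14.5 GB
```

The gap to the table's 24 GB peak is activations and image batches; the point is that the adapter optimizer state is a rounding error next to the frozen base, which is why none of the three needs gradient checkpointing.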
What you need before starting
- 100-300 demonstrations for the target task class (more helps, but LoRA gains plateau past ~300)
- Demonstrations in Loomy or LeRobot format (most public datasets ship in one of these); a quick pre-flight check is sketched after this list
- A robot or sim environment matching your evaluation deployment
- 8 hours of uninterrupted wall power for the Spark: fine-tuning doesn't crash, but it holds the GB10 at 95-98 °C for the entire run
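Before launching a run, a quick sanity pass over the demo directory catches the usual misalignment bugs. A minimal sketch; the episode-per-directory layout with `frames.npy` and `actions.npy` is an illustrative assumption, not the Loomy or LeRobot on-disk spec:

```python
# Hypothetical pre-flight check for a demo directory. The episode layout
# (episode_*/frames.npy + actions.npy) is an illustrative assumption, not
# the Loomy or LeRobot on-disk format.
from pathlib import Path

import numpy as np

def check_demos(root: str, lo: int = 100, hi: int = 300) -> None:
    episodes = sorted(Path(root).glob("episode_*"))
    if not lo <= len(episodes) <= hi:
        print(f"note: {len(episodes)} episodes; LoRA gains plateau past ~{hi}")
    for ep in episodes:
        frames = np.load(ep / "frames.npy", mmap_mode="r")
        actions = np.load(ep / "actions.npy", mmap_mode="r")
        # Every frame needs exactly one action label.
        assert len(frames) == len(actions), f"{ep.name}: frame/action mismatch"
    print(f"{len(episodes)} episodes pass basic checks")

check_demos("/path/to/your-task-demos")
```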
Recommended starter config (Pi-0.5)
Pi-0.5 is the easiest entry point: fastest training, smallest memory footprint, and a public LoRA recipe.
```bash
git clone https://github.com/physical-intelligence/pi-zero-five
cd pi-zero-five
pip install -e ".[lora]"   # quote the extras so zsh doesn't glob the brackets
python tools/finetune_lora.py \
  --model pi-zero-five-base \
  --demos /path/to/your-task-demos \
  --output ./checkpoints/your-task \
  --rank 32 --steps 8000 --lr 1e-4 \
  --eval_every 1000 --eval_demos 20
```
Expect ~1.8 hours for 200 demonstrations over 8000 steps. Checkpoints save every 1000 steps; keep the eval-best one (a selection sketch follows).
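Picking the eval-best checkpoint is easy to script. A minimal sketch, assuming the trainer drops one metrics file per eval; the `step_*/eval.json` pattern and the `success_rate` key are assumptions, not part of the pi-zero-five tooling:

```python
# Pick the checkpoint with the highest eval success rate.
# Assumes metrics files like checkpoints/your-task/step_1000/eval.json
# containing {"success_rate": 0.85}; both conventions are hypothetical.
import json
from pathlib import Path

def best_checkpoint(run_dir: str) -> Path:
    scored = []
    for eval_file in Path(run_dir).glob("step_*/eval.json"):
        metrics = json.loads(eval_file.read_text())
        scored.append((metrics["success_rate"], eval_file.parent))
    if not scored:
        raise FileNotFoundError(f"no eval.json files under {run_dir}")
    score, ckpt = max(scored)  # highest success rate wins
    print(f"best: {ckpt.name} (success_rate={score:.2f})")
    return ckpt

best_checkpoint("./checkpoints/your-task")
```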
What doesn’t work yet on Spark
- Full fine-tuning of any VLA larger than 3B: the optimizer state blows past 128 GB (back-of-envelope math after this list)
- Multi-GPU training: Spark is single-GPU, so full FT of 7B+ still needs an H100 cluster or a 2-Spark NIXL setup (see the 2026-05-09 disaggregated serving article)
- Real-time inference above 30 Hz: decoding is bandwidth-bound, so OpenVLA-2 caps at 8 Hz, RT-2-Edge at 14 Hz, and Pi-0.5 at 22 Hz
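Both limits fall out of simple arithmetic. A minimal sketch, assuming standard bf16-plus-fp32-AdamW byte counts for full fine-tuning and weight-streaming-bound decoding at the Spark's roughly 273 GB/s memory bandwidth (the decode-steps-per-action figures are guesses):

```python
# Back-of-envelope numbers behind both limits. Byte counts per training state
# are standard for bf16 + fp32 AdamW; the decode-steps-per-action figures and
# the ~273 GB/s Spark bandwidth are rough assumptions.
def full_ft_memory_gb(params_b: float) -> float:
    weights = params_b * 2   # bf16 weights
    grads = params_b * 2     # bf16 gradients
    master = params_b * 4    # fp32 master weights
    adam = params_b * 8      # fp32 Adam m and v
    return weights + grads + master + adam  # activations come on top

def hz_ceiling(params_b: float, decode_steps: int, bw_gbs: float = 273.0) -> float:
    # Every decode step streams all bf16 weights through the GPU once.
    return bw_gbs / (params_b * 2 * decode_steps)

print(f"7B full FT: ~{full_ft_memory_gb(7):.0f} GB before activations")  # ~112 GB
print(f"7B, 2 decode steps/action: ~{hz_ceiling(7, 2):.0f} Hz ceiling")  # ~10 Hz
print(f"3B, 1 decode step/action:  ~{hz_ceiling(3, 1):.0f} Hz ceiling")  # ~46 Hz
```

112 GB of training state before a single activation explains why 7B full FT is out, and the quoted 8/14/22 Hz numbers sit plausibly under the streaming ceilings.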
Practitioner note
For robotics builders evaluating Spark as a fine-tune rig: it works, with caveats. LoRA on 3-7B VLAs is comfortable; full fine-tuning is not. If your project requires full FT of a 7B+ VLA, plan for cloud GPU time on a separate budget line. If LoRA is sufficient (most policy-adaptation tasks are), one Spark gets you from prototype to pilot for ~$3500 of capex. Start with Pi-0.5: it has the fastest iteration loop, and its policy quality is competitive with OpenVLA-2 on contact-rich tasks.