Builder Daily

2026-04-25

The teleoperation data engine — why ALOHA-2 and GELLO are the new training corpus

Embodied AI is bottlenecked on bimanual manipulation data. A $35K ALOHA-2 rig generates about 50 GB of data per day. A $300 GELLO rig is roughly 100x cheaper but slower. Here is the operational reality of running a teleoperation farm in 2026.

If foundation models for language are bottlenecked on the open web, foundation models for manipulation are bottlenecked on bimanual teleoperation demonstrations. There is no Common Crawl for “human hands picking up cups.” There is only the data you collect, one trajectory at a time, with a human operator. Two hardware platforms have become the de facto standard.

ALOHA-2 — the dual-arm production rig

ALOHA-2, developed at Google DeepMind as the successor to Stanford’s original ALOHA, is a low-cost (relative to industrial robotics) bimanual teleoperation platform. The rig has two leader arms (puppeteered by a human) and two follower arms (the “robot”), with high-resolution wrist cameras and a top-down scene camera. Per-rig cost is about $35,000 in parts, which has become the price of admission for serious manipulation research.

What ALOHA-2 actually buys you: high-quality, low-latency bimanual data with synchronized images, joint angles, and gripper states. A trained operator can collect 6-8 hours of usable demonstrations per day per rig (raw collection time is higher; usable yield is what matters because failed attempts must be filtered).
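To make "synchronized images, joint angles, and gripper states" and "usable yield" concrete, here is a minimal sketch of how an episode might be represented and how usable hours could be counted. This is not the ALOHA-2 file format: the class names, field names, and the `success` flag are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One synchronized sample. All field names are hypothetical."""
    t: float                    # timestamp in seconds since episode start
    joint_angles: List[float]   # follower joint angles, both arms
    gripper: List[float]        # gripper opening, one value per arm
    image_keys: List[str]       # references to the camera frames at time t


@dataclass
class Episode:
    task: str
    success: bool               # assumed to be set during quality review
    steps: List[Step] = field(default_factory=list)

    def duration_s(self) -> float:
        """Elapsed time between the first and last step, in seconds."""
        if len(self.steps) < 2:
            return 0.0
        return self.steps[-1].t - self.steps[0].t


def usable_hours(episodes: List[Episode]) -> float:
    """Sum the duration of successful episodes only, in hours."""
    return sum(e.duration_s() for e in episodes if e.success) / 3600.0
```

Failed attempts still cost operator time, but under this sketch they contribute nothing to the usable total, which is why raw collection hours and usable hours diverge.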

Most published 2026 VLA papers (RT-3, π0.5, GR00T N2) used ALOHA-2 or a derivative for fine-tuning data. The ALOHA-2 dataset format is now the de facto interchange standard.

GELLO — the $300 alternative

GELLO is a clever idea: instead of expensive leader arms, use a 3D-printed kinematic replica of the follower arm with cheap encoders. You move the replica by hand, and the follower mirrors it. Per-rig cost: $300-500 in parts. The result is data that is lower-quality than ALOHA-2 (no force feedback, lower precision on fine-motor tasks) but roughly 100x cheaper to scale.
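Because the replica shares the follower's kinematics, mirroring can in principle be a joint-by-joint mapping rather than an inverse-kinematics solve. The sketch below shows one way that mapping might look. It is not GELLO's actual software: the function name, the per-joint offset and sign calibration, and the clamping to joint limits are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def leader_to_follower(
    leader_rad: Sequence[float],
    offsets_rad: Sequence[float],
    signs: Sequence[int],
    limits_rad: Sequence[Tuple[float, float]],
) -> List[float]:
    """Map leader encoder angles to follower joint targets (hypothetical helper).

    offsets_rad: encoder reading when the replica is in its zero pose
    signs:       +1 or -1 per joint, for encoders mounted in reverse
    limits_rad:  (low, high) follower joint limits used for clamping
    """
    n = len(leader_rad)
    if not (len(offsets_rad) == len(signs) == len(limits_rad) == n):
        raise ValueError("all per-joint sequences must have the same length")

    targets = []
    for q, off, sign, (lo, hi) in zip(leader_rad, offsets_rad, signs, limits_rad):
        target = sign * (q - off)                 # remove offset, fix direction
        targets.append(min(max(target, lo), hi))  # keep within follower limits
    return targets
```

A mapping like this has no force term anywhere in it, which is one way to see why the operator gets no force feedback from the follower.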

GELLO is the right answer when you need to scale to dozens of operators or when the manipulation tasks are simple (pick-and-place, drawer opening). It is the wrong answer for surgical-precision tasks or contact-rich manipulation (peeling, cutting, assembly).
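The scaling argument is easier to see with the parts figures quoted above plugged in. This is a back-of-the-envelope sketch only: it counts the per-rig parts costs stated in this article and nothing else (no operators, space, or maintenance), and the fleet size of 24 is an arbitrary stand-in for "dozens of operators."

```python
ALOHA2_PARTS_USD = 35_000        # per-rig parts cost quoted in the article
GELLO_PARTS_USD = (300, 500)     # low and high per-rig parts cost quoted


def fleet_parts_cost(n_rigs: int, per_rig_usd: float) -> float:
    """Parts cost for a fleet of identical rigs."""
    return n_rigs * per_rig_usd


def cost_ratio_range() -> tuple:
    """(min, max) ratio of ALOHA-2 parts cost to GELLO parts cost."""
    low_price, high_price = GELLO_PARTS_USD
    return ALOHA2_PARTS_USD / high_price, ALOHA2_PARTS_USD / low_price


if __name__ == "__main__":
    n = 24  # illustrative fleet size, not a figure from the article
    print("ALOHA-2 fleet:", fleet_parts_cost(n, ALOHA2_PARTS_USD))
    print("GELLO fleet:  ", [fleet_parts_cost(n, p) for p in GELLO_PARTS_USD])
    print("cost ratio:   ", cost_ratio_range())
```

On these figures the ratio lands between 70x and about 117x, so "100x" is a fair round number, and a 24-rig fleet costs $840,000 in ALOHA-2 parts against $7,200 to $12,000 in GELLO parts.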

The Open X-Embodiment 2.0 dataset includes about 30% GELLO-collected data and 50% ALOHA-2 data; the remaining 20% is industrial robot demos.

The labeling crisis

Here is the part that is rarely discussed publicly: collection is no longer the bottleneck — labeling is. A single ALOHA-2 rig generates ~50 GB/day of multi-modal data. Reviewing it for quality (filtering failed attempts, segmenting task boundaries, annotating sub-task labels) takes 4-6 hours of human time per 1 hour collected.
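The review ratio compounds quickly. The sketch below combines it with the 6-8 usable hours per rig per day quoted earlier. Two assumptions are mine, not the article's: the ratio is applied to usable hours only (raw collection time is higher, so this is a lower bound), and a reviewer works an 8-hour shift.

```python
def review_hours_per_day(collected_hours: float, review_ratio: float) -> float:
    """Human review hours generated by one rig in one day."""
    return collected_hours * review_ratio


def reviewers_needed(review_hours: float, shift_hours: float = 8.0) -> float:
    """Full-time reviewers needed to keep up (shift length is an assumption)."""
    return review_hours / shift_hours


if __name__ == "__main__":
    # Best case: 6 usable hours/day at 4 review hours per collected hour.
    best = review_hours_per_day(6, 4)
    # Worst case: 8 usable hours/day at 6 review hours per collected hour.
    worst = review_hours_per_day(8, 6)
    print("review hours per rig per day:", best, "to", worst)
    print("reviewers per rig:", reviewers_needed(best), "to", reviewers_needed(worst))
```

With the article's ranges this works out to 24 to 48 review hours per rig per day, or 3 to 6 full-time reviewers for every rig, which is why a backlog forms as soon as review staffing falls behind the number of rigs.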

Most teams running teleoperation farms in 2026 have a 5-10x backlog of unlabeled data. Solutions in production:

Cost economics for 2026

If you are starting a manipulation research lab today:

Practitioner notes

What to watch in Q3


