2026-04-25
The teleoperation data engine — why ALOHA-2 and GELLO are the new training corpus
Embodied AI is bottlenecked on bimanual manipulation data. A $35K ALOHA-2 rig yields 6-8 hours of usable demonstrations (roughly 50 GB of raw data) per day. A $300-500 GELLO rig is 100x cheaper per unit but collects lower-fidelity data. Here is the operational reality of running a teleoperation farm in 2026.
If foundation models for language are bottlenecked on the open web, foundation models for manipulation are bottlenecked on bimanual teleoperation demonstrations. There is no Common Crawl for “human hands picking up cups.” There is only the data you collect, one trajectory at a time, with a human operator. Two hardware platforms have become the de facto standard.
ALOHA-2 — the dual-arm production rig
Stanford’s ALOHA-2 is a low-cost (relative to industrial robotics) bimanual teleoperation platform. The rig has two leader arms (puppeteered by a human) and two follower arms (the “robot”), with high-resolution wrist cameras and a top-down scene camera. Per-rig cost is about $35,000 in parts, which has become the price of admission for serious manipulation research.
What ALOHA-2 actually buys you: high-quality, low-latency bimanual data with synchronized images, joint angles, and gripper states. A trained operator can collect 6-8 hours of usable demonstrations per day per rig (raw collection time is higher; usable yield is what matters because failed attempts must be filtered).
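To make "synchronized images, joint angles, and gripper states" concrete, here is a minimal sketch of one timestep of a bimanual demonstration. The field names and shapes are illustrative placeholders, not the official ALOHA-2 schema:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Timestep:
    """One synchronized frame of a bimanual demo.

    Shapes are assumptions: 14 joint angles (7 per arm),
    2 gripper states, and a dict of RGB camera views.
    """
    t: float                       # capture timestamp (seconds)
    qpos: np.ndarray               # (14,) follower joint angles, radians
    gripper: np.ndarray            # (2,) gripper openness in [0, 1]
    images: dict[str, np.ndarray]  # camera name -> (H, W, 3) uint8

def validate(step: Timestep) -> bool:
    """Cheap sanity check run before a frame is written to disk."""
    return (
        step.qpos.shape == (14,)
        and step.gripper.shape == (2,)
        and all(img.ndim == 3 and img.shape[2] == 3
                for img in step.images.values())
    )
```

Filtering at write time, rather than during review, is one of the few ways to claw back labeling hours later.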
Most published 2026 VLA papers (RT-3, π0.5, GR00T N2) used ALOHA-2 or a derivative for fine-tuning data. The ALOHA-2 dataset format is now the de facto interchange standard.
GELLO — the $300 alternative
GELLO is a clever idea: instead of expensive leader arms, use a 3D-printed kinematic replica of the follower arm fitted with cheap encoders. You move the replica with your hands and the follower mirrors the motion. Per-rig cost: $300-500 in parts. The resulting data is lower-quality than ALOHA-2's (no force feedback, lower precision on fine-motor tasks) but 100x cheaper to scale.
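The mirroring itself is conceptually just a per-joint calibration: subtract each encoder's zero offset, flip the sign where the replica's encoder runs backwards, and clamp to the follower's joint limits. A minimal sketch, where the offsets, signs, and limits are placeholder values rather than real GELLO calibration data:

```python
import numpy as np

# Per-joint calibration. The printed leader shares kinematics with the
# follower but not encoder zeros or directions; a real rig calibrates
# these per unit. All values below are illustrative.
OFFSET = np.zeros(7)                           # encoder zero -> follower zero (rad)
SIGN = np.ones(7)                              # +1 / -1 per joint
LIMITS = (np.full(7, -2.9), np.full(7, 2.9))   # follower joint limits (rad)

def leader_to_follower(encoder_rad: np.ndarray) -> np.ndarray:
    """Map raw leader encoder angles to safe follower joint targets."""
    q = SIGN * (encoder_rad - OFFSET)
    return np.clip(q, LIMITS[0], LIMITS[1])
```

In practice this function runs in a fixed-rate control loop; the clamp is what stops a careless operator motion from driving the follower into its stops.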
GELLO is the right answer when you need to scale to dozens of operators or when the manipulation tasks are simple (pick-and-place, drawer opening). It is the wrong answer for surgical-precision tasks or contact-rich manipulation (peeling, cutting, assembly).
The Open X-Embodiment 2.0 dataset includes about 30% GELLO-collected data and 50% ALOHA-2 data; the remaining 20% is industrial robot demos.
The labeling crisis
Here is the part that is rarely discussed publicly: collection is no longer the bottleneck — labeling is. A single ALOHA-2 rig generates ~50 GB/day of multi-modal data. Reviewing it for quality (filtering failed attempts, segmenting task boundaries, annotating sub-task labels) takes 4-6 hours of human time per 1 hour collected.
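The ~50 GB/day figure is easy to sanity-check with a back-of-envelope estimate. Every parameter below is an illustrative assumption (three cameras, JPEG-compressed 480p frames, an 8-hour collection day), not a measured ALOHA-2 number:

```python
def daily_data_gb(cameras: int = 3, fps: int = 30, hours: float = 8,
                  kb_per_frame: float = 20, proprio_kb_s: float = 2) -> float:
    """Back-of-envelope storage estimate for one rig-day.

    kb_per_frame assumes JPEG-compressed ~480p frames; proprio_kb_s
    covers joint angles and gripper states. Illustrative defaults only.
    """
    seconds = hours * 3600
    image_kb = cameras * fps * seconds * kb_per_frame
    proprio_kb = seconds * proprio_kb_s
    return (image_kb + proprio_kb) / 1e6  # KB -> GB
```

With these defaults the estimate lands in the low 50s of GB, i.e. the same ballpark as the figure above; images dominate, so camera count and compression settings are the levers that matter.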
Most teams running teleoperation farms in 2026 have a 5-10x backlog of unlabeled data. Solutions in production:
- VLM-as-judge: GPT-5 or Claude Sonnet 4 reviewing video clips for quality. Cuts human review time by 60-70% but requires careful prompt engineering and spot-check audits.
- Self-supervised labeling: cluster trajectories by latent representation (V-JEPA-2 features), label one cluster, propagate. Works for repetitive tasks; breaks for diverse data.
- Just labeling less: skip sub-task annotation, train end-to-end on raw trajectories. Works for VLA models that are big enough; loses interpretability.
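As a toy illustration of the cluster-and-propagate idea, here is nearest-seed label propagation over trajectory features. A production pipeline would use V-JEPA-2 embeddings and proper clustering; this minimal stand-in just assigns each trajectory the label of its closest hand-labeled seed in feature space:

```python
import numpy as np

def propagate_labels(feats: np.ndarray, seed_idx: list[int],
                     seed_labels: list[str]) -> list[str]:
    """Label every trajectory with its nearest labeled seed.

    feats: (N, D) per-trajectory feature vectors (e.g. pooled
    video embeddings). seed_idx / seed_labels: the few trajectories
    a human actually labeled.
    """
    seeds = feats[seed_idx]                                     # (S, D)
    dists = np.linalg.norm(feats[:, None] - seeds[None], axis=-1)  # (N, S)
    return [seed_labels[j] for j in dists.argmin(axis=1)]
```

This is exactly why the approach "works for repetitive tasks; breaks for diverse data": when clusters are tight and well-separated, one seed per cluster suffices; when the data is diverse, nearest-seed assignments become arbitrary.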
Cost economics for 2026
If you are starting a manipulation research lab today:
- One ALOHA-2 rig + one full-time operator + part-time labeler ≈ $250K/yr all-in. Output: ~1,500 hours of usable demonstrations per year. Enough to fine-tune a base VLA on 2-3 specialized tasks.
- A 5-rig GELLO farm with 5 operators ≈ $400K/yr. Output: ~6,000 hours/year of broader-but-noisier data. Good for foundation-model pretraining contributions.
- Buying access to existing datasets (Open X-Embodiment 2.0, RH20T, BridgeData V2) is the rational starting point. ~$0 to download, but everyone is training on the same data so you do not get a competitive moat.
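Dividing the all-in figures above by usable output gives the per-hour comparison that should actually drive the decision:

```python
def cost_per_hour(annual_cost_usd: float, usable_hours: float) -> float:
    """Dollars per usable demonstration hour."""
    return annual_cost_usd / usable_hours

# Figures from the bullets above.
aloha_rate = cost_per_hour(250_000, 1_500)   # single ALOHA-2 rig
gello_rate = cost_per_hour(400_000, 6_000)   # 5-rig GELLO farm
```

The GELLO farm comes out roughly 2.5x cheaper per hour (~$67 vs ~$167), which is the whole trade: you are buying quantity at the cost of fidelity.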
Practitioner notes
- Do not start with GELLO if your manipulation task involves contact forces. The lack of force feedback bites in ways that are not obvious until 200 hours in.
- Operator selection matters more than people admit. A skilled operator collects 3-5x more usable data per hour than a novice. Pay accordingly.
- Camera placement is the most-debated and least-standardized part of ALOHA-2 setups. Wrist cameras + a top-down camera is the safe default.
- The fastest way to make a dent: contribute to Open X-Embodiment 3.0 (call for data closes June 2026). Your data gets used by every VLA paper for the next 2 years.
What to watch in Q3
- ALOHA-3 (rumored: better wrist range of motion, halved cost target)
- GELLO-2 with optional force feedback module ($600 add-on)
- DeepMind RT-X teleoperation rig (industrial-grade, expected $80K)
- An open-source labeling pipeline that uses VLMs as quality reviewers