2026-05-02
World models for robotics — Cosmos, Genie 3, and V-JEPA-2 explained for builders
NVIDIA Cosmos, Google Genie 3, and Meta V-JEPA-2 each take a different bet on synthetic training data for embodied AI. Here is what each one is actually good for, and the open question of whether world models can replace teleoperation.
The bottleneck for embodied AI today is not compute or model capacity — it is training data. Teleoperation rigs collect about 50 hours of demonstrations per day per setup, and labeling is the hidden chokepoint. World models offer a way out: if a learned simulator is faithful enough, a robot can train inside it for millions of trajectories. Three labs are leading this bet.
NVIDIA Cosmos — the policy-training workhorse
Cosmos ships as a family of open-weights diffusion + autoregressive models that generate photoreal robot-camera footage conditioned on action sequences. The headline claim is physics consistency on contact-rich tasks: pick-and-place, articulated objects, deformables. You feed it a 1-second seed video plus an action plan, and it predicts the next 5–10 seconds of egocentric video the robot would see.
What it is good for: generating millions of synthetic demonstrations for VLA-model fine-tuning. What it is not good for: novel scenes outside its training distribution (factory floors with unusual lighting, cluttered home kitchens). Figure and Boston Dynamics both used it in their Q1 2026 training pipelines.
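To make that interface concrete, here is a minimal sketch of the seed-plus-action-plan call in Python. The `CosmosWorldModel` class, its `predict` signature, and the tensor shapes are placeholders for illustration, not the released Cosmos API; the stub returns blank frames where real sampling would run.

```python
import numpy as np

class CosmosWorldModel:
    """Placeholder wrapper: seed clip + action plan -> predicted egocentric video."""

    def predict(self, seed_frames: np.ndarray, actions: np.ndarray,
                horizon_s: float = 5.0) -> np.ndarray:
        """seed_frames: (T0, H, W, 3), ~1 s of context; actions: (T, action_dim) plan."""
        num_frames = int(horizon_s * 24)  # 24 fps output video
        # A real implementation would run diffusion/AR sampling conditioned on
        # (seed_frames, actions); here we just return blank frames of the right shape.
        return np.zeros((num_frames, *seed_frames.shape[1:]), dtype=np.uint8)

# Usage: 1 s of robot-camera context plus a 5 s, 7-DoF action plan.
seed = np.zeros((24, 480, 640, 3), dtype=np.uint8)   # 1 s @ 24 fps
plan = np.zeros((120, 7), dtype=np.float32)          # 120 steps of end-effector deltas
clip = CosmosWorldModel().predict(seed, plan, horizon_s=5.0)  # -> (120, 480, 640, 3)
```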
Google DeepMind Genie 3 — interactive simulation
Genie 3 is the only one of the three that lets you interact with the simulation in real time: you provide actions, and it generates the next frame. The capability that turned heads in March 2026 was 1280×720 output at 24 fps with consistent object permanence over roughly 2 minutes of interaction.
For robotics, Genie 3’s pitch is reinforcement-learning rollouts in a learned simulator that sits closer to the real-world distribution than any hand-crafted physics engine. The catch: it does not currently expose contact forces or dynamics in a form a robot policy can learn from reliably, so it is best suited to high-level navigation policies, not manipulation.
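In a gym-style wrapper, the interaction pattern looks roughly like the sketch below. `Genie3Client` and its `reset`/`step` methods are invented names standing in for whatever interface is actually exposed; the point is the action-in, frame-out loop and the fact that only pixels come back.

```python
import numpy as np

class Genie3Client:
    """Placeholder: each step() consumes an action and returns the next RGB frame."""

    def reset(self, prompt: str) -> np.ndarray:
        return np.zeros((720, 1280, 3), dtype=np.uint8)   # 1280x720 frame

    def step(self, action: np.ndarray) -> np.ndarray:
        return np.zeros((720, 1280, 3), dtype=np.uint8)

def collect_rollout(sim: Genie3Client, policy, max_steps: int = 24 * 110):
    """Roll a navigation policy inside the learned simulator.

    24 fps * ~110 s stays under the ~2-minute horizon where drift sets in.
    Only pixels come back (no contact forces), which is why this suits
    navigation rather than manipulation.
    """
    frame = sim.reset(prompt="cluttered warehouse aisle, forklift parked on the left")
    frames, actions = [frame], []
    for _ in range(max_steps):
        action = policy(frame)          # e.g. (vx, vy, yaw_rate) for a mobile base
        frame = sim.step(action)
        frames.append(frame)
        actions.append(action)
    return np.stack(frames), np.stack(actions)

# Smoke test with a random-walk policy over 2 s of interaction.
frames, acts = collect_rollout(Genie3Client(),
                               lambda f: np.random.uniform(-1, 1, 3),
                               max_steps=48)
```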
Meta V-JEPA-2 — representation, not generation
V-JEPA-2 takes the opposite approach: instead of generating pixels, it learns a latent representation of how scenes evolve. You can use this as a video encoder that gives a robot a useful internal state without the cost of pixel-level generation. The paper claims SOTA on video understanding benchmarks and on action-anticipation tasks.
For builders: V-JEPA-2 is the right pick when you want a frozen perception backbone and you are training the policy on top. It is also the cheapest of the three to run: Cosmos and Genie 3 require multi-GPU inference, while V-JEPA-2 fits on a single H100.
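Here is a minimal PyTorch sketch of that split, with the encoder frozen and only a small behavior-cloning head trained. `load_vjepa2_encoder` is a stand-in (a tiny pooling stub here), and the 1024-dim latent and 7-DoF action space are assumptions; swap in the real V-JEPA-2 checkpoint loader and your robot's action dimension.

```python
import torch
import torch.nn as nn

def load_vjepa2_encoder() -> nn.Module:
    # Stand-in for the pretrained V-JEPA-2 video encoder: pools a
    # (B, C, T, H, W) clip and projects it to a 1024-d latent.
    return nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(3, 1024))

class BCPolicy(nn.Module):
    def __init__(self, latent_dim: int = 1024, action_dim: int = 7):
        super().__init__()
        self.encoder = load_vjepa2_encoder()
        for p in self.encoder.parameters():
            p.requires_grad_(False)                 # perception backbone stays frozen
        self.head = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),             # e.g. 7-DoF end-effector deltas
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            z = self.encoder(clip)                  # latent state, no pixel generation
        return self.head(z)

# One behavior-cloning step: only the head receives gradients.
policy = BCPolicy()
opt = torch.optim.AdamW(policy.head.parameters(), lr=3e-4)
clips = torch.randn(8, 3, 16, 224, 224)             # batch of 16-frame clips
targets = torch.randn(8, 7)                         # teleop actions to imitate
opt.zero_grad()
loss = nn.functional.mse_loss(policy(clips), targets)
loss.backward()
opt.step()
```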
The open question — can world models replace teleoperation?
The honest answer in May 2026: not yet, but the gap is roughly halving every year. Three things have to happen before world models can displace real-world data collection:
- Contact-force fidelity. Cosmos is the closest, but it still hallucinates failure modes that do not occur in physical rollouts.
- Long-horizon consistency. Genie 3 drifts after ~2 minutes; real tasks span 5–15 minutes.
- Cross-embodiment generalization. A model trained on humanoid footage does not yet transfer well to wheeled bases or fixed-arm platforms.
Until these are solved, the operational answer is hybrid: use world models to 10× your teleop dataset via augmentation, but keep collecting real demonstrations for the long tail.
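One minimal version of that augmentation step: each real teleop episode seeds several world-model rollouts with a jittered copy of its recorded action plan. `rollout_fn` and the episode dict keys are placeholders for whatever world model and data format you actually use (e.g. the Cosmos sketch above).

```python
import numpy as np

def augment_dataset(real_episodes, rollout_fn, variants_per_episode=9,
                    noise_scale=0.02, seed=0):
    """Return real + synthetic episodes (~10x with 9 variants per real demo)."""
    rng = np.random.default_rng(seed)
    out = list(real_episodes)                             # keep the real data
    for ep in real_episodes:
        seed_clip, actions = ep["seed_frames"], ep["actions"]
        for _ in range(variants_per_episode):
            jittered = actions + rng.normal(0.0, noise_scale, size=actions.shape)
            video = rollout_fn(seed_clip, jittered)       # predicted egocentric video
            out.append({"seed_frames": seed_clip, "actions": jittered,
                        "video": video, "synthetic": True})
    return out
```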
Practitioner notes
- If you are training a VLA from scratch: pair Cosmos rollouts with real teleop at a 5:1 synthetic-to-real ratio (a minimal batch sampler for this mix is sketched after this list). More synthetic does not help; published ablations show diminishing returns past 5:1.
- If you are fine-tuning an existing VLA: skip world-model data entirely and focus on real teleop in your target environment. World-model data dilutes the fine-tune.
- Inference cost matters: Cosmos at 24 fps on a single H100 is about $0.02/sec of generated video. A 1-hour synthetic dataset is roughly $72.
- V-JEPA-2 as a perception encoder for behavior cloning is currently underexploited and a quick win — many teams still use CLIP or DINOv2 by default.
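As referenced in the first note, here is a sketch of the 5:1 mix written as a batch sampler, so the ratio holds per gradient step rather than per epoch. Pool contents, batch size, and the with-replacement sampling of synthetic data are assumptions, not a prescribed recipe.

```python
import random

def sample_batch(real_pool, synthetic_pool, batch_size=96,
                 synthetic_ratio=5, rng=None):
    """Draw a training batch with synthetic:real = 5:1 (80 synthetic + 16 real at 96)."""
    rng = rng or random.Random(0)
    n_real = batch_size // (synthetic_ratio + 1)          # 16 real examples
    n_synth = batch_size - n_real                         # 80 synthetic examples
    batch = rng.sample(real_pool, k=min(n_real, len(real_pool)))
    batch += rng.choices(synthetic_pool, k=n_synth)       # sample with replacement
    rng.shuffle(batch)
    return batch
```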