arXiv 2606.13672 · 2026-06-11

WEAVER: a world model for robotic manipulation that simultaneously achieves fidelity, long-horizon consistency, and fast inference

Jain, Wu, Farebrother, Swamy, Bajcsy

WEAVER (CMU) is a learned world model for robotic manipulation that resolves the fidelity-consistency-speed trilemma in one architecture. High accuracy, long-horizon coherence, and real-time inference together enable test-time planning with minimal real-world interaction.

arxiv.org/abs/2606.13672

What the paper does

arXiv:2606.13672 (cs.RO, submitted June 11, 2026) from the Carnegie Mellon University robotics group (Jain, Wu, Farebrother, Swamy, Bajcsy) introduces WEAVER — a learned world model specifically designed for robotic manipulation tasks. The paper’s central claim is that WEAVER resolves what the authors call the world-model trilemma in robotics: the observation that existing systems are forced to trade off between fidelity (generated trajectories match reality), long-horizon consistency (the model stays coherent over many steps), and inference speed (the model runs fast enough for real-time planning).

Prior work on robotic world models has optimized for one or two of these properties at the cost of the third. Video prediction models (e.g., RSSM variants) offer long-horizon rollouts but drift from reality over extended sequences. Diffusion-based models can achieve high per-frame fidelity but are too slow for test-time planning. Lighter recurrent models are fast but inconsistent over long horizons. WEAVER is presented as a unified architecture that satisfies all three desiderata simultaneously.

Architecture overview

WEAVER uses a hierarchical latent-space design:

Compact state representation — rather than operating on raw video pixels (expensive), WEAVER encodes robot-relevant state into a compact learned representation that captures object positions, contact states, and task-relevant geometry. This encoding enables fast latent-space rollouts without per-step pixel decoding.

Multi-scale temporal architecture — WEAVER uses two temporal processing layers: a fast-update layer that tracks short-horizon dynamics (contact forces, gripper state, object inertia) and a slow-update layer that maintains long-horizon consistency (task structure, goal state, object identity over occlusion). The two layers share information via cross-attention, allowing the fast layer to correct drift in the slow layer and vice versa.

Fidelity anchoring — periodically during rollout, WEAVER anchors latent predictions to observed states from the real robot, using a learned alignment module that projects the anchor into the latent trajectory rather than resetting the rollout. This prevents the slow drift accumulation that degrades single-track rollout models.
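The three components above can be illustrated with a toy sketch. It is not the paper's implementation: the scalar `Latent` state, the linear update coefficients, the `slow_every` schedule, and the `gain` parameter are all hypothetical stand-ins chosen only to show the shape of a two-timescale rollout with soft anchoring (a partial correction toward the encoded observation rather than a reset).

```python
from dataclasses import dataclass


@dataclass
class Latent:
    fast: float  # short-horizon state (stand-in for contact, gripper, inertia)
    slow: float  # long-horizon state (stand-in for task phase, object identity)


def step(z: Latent, action: float, t: int, slow_every: int = 5) -> Latent:
    """One hypothetical rollout step in latent space (no pixel decoding)."""
    # The fast update reads the slow state: a crude stand-in for cross-attention.
    fast = 0.9 * z.fast + 0.1 * z.slow + action
    slow = z.slow
    if (t + 1) % slow_every == 0:
        # The slow state updates less often and reads the fast state back.
        slow = 0.95 * z.slow + 0.05 * fast
    return Latent(fast, slow)


def anchor(z: Latent, observed: Latent, gain: float = 0.5) -> Latent:
    """Soft anchoring: move part-way toward the encoded observation.

    gain=1.0 would be a hard reset; smaller gains keep the rollout continuous.
    """
    return Latent(
        z.fast + gain * (observed.fast - z.fast),
        z.slow + gain * (observed.slow - z.slow),
    )


def rollout(z0: Latent, actions, observations, gain: float = 0.5):
    """Roll out in latent space; `observations` maps step index -> encoded Latent."""
    z, traj = z0, []
    for t, a in enumerate(actions):
        z = step(z, a, t)
        if t in observations:  # an encoded real observation is available
            z = anchor(z, observations[t], gain)
        traj.append(z)
    return traj
```

In this sketch the caller decides how often anchors arrive by choosing which step indices appear in `observations`; a learned alignment module would replace the fixed `gain` blend.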

Evaluation

The paper evaluates WEAVER on standard robotic manipulation benchmarks including MetaWorld and RoboMimic variants, plus custom long-horizon manipulation suites. Key reported results:

State-of-the-art on long-horizon manipulation benchmarks: WEAVER outperforms prior world models on manipulation sequences of 10 or more steps, where competing models degrade in consistency
Inference speed sufficient for test-time planning: latent-space rollouts run fast enough to drive model-predictive control (MPC) loops at 10 Hz or higher on a standard workstation GPU
Policy improvement from imagined rollouts: policies fine-tuned on WEAVER rollouts show measurable improvement over behavior-cloning baselines, which suggests the rollouts are reliable enough to serve as synthetic policy training data
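To make the 10 Hz figure concrete, the following back-of-the-envelope sketch computes how much time a planner has for each batched latent step. Only the 10 Hz rate comes from the summary above; the candidate count, horizon, and batch size are hypothetical, and the calculation ignores observation encoding and optimizer overhead.

```python
import math


def per_step_budget_ms(control_hz: float, n_candidates: int,
                       horizon: int, batch_size: int) -> float:
    """Upper bound on the time available for one batched model step."""
    period_ms = 1000.0 / control_hz                    # time per replan
    n_batches = math.ceil(n_candidates / batch_size)   # sequential batches
    sequential_steps = n_batches * horizon             # model calls in series
    return period_ms / sequential_steps


# Hypothetical planner: 256 candidate action sequences, 20-step horizon,
# all candidates evaluated in a single batch.
budget = per_step_budget_ms(control_hz=10, n_candidates=256,
                            horizon=20, batch_size=256)
print(f"{budget:.1f} ms per batched latent step")  # prints "5.0 ms per batched latent step"
```

Under these assumptions each batched step must finish in about 5 ms, which is why avoiding per-step pixel decoding matters for planning.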

Why world models matter for manipulation

Robotic manipulation is hard to learn from real-world data alone for three reasons: real-world trials are slow and wear out hardware; failures can damage expensive manipulation setups; and the distribution of interactions needed to learn robust policies is vast. World models address this by enabling synthetic policy training: generate millions of imagined rollouts in the world model, train the policy on that synthetic data, then deploy with minimal real-world fine-tuning.

The bottleneck in this pipeline has been world model quality: if the world model drifts from reality, the synthetic training data poisons the policy. WEAVER’s fidelity anchoring and long-horizon consistency properties directly address the drift problem.
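The drift problem can be seen in a toy example. The sketch below assumes a scalar linear system and a model whose dynamics coefficient is off by 1%; both the system and the size of the error are hypothetical and are not taken from the paper. It shows how a small per-step model error compounds over an open-loop rollout.

```python
def rollout(a: float, x0: float, actions):
    """Roll out x_{t+1} = a * x_t + u_t and return the visited states."""
    xs, x = [], x0
    for u in actions:
        x = a * x + u
        xs.append(x)
    return xs


def drift(a_true: float, a_model: float, x0: float, actions):
    """Per-step absolute gap between the real and the imagined trajectory."""
    real = rollout(a_true, x0, actions)
    imagined = rollout(a_model, x0, actions)
    return [abs(r - m) for r, m in zip(real, imagined)]


# Hypothetical setup: true coefficient 1.0, model coefficient 1.01, 50 steps.
errors = drift(a_true=1.0, a_model=1.01, x0=1.0, actions=[0.0] * 50)
print(f"step 1: {errors[0]:.3f}, step 50: {errors[-1]:.3f}")
```

In this setup the gap starts near 0.01 after one step and grows to roughly 0.64 by step 50, so imagined states late in the rollout no longer resemble what the real system would do.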

Practical implications

For robotics researchers: WEAVER’s hierarchical temporal design is a concrete architectural template for building world models that work at planning timescales (seconds to minutes) rather than just video timescales (frames). The cross-attention between fast and slow layers is the key inductive bias worth replicating.

For robot system builders: A world model that runs at 10 Hz enables closed-loop model-predictive control. The robot plans a trajectory with WEAVER, starts executing, receives new observations, replans, and repeats. This is qualitatively better than the open-loop plans that slower diffusion-based models force. The key open question for deployment is how robust the fidelity-anchoring module is when real observations are noisy or delayed.
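The plan, execute, observe, replan cycle can be sketched with random-shooting MPC, one of the simplest planners. This is an illustration only: `model_step` is a hypothetical 1-D stand-in for a learned latent dynamics step, the "real" system is assumed to match the model exactly, and nothing here reflects the planner used in the paper.

```python
import random


def model_step(x: float, u: float) -> float:
    """Hypothetical stand-in for one learned dynamics step (1-D integrator)."""
    return x + 0.1 * u


def plan(x: float, goal: float, rng: random.Random,
         n_candidates: int = 64, horizon: int = 10) -> float:
    """Random-shooting MPC: sample action sequences, score each by rolling
    it out in the model, and return the first action of the cheapest one."""
    best_cost, best_first = float("inf"), 0.0
    for _ in range(n_candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        xi, cost = x, 0.0
        for u in seq:
            xi = model_step(xi, u)
            cost += (xi - goal) ** 2  # squared distance to goal at each step
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first


def control_loop(x0: float, goal: float, steps: int = 100, seed: int = 0) -> float:
    """Closed loop: replan from the latest observation at every step."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        u = plan(x, goal, rng)  # plan with the model
        x = x + 0.1 * u         # execute one action on the (toy) real system
        # the new value of x is the observation that seeds the next replan
    return x


final_state = control_loop(x0=0.0, goal=1.0)
```

Only the first action of each plan is executed, so model error beyond the first step is corrected at the next replan; this is the property that a fast enough world model makes available.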

For AI researchers: WEAVER’s structure parallels recent architectural trends in language models (fast attention for local context, slow global attention for long-range dependencies) applied to the temporal dimension in robotics. The cross-domain analogy suggests that the hierarchical fast/slow pattern may be a general inductive bias for sequential prediction tasks with multi-scale dynamics.

Practitioner note

If you are building a robot manipulation system and need to choose between behavior cloning, offline RL, and world-model-based planning: WEAVER makes the world-model path meaningfully more attractive by addressing the speed-consistency tradeoff that made prior models impractical for MPC. The practical test is whether WEAVER’s fidelity holds in your specific manipulation domain, since the paper evaluates on benchmark tasks that, while diverse, do not cover every manipulation configuration. The ablation to run first: does fidelity anchoring with your sensor stack (camera latency, calibration error, object occlusion patterns) maintain trajectory coherence, or does it introduce anchoring errors that destabilize planning? That is the critical empirical question before adopting WEAVER for a production manipulation system.
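A minimal version of that ablation can be prototyped on a toy system before touching hardware. The sketch below is hypothetical throughout: a 1-D ramp stands in for the real trajectory, the model overestimates velocity by 10%, and anchoring is a fixed-gain blend toward an observation that may be stale (`delay`) or noisy (`noise_std`). It only shows how to structure the comparison, not how WEAVER's alignment module behaves.

```python
import random


def mean_rollout_error(anchor_every=None, delay=0, noise_std=0.0,
                       gain=0.5, steps=50, seed=0):
    """Mean |predicted - real| over one rollout on a toy 1-D ramp.

    anchor_every=None disables anchoring entirely.
    """
    rng = random.Random(seed)
    real_vel, model_vel = 0.10, 0.11  # hypothetical 10% velocity bias
    real = [0.0]
    pred = 0.0
    total = 0.0
    for t in range(1, steps + 1):
        real.append(real[-1] + real_vel)  # true state advances
        pred += model_vel                 # model prediction advances
        if anchor_every and t % anchor_every == 0:
            # The observation is `delay` steps stale and optionally noisy.
            obs = real[max(0, t - delay)]
            if noise_std > 0.0:
                obs += rng.gauss(0.0, noise_std)
            pred += gain * (obs - pred)   # soft correction, not a reset
        total += abs(pred - real[t])
    return total / steps


no_anchor = mean_rollout_error()
clean = mean_rollout_error(anchor_every=5)
stale = mean_rollout_error(anchor_every=5, delay=10)
noisy = mean_rollout_error(anchor_every=5, noise_std=0.05)
print(f"none={no_anchor:.3f} clean={clean:.3f} stale={stale:.3f} noisy={noisy:.3f}")
```

In this toy setup, clean anchoring reduces the mean error relative to no anchoring, while a 10-step observation delay makes the anchored rollout worse than the unanchored one. Whether real sensor latency has a comparable effect on WEAVER is exactly what the ablation would need to measure.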

Under-considered angle

WEAVER’s long-horizon consistency improvement has an implication the paper does not emphasize: data efficiency. If the world model stays faithful over long manipulation sequences (say, 50 steps), fewer real-world demonstrations may be needed to train a capable policy, because the world model can extrapolate from fewer anchored observations into more diverse imagined experience. Real-world data collection is a major cost in manipulation research, so an architectural improvement in world-model fidelity could reduce the number of physical robot trials required. WEAVER’s contribution may be less “better planning at inference time” and more “a smaller robot-hour data collection budget”, a framing that is more valuable to a lab operating physical hardware than the benchmark numbers suggest.