2026-06-08

The bottleneck behind humanoid robots isn't hardware — it's data, and China is brute-forcing it with paid human video


A June 3 report details how JD.com and others are paying ordinary people $3/hour to film chores so humanoid robots can learn. The real physical-AI race is over training data, and the supply chain is getting industrialized.

What shipped

The flashy physical-AI headlines this spring were about hardware and money — factory lines, deployment plans, billion-dollar rounds. The more important story is quieter and showed up in a June 3 report from Rest of World: the data supply chain for humanoid robots is being industrialized, and right now China is doing it at a scale nobody else is matching.

The clearest example is JD.com. Working with the local government in Suqian, the company is targeting 10 million hours of robotics training data over two years — and the breakdown matters. Per Gasgoo’s reporting, the plan front-loads 5 million hours of real-world human-scenario video in year one, climbs past 10 million hours in year two, and adds roughly 1 million hours of robot-body data on top. To collect it, JD says it will mobilize over 100,000 internal employees and up to 500,000 external workers across 100+ scenarios spanning logistics, manufacturing, healthcare, home services, and urban operations.

The collection method is the part builders should sit with. This isn’t expensive teleoperation rigs in a lab. It’s ordinary people wearing head cameras. Rest of World describes a stay-at-home worker filming chores six hours a day at 20 yuan (about $3) per hour, and a homeowner who paid 149 yuan (about $22) for a three-hour in-home session in which a robot from Shenzhen-based X Square Robot practiced folding a few clothing items and arranging shoes. Factory workers in Guangdong wear head cameras plus wrist sensors to capture hand movements on the line.

Why data, not hardware, is the gate

This only makes sense against the fundamental asymmetry of embodied AI. As MIT Technology Review framed it back in 2024, the data robots need is “far harder to come by than the data used to train the most advanced AI models like GPT — mostly text, images, and videos scraped off the internet.” Language models eat the open web. A robot policy needs synchronized camera frames, joint angles, gripper forces, and task context — all recorded during actual physical interaction. That data simply doesn’t exist at internet scale, and “real-world data is relatively scarce and tends to require a lot more time, effort, and expensive equipment to collect.”

Translation: you can’t scrape your way to a generalist manipulation policy. Somebody has to physically do the tasks, instrumented, millions of times. Whoever can stand up that pipeline cheapest and broadest gets the better model.

That reframes the recent hardware milestones. A factory turning out humanoids by the hour is impressive, but a robot with no diverse, task-specific data is an expensive mannequin. The deployments and the data harvest are the same project: every instrumented worker and every robot on a training floor is a data-generation node.

The throughput math

China’s approach is essentially to convert labor into demonstrations at volume. People’s Daily (April 28) reported that at the Shijingshan training center — a facility over 10,000 square meters — 100 humanoid robots began training in October 2025, with each robot generating about four hours of training data per day. At a two-minute sampling interval, the operator says 100 robots can complete at least 12,000 data-collection tasks per day, on chores like folding clothes, sorting parcels, scanning barcodes, and opening locks.
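The operator's figure can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, assuming "two-minute sampling interval" means each robot completes one collection task every two minutes of its four daily training hours; the function name and that reading are mine, not the operator's.

```python
def tasks_per_day(robots: int, hours_per_robot: float, minutes_per_task: float) -> int:
    """Task count if every robot finishes one task per interval (assumed reading)."""
    active_minutes = robots * hours_per_robot * 60
    return int(active_minutes // minutes_per_task)


# Figures as reported for the Shijingshan training center
robots = 100
hours_per_robot = 4      # training data generated per robot per day
minutes_per_task = 2     # assumed: one task per sampling interval

daily_tasks = tasks_per_day(robots, hours_per_robot, minutes_per_task)
daily_hours = robots * hours_per_robot

print(daily_tasks)  # 12000
print(daily_hours)  # 400
```

Under that reading the reported 12,000 tasks per day falls straight out of the other two numbers, and the floor produces about 400 robot-hours of data per day.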

Stacked up, the two collection modes look like this:

Source	Operator	Rate / scale	Cost signal
Home video (egocentric)	Resident, head camera	~6 hrs/day per person	~$3/hr labor; ~$22 per 3-hr robot home visit
Factory capture	Line worker + wrist sensors	Continuous during shift	Layered onto existing wages
Robot training floor	100 robots, instrumented	~4 hrs/robot/day; ~12,000 tasks/day	Fixed facility (10,000+ m²)
Program target (JD, Suqian)	100k internal + 500k external	10M hrs over 2 yrs (5M human-video yr 1)	Subsidized, gov-partnered

The numbers come from different operators and shouldn’t be summed into one clean total, but the direction is unambiguous: human-video data is being treated as a commodity input, priced near minimum-wage labor, and produced in parallel by hundreds of thousands of people.

Practitioner note

If I were building anything that touches manipulation, I’d stop treating data collection as a footnote to the model. A few things I’d actually do:

Budget for data like it’s the main line item, not the model. The cheap, scalable signal here is egocentric human video (head-cam, first-person hands), not lab teleoperation. If you’re paying for clean teleop demos exclusively, you’re paying retail while others buy wholesale.
Design for the cross-embodiment gap up front. Human-hand video and a specific gripper are not the same morphology. The teams winning will be the ones with a retargeting/adaptation layer that turns “human folds a shirt” into “this robot folds a shirt.” Bake that into the data schema (sync timestamps, force where you can get it, consistent camera intrinsics) from day one.
Don’t assume hours equal capability. Ten million hours of someone cooking dinner is not ten million hours of your task. I’d weight a small, tightly scoped, well-labeled set for the exact deployment over a giant generic dump, and treat the generic corpus as pretraining only.
Watch the consent and provenance angle. Video of people in their homes is the asset. If you ship a product trained on it, you want clean licensing and provenance now, not a discovery problem later.
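To make the schema and provenance points concrete, here is a rough sketch of what one demonstration record might carry. Every class, field name, and threshold below is hypothetical and chosen for illustration; it is not any vendor's format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class CameraIntrinsics:
    fx: float
    fy: float
    cx: float
    cy: float
    width: int
    height: int


@dataclass
class Frame:
    t_ns: int                              # capture time on a shared clock, in nanoseconds
    image_path: str
    wrist_force_n: Optional[float] = None  # None when no force sensor was worn


@dataclass
class Episode:
    episode_id: str
    task_label: str            # e.g. "fold_shirt"
    embodiment: str            # "human_hand" or a robot/gripper identifier
    intrinsics: CameraIntrinsics
    consent_record_id: str     # pointer to the signed release (provenance)
    frames: List[Frame] = field(default_factory=list)

    def max_gap_ns(self) -> int:
        """Largest gap between consecutive frame timestamps (0 if fewer than two frames)."""
        ts = [f.t_ns for f in self.frames]
        return max((b - a for a, b in zip(ts, ts[1:])), default=0)

    def is_usable(self, max_gap_ns: int = 100_000_000) -> bool:
        """Require a consent record, strictly increasing timestamps, and no large gaps."""
        ts = [f.t_ns for f in self.frames]
        monotonic = all(a < b for a, b in zip(ts, ts[1:]))
        return bool(self.consent_record_id) and monotonic and self.max_gap_ns() <= max_gap_ns


ep = Episode(
    episode_id="ep-0001",
    task_label="fold_shirt",
    embodiment="human_hand",
    intrinsics=CameraIntrinsics(600.0, 600.0, 320.0, 240.0, 640, 480),
    consent_record_id="release-0001",
    frames=[Frame(0, "f0.jpg"), Frame(33_000_000, "f1.jpg", wrist_force_n=1.2)],
)
print(ep.is_usable())  # True
```

The design choice worth copying is that embodiment, intrinsics, and the consent pointer live on every episode, so a retargeting layer and a licensing audit can both run off the same record.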

Under-considered angle

Everyone is debating synthetic data versus real data. The under-discussed variable is who owns the labor pipeline. The reason this is concentrating in one place isn’t a model breakthrough. It’s that you can stand up a “data-collection neighborhood,” subsidize it with local-government backing, and pay people a few dollars an hour to instrument their daily lives. That’s an industrial-policy and logistics advantage, not an algorithmic one, and it’s much harder for a Western startup to replicate at that price than a new transformer architecture would be.

The second-order risk: a generation of generalist robot policies could end up disproportionately trained on one country’s homes, kitchens, factories, and store layouts. That bakes in distributional bias: a robot that’s seen a million Chinese apartments and very few American or European ones may quietly underperform the moment it’s deployed elsewhere, and nobody will see it in a benchmark number. For builders outside that pipeline, the move isn’t to out-spend the human-video harvest. It’s to own a narrow, high-quality, locally grounded dataset for the exact environment you ship into, and to make sure the foundation model you fine-tune from isn’t silently importing someone else’s living room as the prior.
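One cheap guard against a gap hiding inside a headline benchmark number is to break evaluation results out by deployment environment. A minimal sketch follows; the trial outcomes are made up purely for illustration, and the group names are placeholders, not measurements of any real system.

```python
from collections import defaultdict


def success_by_group(results):
    """results: iterable of (group, succeeded) pairs -> {group: success rate}."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for group, succeeded in results:
        totals[group] += 1
        wins[group] += int(succeeded)
    return {g: wins[g] / totals[g] for g in totals}


# Made-up outcomes: many trials in a well-covered environment, few in another
results = (
    [("region_a", True)] * 90 + [("region_a", False)] * 10
    + [("region_b", True)] * 5 + [("region_b", False)] * 5
)

overall = sum(ok for _, ok in results) / len(results)
per_group = success_by_group(results)

print(round(overall, 3))  # 0.864
print(per_group)          # {'region_a': 0.9, 'region_b': 0.5}
```

In this toy case the aggregate looks healthy while the under-sampled environment sits at a coin flip, which is exactly the failure that a single benchmark figure would hide.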