2026-06-17

AV Software Stack & OTA Pipeline Index — Who Improves Fastest in the Field (Mid-2026)

OTA cadence, simulation depth, and the field-data flywheel that determines how fast Tesla, Waymo, and Baidu actually improve in deployment.

The question behind the benchmarks

Miles driven, disengagement rates, and permit counts measure where an AV system stands today. The question that matters more for long-term competitive positioning is a different one: how fast does the system get better? Improvement velocity is the product of three factors (data volume, compute throughput, and OTA update cadence), and the three players at the frontier have made fundamentally different architectural bets on each. This is the twelfth article in the physical AI benchmark series.

Section 1 — Software architecture comparison

The foundational design choices made at the architecture level cascade into every downstream capability: how data is labeled, how models are trained, how updates are validated, and how quickly a new software version can ship to the live fleet.

Dimension	Tesla FSD	Waymo Driver	Baidu Apollo	Notes
Core architecture	End-to-end neural net (camera to action)	Modular (perception + prediction + planning)	Modular (perception + prediction + planning)	Tesla bets on learned policy; others keep explicit planning
Sensor input	8 cameras only	LiDAR + camera + radar	LiDAR + camera + radar	Single-modality vs sensor fusion
Simulation platform	Tesla Dojo + internal sim	Waymo Simulation (Carcraft)	AADS simulator	Waymo’s Carcraft reportedly runs ~25,000 virtual cars simultaneously
OTA update frequency	Weekly to monthly (consumer FSD)	Quarterly or longer (ops fleet)	Not publicly disclosed	Tesla pushes more frequently than either competitor
Fleet size for OTA testing	6M+ vehicles	~1,500 vehicles	~1,000 vehicles	Tesla A/B tests at population scale
Training data pipeline	Fleet to Dojo to model to OTA	Ops fleet to Google TPU to model to staged rollout	Ops fleet to Baidu cloud	Tesla’s loop is fastest due to fleet scale
Shadow mode testing	Yes — FSD runs silently on non-FSD vehicles	Limited to ops fleet	Not publicly disclosed	Tesla harvests data from non-paying vehicles at no marginal cost

Reading the table: The architectural fork between end-to-end and modular is not simply a technical preference — it determines the speed and cost of iteration. An end-to-end system can improve on a new edge case by retraining on more data; a modular system requires identifying which module failed, relabeling that module’s training set, retraining the module, and re-validating the full stack. Tesla’s architecture is faster to iterate on by design. Waymo’s modular architecture is more interpretable and easier to audit for safety regulators.

Section 2 — The improvement velocity equation

Three factors multiply together to determine how fast a player’s system improves in the field:

Data volume — miles of real-world driving entering the training pipeline per month
Compute — training throughput measured by the infrastructure available to process that data
Update cadence — how often improved models reach the live fleet and begin generating new training data

Metric	Tesla	Waymo	Baidu
Miles per month into training pipeline	1–2 billion (est.)	5–10 million (est.)	20–50 million (est.)
Training compute (relative)	High — Dojo cluster plus NVIDIA GPUs	Medium — Google TPU fleet	Medium — Baidu cloud
OTA cadence	Weekly	Quarterly or longer	Not publicly disclosed
Shadow mode coverage	6M+ vehicles	None	None
Driverless quality miles	Lower — mostly supervised consumer driving	High — all commercially driverless	High — driverless in designated cities
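The multiplicative framing can be made concrete with a rough sketch. The function name velocity_index, the CYCLES_PER_YEAR mapping, and the default compute factor of 1.0 are assumptions made for illustration, not figures from any of the three companies. The mileage inputs are midpoints of the estimated ranges in the table, and Tesla's cadence is treated as weekly. Compute is left as a neutral multiplier because the table describes it only qualitatively.

```python
from typing import Optional

# Update cycles per year implied by each cadence label (illustrative mapping).
CYCLES_PER_YEAR = {"weekly": 52, "quarterly": 4}


def velocity_index(miles_per_month: float,
                   cadence: Optional[str],
                   compute_factor: float = 1.0) -> Optional[float]:
    """Relative improvement-velocity index: data x compute x cadence.

    Returns None when the cadence is not publicly disclosed, because a
    product with an unknown factor cannot be computed.
    """
    if cadence is None:
        return None
    return miles_per_month * compute_factor * CYCLES_PER_YEAR[cadence]


# Midpoints of the estimated ranges in the table (miles per month).
tesla = velocity_index(1.5e9, "weekly")
waymo = velocity_index(7.5e6, "quarterly")
baidu = velocity_index(35e6, None)  # cadence not publicly disclosed

print(f"Tesla/Waymo index ratio: {tesla / waymo:,.0f}x")
print(f"Baidu index: {baidu}")
```

The index deliberately ignores data quality, so it overstates Tesla's lead to the extent that supervised consumer miles carry less signal than driverless miles.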

Tesla’s data volume advantage is real and large. At an estimated 1–2 billion miles per month feeding the training pipeline, Tesla is processing roughly two orders of magnitude more raw miles than Waymo (5–10 million) and one to two orders of magnitude more than Baidu (20–50 million). The caveat is data quality: a supervised consumer driving mile, where the human driver may intervene before the model encounters the full difficulty of a scenario, is not the same as a driverless commercial mile where the autonomous system must resolve the situation on its own.
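The size of the gap can be checked directly from the table's estimated ranges. This is a minimal sketch; the MILES dictionary simply restates those estimates, and ratio_range is a helper name chosen for illustration.

```python
import math

# Estimated monthly training miles from the table, as (low, high) ranges.
MILES = {
    "Tesla": (1e9, 2e9),
    "Waymo": (5e6, 10e6),
    "Baidu": (20e6, 50e6),
}


def ratio_range(leader: str, other: str) -> tuple:
    """Smallest and largest mileage ratio consistent with the two ranges."""
    lo = MILES[leader][0] / MILES[other][1]
    hi = MILES[leader][1] / MILES[other][0]
    return lo, hi


for rival in ("Waymo", "Baidu"):
    lo, hi = ratio_range("Tesla", rival)
    print(f"Tesla vs {rival}: {lo:.0f}x to {hi:.0f}x "
          f"({math.log10(lo):.1f} to {math.log10(hi):.1f} orders of magnitude)")
```

On these estimates the ratio is 100x to 400x against Waymo and 20x to 100x against Baidu.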

Waymo’s driverless quality miles are fewer but arguably higher signal. Every Waymo commercial mile is a genuine test of full autonomy — no safety driver masking edge cases, no human override before the hard part. Waymo’s argument is that depth beats breadth for the edge cases that actually matter for safety.

Section 3 — Waymo Carcraft: the simulation edge

Waymo’s Carcraft simulation platform is the company’s answer to Tesla’s data volume advantage. Running an estimated 25,000 virtual cars simultaneously, Carcraft re-simulates every real-world disengagement and edge case in thousands of controlled variations. For every real incident in the Waymo fleet, Carcraft generates a family of simulated variants — different weather, different pedestrian timing, different vehicle speeds — and tests the model’s response to each before any OTA update is approved to ship.
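The variation step described above can be sketched as a parameter sweep around one logged incident. This is an illustration of the general technique only, not Carcraft's actual interface: the Scenario fields, the variants and approve_release functions, and all parameter values are hypothetical.

```python
from dataclasses import dataclass, replace
from itertools import product
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Scenario:
    weather: str
    pedestrian_delay_s: float  # pedestrian entry time relative to the logged event
    vehicle_speed_mph: float


def variants(base: Scenario,
             weathers: Iterable[str],
             delay_offsets_s: Iterable[float],
             speed_offsets_mph: Iterable[float]) -> Iterator[Scenario]:
    """Yield one perturbed scenario per combination of the three axes."""
    for w, d, s in product(weathers, delay_offsets_s, speed_offsets_mph):
        yield replace(base,
                      weather=w,
                      pedestrian_delay_s=base.pedestrian_delay_s + d,
                      vehicle_speed_mph=base.vehicle_speed_mph + s)


def approve_release(family: Iterable[Scenario],
                    passes: Callable[[Scenario], bool]) -> bool:
    """Approve a candidate model only if it passes every variant."""
    return all(passes(s) for s in family)


base = Scenario(weather="clear", pedestrian_delay_s=0.0, vehicle_speed_mph=25.0)
family = list(variants(base,
                       ["clear", "rain", "fog"],
                       [-0.5, 0.0, 0.5],
                       [-5.0, 0.0, 5.0]))
print(len(family))  # 3 x 3 x 3 combinations
```

Three values on each of three axes already yield 27 variants from a single incident, which is why a family of simulated cases grows quickly as axes are added.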

The strategic value of this approach is forward coverage: Carcraft can test scenarios that have never occurred in the real world. An ice storm in Phoenix, an unexpected pedestrian darting from a construction zone, a vehicle running a red light at a specific angle — Waymo can simulate and train on these before encountering them commercially. Tesla’s training pipeline, by contrast, requires encountering an edge case in the real world before training data for that scenario exists.

This is not a clear win for either approach. Waymo’s simulated scenarios are only as good as the simulation fidelity — a sensor model that does not accurately represent real-world LiDAR return can produce training signal that misleads the model in deployment. Tesla’s real-world data is noisy but genuine. The question is whether simulation fidelity has improved enough to close that gap.

Section 4 — Tesla’s shadow mode moat

Tesla runs FSD in shadow mode on vehicles where the owner has not purchased FSD. The neural net silently processes camera feeds and records what it would have done, without any intervention in the actual driving. When the shadow mode prediction diverges from the human driver’s actual behavior, those moments become high-value training examples: the model would have acted differently from the human, which is exactly the type of signal needed to identify model weaknesses.
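The divergence filter described above can be sketched as a comparison between logged human controls and the model's proposed controls. Tesla has not published how its pipeline selects examples, so the Frame fields, the tolerance values, and the function names here are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    t: float                 # timestamp in seconds
    human_steer_deg: float   # what the driver actually did
    human_accel: float       # m/s^2, negative means braking
    model_steer_deg: float   # what the shadow model would have done
    model_accel: float


def is_divergent(f: Frame,
                 steer_tol_deg: float = 5.0,
                 accel_tol: float = 1.0) -> bool:
    """True when the model's proposal differs from the human beyond tolerance."""
    return (abs(f.human_steer_deg - f.model_steer_deg) > steer_tol_deg
            or abs(f.human_accel - f.model_accel) > accel_tol)


def mine_divergences(frames: List[Frame], **tol: float) -> List[Frame]:
    """Keep only the frames worth sending back as training examples."""
    return [f for f in frames if is_divergent(f, **tol)]


frames = [
    Frame(0.0, 0.0, 0.0, 1.0, 0.1),    # model and human agree
    Frame(0.1, 0.0, -3.0, 0.5, 0.0),   # human brakes hard, model would not
    Frame(0.2, 12.0, 0.0, 2.0, 0.0),   # human steers, model would hold course
]
flagged = mine_divergences(frames)
print([f.t for f in flagged])
```

A filter like this records that the two disagreed, not which of them was right, so each flagged frame still needs a separate judgment about whether the human or the model made the better choice.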

Shadow mode effectively extends Tesla’s training fleet beyond FSD subscribers to the majority of the installed base running HW3 or HW4 hardware, which the tables above put at 6M+ vehicles. The data collection requires no paid subscription, no special enrollment, and adds no marginal cost per training example. No other AV company has access to a comparable passive data collection mechanism at this scale.

The limitation is the same as the broader Tesla data quality question: shadow mode captures what the human driver decided, not what the optimal autonomous response would have been. If a human driver made a suboptimal decision — braking unnecessarily, taking a wide turn, failing to anticipate a merge — shadow mode captures that suboptimal decision as training signal. The model learns to match human behavior, not to exceed it in the edge cases where human behavior is imperfect.

Section 5 — OTA as a competitive moat

The ability to update the live fleet’s software continuously is itself a competitive advantage that compounds over time. A company with weekly OTA cycles completes 13 times as many update cycles per year as a company with quarterly cycles (52 versus 4). If each cycle delivers a comparable gain, the weekly company improves roughly 13 times faster, holding all other factors equal.

Capability	Tesla	Waymo	Significance
Consumer OTA (non-commercial fleet)	Yes, weekly	No consumer fleet exists	Tesla iterates with 6M users simultaneously
Commercial ops OTA	Yes	Yes, staged and conservative	Waymo prioritizes ops safety over update speed
Rollback capability	Yes	Yes	Both can revert problematic software versions
A/B testing at scale	Yes — millions of vehicles per experiment	Limited — hundreds of vehicles	Tesla can run statistically significant population-scale experiments
Hardware compute OTA	HW4 features unlockable via software	Fixed hardware configuration	Tesla can activate new capabilities on existing deployed hardware
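The A/B testing row comes down to sample size. The sketch below uses the textbook standard error for the difference between two proportions. The per-vehicle event rate P and the arm sizes are hypothetical values chosen only to contrast "hundreds" with "millions", and the model assumes independent vehicles with the same underlying rate in both arms.

```python
import math


def se_diff(p: float, n_per_arm: int) -> float:
    """Standard error of the difference in event rates between two equal arms."""
    return math.sqrt(2 * p * (1 - p) / n_per_arm)


P = 0.01  # hypothetical per-vehicle event rate over the test window

for label, n in (("hundreds of vehicles", 500),
                 ("millions of vehicles", 1_000_000)):
    se = se_diff(P, n)
    print(f"{label}: smallest difference visible at about 2 SE = {2 * se:.5f}")
```

Because the standard error shrinks with the square root of the arm size, a 2,000-fold larger arm resolves differences about 45 times smaller under these assumptions.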

The compounding effect matters here. A faster update cycle means each iteration of the improvement loop — data collected, model trained, update shipped, new data collected — completes sooner. Over a 24-month horizon, the company running weekly cycles has executed the loop roughly 100 times; the company running quarterly cycles has executed it 8 times. This is before accounting for the feedback acceleration that larger fleets create.
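The loop counts follow from simple arithmetic, sketched here with a helper named loops chosen for illustration. It assumes 52 weekly and 4 quarterly cycles per year and that every cycle actually ships.

```python
def loops(months: int, cycles_per_year: int) -> int:
    """Completed improvement loops over a horizon given in months."""
    return months * cycles_per_year // 12


weekly = loops(24, 52)
quarterly = loops(24, 4)
print(weekly, quarterly, weekly / quarterly)
```

This gives 104 loops against 8 over 24 months, the same 13-to-1 ratio as the annual figures.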

Section 6 — Where the advantage actually lies

Neither approach dominates cleanly. Tesla’s data volume, shadow mode coverage, and OTA cadence create a flywheel that is difficult to replicate from a standing start. The weakness is verification: with a fleet of 6 million vehicles and weekly software updates, a regression that slips through validation can reach millions of vehicles within days. Waymo’s slower cadence is partly a deliberate choice, because a commercially driverless service cannot afford the reputational cost of shipping a regression to the paying public.

The mid-2026 picture is one of two different bets on where the binding constraint lies. Tesla is betting that data volume and update speed are the binding constraint, and that the edge case coverage problem solves itself with enough real-world miles. Waymo is betting that quality of simulation and driverless mile depth are the binding constraint, and that its smaller but higher-signal dataset is sufficient to build a system that is genuinely safer per mile.

Which bet is correct will become visible in the safety data over the next 24–36 months.

Benchmark context: this is the twelfth article in the physical AI series

The series covers physical AI from twelve angles:

Operational ramp metrics — production counts, deployment scale, miles driven
Humanoid robot technology — hardware generations, dexterity benchmarks, foundation model capabilities
AV safety and regulation — California DMV data, NHTSA crash reporting, state permit maps
Investment and valuation — capital flows, funding rounds, implied valuations
Compute and silicon — inference chips, training clusters, NVIDIA supply constraints
Sensor stack and perception architecture — Tesla vision vs. Waymo LiDAR
Robotaxi unit economics — break-even fleet sizes, cost-per-mile projections
Global race — Baidu, WeRide, European AV entrants
Master scorecard — unified ten-dimension competitive comparison
HD mapping and localization — localization architecture and the geographic expansion constraint
Fleet operations and remote assistance — teleoperator ratios and the human-in-the-loop scaling constraint
Software stack and OTA pipeline — this article

The improvement velocity question is not answerable from public data alone. Tesla’s training miles are estimated, not disclosed. Waymo’s simulation throughput is described qualitatively. The OTA cadence figures are approximate. What is clear is the structural logic: the architecture, fleet size, and update cadence that each company has built create very different improvement trajectories, and those trajectories compound.