2026-06-17
AV Software Stack & OTA Pipeline Index — Who Improves Fastest in the Field (Mid-2026)
OTA cadence, simulation depth, and the field-data flywheel that determines how fast Tesla, Waymo, and Baidu actually improve in deployment.
The question behind the benchmarks
Miles driven, disengagement rates, and permit counts measure where an AV system stands today. The question that matters more for long-term competitive positioning is a different one: how fast does the system get better? Improvement velocity is a product of three multiplied factors — data volume, compute throughput, and OTA update cadence — and the three players at the frontier have made fundamentally different architectural bets on each factor. This is the twelfth article in the physical AI benchmark series.
Section 1 — Software architecture comparison
The foundational design choices made at the architecture level cascade into every downstream capability: how data is labeled, how models are trained, how updates are validated, and how quickly a new software version can ship to the live fleet.
| Dimension | Tesla FSD | Waymo Driver | Baidu Apollo | Notes |
|---|---|---|---|---|
| Core architecture | End-to-end neural net (camera to action) | Modular (perception + prediction + planning) | Modular (perception + prediction + planning) | Tesla bets on learned policy; others keep explicit planning |
| Sensor input | 8 cameras only | LiDAR + camera + radar | LiDAR + camera + radar | Single-modality vs sensor fusion |
| Simulation platform | Tesla Dojo + internal sim | Waymo Simulation (Carcraft) | AADS simulator | Waymo’s Carcraft reportedly runs ~25,000 virtual cars simultaneously |
| OTA update frequency | Weekly to monthly (consumer FSD) | Quarterly or longer (ops fleet) | Not publicly disclosed | Tesla pushes more frequently than either competitor |
| Fleet size for OTA testing | 6M+ vehicles | ~1,500 vehicles | ~1,000 vehicles | Tesla A/B tests at population scale |
| Training data pipeline | Fleet to Dojo to model to OTA | Ops fleet to Google TPU to model to staged rollout | Ops fleet to Baidu cloud | Tesla’s loop is fastest due to fleet scale |
| Shadow mode testing | Yes — FSD runs silently on non-FSD vehicles | Limited to ops fleet | Not publicly disclosed | Tesla harvests data from non-paying vehicles at no marginal cost |
Reading the table: The architectural fork between end-to-end and modular is not simply a technical preference — it determines the speed and cost of iteration. An end-to-end system can improve on a new edge case by retraining on more data; a modular system requires identifying which module failed, relabeling that module’s training set, retraining the module, and re-validating the full stack. Tesla’s architecture is faster to iterate on by design. Waymo’s modular architecture is more interpretable and easier to audit for safety regulators.
Section 2 — The improvement velocity equation
Three factors multiply together to determine how fast a player’s system improves in the field:
- Data volume — miles of real-world driving entering the training pipeline per month
- Compute — training throughput measured by the infrastructure available to process that data
- Update cadence — how often improved models reach the live fleet and begin generating new training data
| Metric | Tesla | Waymo | Baidu |
|---|---|---|---|
| Miles per month into training pipeline | 1–2 billion (est.) | 5–10 million (est.) | 20–50 million (est.) |
| Training compute (relative) | High — Dojo cluster plus NVIDIA GPUs | Medium — Google TPU fleet | Medium — Baidu cloud |
| OTA cadence | Weekly to monthly (consumer FSD) | Quarterly or longer | Not publicly disclosed |
| Shadow mode coverage | 6M+ vehicles | Limited to ops fleet | Not publicly disclosed |
| Driverless quality miles | Lower — mostly supervised consumer driving | High — all commercially driverless | High — driverless in designated cities |
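As a rough illustration of the multiplicative framing, the sketch below treats improvement velocity as the plain product of the three factors and compares Tesla with Waymo using the midpoints of the estimated ranges in the table. The `FleetFactors` class, the `velocity_index` function, and the choice to set relative compute to 1.0 for both companies are assumptions made for illustration. The table describes compute only qualitatively, Waymo's cadence of 4 cycles per year is the upper end of "quarterly or longer", and the index ignores the data-quality difference in the last row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FleetFactors:
    """Hypothetical container for the three factors named above."""
    miles_per_month: float   # data volume entering training
    relative_compute: float  # unitless placeholder, not a measured value
    cycles_per_year: float   # OTA update cadence


def velocity_index(f: FleetFactors) -> float:
    """Plain product of the three factors (an illustrative index, not a measurement)."""
    return f.miles_per_month * f.relative_compute * f.cycles_per_year


# Midpoints of the article's estimated ranges; compute fixed at 1.0 by assumption.
tesla = FleetFactors(miles_per_month=1.5e9, relative_compute=1.0, cycles_per_year=52)
waymo = FleetFactors(miles_per_month=7.5e6, relative_compute=1.0, cycles_per_year=4)

ratio = velocity_index(tesla) / velocity_index(waymo)
print(f"Illustrative index ratio, Tesla vs Waymo: {ratio:,.0f}x")
```

Because the factors multiply, the ratio is simply the product of the per-factor ratios (200x in miles times 13x in cadence), so a change in any single factor moves the index proportionally.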
Tesla’s data volume advantage is real and large. At an estimated 1–2 billion miles per month feeding the training pipeline, Tesla is processing roughly one to two orders of magnitude more raw miles than either competitor: about 100 to 400 times Waymo’s estimated volume and about 20 to 100 times Baidu’s. The caveat is data quality: a supervised consumer driving mile, where the human driver may intervene before the model encounters the full difficulty of a scenario, is not the same as a driverless commercial mile where the autonomous system must resolve the situation on its own.
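The size of the gap can be checked directly from the estimated ranges in the table. The sketch below is a minimal calculation, assuming the table's figures are taken at face value; the `ratio_range` helper and the constant names are hypothetical.

```python
import math


def ratio_range(a, b):
    """Return (low, high) bounds of a/b for two (low, high) ranges."""
    return a[0] / b[1], a[1] / b[0]


# Estimated miles per month into the training pipeline, from the table above.
TESLA = (1e9, 2e9)
WAYMO = (5e6, 10e6)
BAIDU = (20e6, 50e6)

for name, other in (("Waymo", WAYMO), ("Baidu", BAIDU)):
    lo, hi = ratio_range(TESLA, other)
    print(
        f"Tesla vs {name}: {lo:.0f}x to {hi:.0f}x "
        f"({math.log10(lo):.1f} to {math.log10(hi):.1f} orders of magnitude)"
    )
```

This compares raw volume only and says nothing about the signal carried by each mile.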
Waymo’s driverless quality miles are fewer but arguably higher signal. Every Waymo commercial mile is a genuine test of full autonomy — no safety driver masking edge cases, no human override before the hard part. Waymo’s argument is that depth beats breadth for the edge cases that actually matter for safety.
Section 3 — Waymo Carcraft: the simulation edge
Waymo’s Carcraft simulation platform is the company’s answer to Tesla’s data volume advantage. Running an estimated 25,000 virtual cars simultaneously, Carcraft re-simulates every real-world disengagement and edge case in thousands of controlled variations. For every real incident in the Waymo fleet, Carcraft generates a family of simulated variants — different weather, different pedestrian timing, different vehicle speeds — and tests the model’s response to each before any OTA update is approved to ship.
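The idea of expanding one real incident into a family of variants can be sketched as a parameter sweep. This is not Waymo's implementation, which is not public; the `Scenario` fields, the `expand_variants` function, and every parameter value are hypothetical and chosen only to mirror the three dimensions mentioned above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Scenario:
    """Hypothetical description of one logged incident."""
    weather: str
    pedestrian_delay_s: float  # when the pedestrian steps out
    vehicle_speed_mps: float   # speed of the other vehicle


def expand_variants(base, weathers, delay_offsets_s, speed_scales):
    """Build every combination of the supplied perturbations around a base scenario."""
    return [
        Scenario(w, base.pedestrian_delay_s + d, base.vehicle_speed_mps * s)
        for w, d, s in product(weathers, delay_offsets_s, speed_scales)
    ]


base = Scenario(weather="clear", pedestrian_delay_s=1.5, vehicle_speed_mps=12.0)
variants = expand_variants(
    base,
    weathers=["clear", "rain", "fog"],
    delay_offsets_s=[-0.5, 0.0, 0.5],
    speed_scales=[0.8, 1.0, 1.2],
)
print(len(variants))  # 3 * 3 * 3 = 27 variants, including the original
```

A full grid grows multiplicatively with each added dimension, so a production system would likely sample the space rather than enumerate it.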
The strategic value of this approach is forward coverage: Carcraft can test scenarios that have never occurred in the real world. An ice storm in Phoenix, an unexpected pedestrian darting from a construction zone, a vehicle running a red light at a specific angle: Waymo can simulate and train on these before encountering them commercially. Tesla’s training pipeline, by contrast, leans primarily on the fleet encountering an edge case in the real world before training data for that scenario exists, although Tesla also runs an internal simulator.
This is not a clear win for either approach. Waymo’s simulated scenarios are only as good as the simulation fidelity — a sensor model that does not accurately represent real-world LiDAR return can produce training signal that misleads the model in deployment. Tesla’s real-world data is noisy but genuine. The question is whether simulation fidelity has improved enough to close that gap.
Section 4 — Tesla’s shadow mode moat
Tesla runs FSD in shadow mode on vehicles where the owner has not purchased FSD. The neural net silently processes camera feeds and records what it would have done, without any intervention in the actual driving. When the shadow mode prediction diverges from the human driver’s actual behavior, those moments become high-value training examples — the model saw something different from what the human decided, which is exactly the type of signal needed to identify model weaknesses.
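The divergence check described above can be sketched as a simple filter over logged frames. Tesla's actual trigger logic is not public; the `Frame` fields, the `flag_divergences` function, and both tolerance values are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """Hypothetical log record pairing the human action with the shadow prediction."""
    timestamp_s: float
    human_steer_deg: float
    model_steer_deg: float
    human_accel_mps2: float
    model_accel_mps2: float


def flag_divergences(frames, steer_tol_deg=5.0, accel_tol_mps2=1.0):
    """Return frames where the model's proposed action differs from the human's
    by more than either tolerance (tolerances are illustrative, not tuned)."""
    flagged = []
    for f in frames:
        steer_gap = abs(f.human_steer_deg - f.model_steer_deg)
        accel_gap = abs(f.human_accel_mps2 - f.model_accel_mps2)
        if steer_gap > steer_tol_deg or accel_gap > accel_tol_mps2:
            flagged.append(f)
    return flagged


log = [
    Frame(0.0, 0.0, 0.5, 0.2, 0.1),    # model and human agree
    Frame(0.1, 2.0, 12.0, 0.2, 0.3),   # steering disagreement
    Frame(0.2, 1.0, 1.5, -3.0, 0.0),   # human brakes hard, model would not
]
for f in flag_divergences(log):
    print(f"divergence at t={f.timestamp_s}s")
```

Only the flagged frames would need to be uploaded, which is one way a fleet of this size could keep data transfer manageable.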
Shadow mode effectively extends Tesla’s training fleet beyond the 6 million active vehicles to include the majority of the installed base running HW3 or HW4 hardware. The data collection requires no paid subscription, no special enrollment, and adds no marginal cost per training example. No other AV company has access to a comparable passive data collection mechanism at this scale.
The limitation is the same as the broader Tesla data quality question: shadow mode captures what the human driver decided, not what the optimal autonomous response would have been. If a human driver made a suboptimal decision — braking unnecessarily, taking a wide turn, failing to anticipate a merge — shadow mode captures that suboptimal decision as training signal. The model learns to match human behavior, not to exceed it in the edge cases where human behavior is imperfect.
Section 5 — OTA as a competitive moat
The ability to update the live fleet’s software continuously is itself a competitive advantage that compounds over time. A company shipping weekly completes 52 update cycles a year against 4 for a company shipping quarterly, so it runs the improvement loop roughly 13 times as often, holding all other factors equal.
| Capability | Tesla | Waymo | Significance |
|---|---|---|---|
| Consumer OTA (non-commercial fleet) | Yes, weekly | No consumer fleet exists | Tesla iterates with 6M users simultaneously |
| Commercial ops OTA | Yes | Yes, staged and conservative | Waymo prioritizes ops safety over update speed |
| Rollback capability | Yes | Yes | Both can revert problematic software versions |
| A/B testing at scale | Yes — millions of vehicles per experiment | Limited — hundreds of vehicles | Tesla can run statistically significant population-scale experiments |
| Hardware compute OTA | HW4 features unlockable via software | Fixed hardware configuration | Tesla can activate new capabilities on existing deployed hardware |
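Why experiment size matters in the A/B testing row can be shown with a standard two-proportion approximation. The sketch below is a simplification under stated assumptions: a hypothetical baseline event rate of 1% per vehicle, equal arm sizes, a normal approximation at roughly 95% confidence, and no treatment of statistical power or correlation between vehicles. The function name and arm sizes are illustrative.

```python
import math


def min_detectable_diff(p, n_per_arm, z=1.96):
    """Approximate smallest difference in event rate between two equal-sized
    arms that stands out from noise at about 95% confidence."""
    return z * math.sqrt(2 * p * (1 - p) / n_per_arm)


BASELINE_RATE = 0.01  # hypothetical per-vehicle event rate, not a reported figure

small = min_detectable_diff(BASELINE_RATE, n_per_arm=200)        # hundreds of vehicles
large = min_detectable_diff(BASELINE_RATE, n_per_arm=1_000_000)  # millions of vehicles

print(f"200 vehicles per arm:       {small:.4f}")
print(f"1,000,000 vehicles per arm: {large:.6f}")
print(f"sensitivity gain:           {small / large:.1f}x")
```

Sensitivity scales with the square root of the arm size, so a 5,000-fold larger experiment resolves differences about 70 times smaller.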
The compounding effect matters here. A faster update cycle means each iteration of the improvement loop — data collected, model trained, update shipped, new data collected — completes sooner. Over a 24-month horizon, the company running weekly cycles has executed the loop roughly 100 times; the company running quarterly cycles has executed it 8 times. This is before accounting for the feedback acceleration that larger fleets create.
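The cycle counts above follow from simple arithmetic, sketched here with a hypothetical `completed_cycles` helper that assumes a perfectly regular cadence.

```python
def completed_cycles(horizon_months: float, cycles_per_year: float) -> float:
    """Number of improvement loops completed over the horizon at a steady cadence."""
    return horizon_months * cycles_per_year / 12


weekly = completed_cycles(24, 52)    # weekly releases
quarterly = completed_cycles(24, 4)  # quarterly releases

print(weekly, quarterly, weekly / quarterly)
```

Over 24 months this gives 104 weekly cycles against 8 quarterly cycles, a ratio of 13.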
Section 6 — Where the advantage actually lies
Neither approach dominates cleanly. Tesla’s data volume, shadow mode coverage, and OTA cadence create a flywheel that is difficult to replicate from a standing start. The weakness is verification: with a fleet of 6 million vehicles and weekly software updates, a regression that escapes validation can reach a very large number of vehicles before the next release corrects it. Waymo’s slower cadence is partly a deliberate choice: a commercially driverless service cannot afford the reputational cost of shipping a regression to the paying public.
The mid-2026 picture is one of two different bets on where the binding constraint lies. Tesla is betting that data volume and update speed are the binding constraint, and that the edge case coverage problem solves itself with enough real-world miles. Waymo is betting that quality of simulation and driverless mile depth are the binding constraint, and that its smaller but higher-signal dataset is sufficient to build a system that is genuinely safer per mile.
Which bet is correct will become visible in the safety data over the next 24–36 months.
Benchmark context
This tracker is the twelfth in a series covering physical AI from multiple angles:
- Operational ramp metrics — production counts, deployment scale, miles driven
- Humanoid robot technology — hardware generations, dexterity benchmarks, foundation model capabilities
- AV safety and regulation — California DMV data, NHTSA crash reporting, state permit maps
- Investment and valuation — capital flows, funding rounds, implied valuations
- Compute and silicon — inference chips, training clusters, NVIDIA supply constraints
- Sensor stack and perception architecture — Tesla vision vs. Waymo LiDAR
- Robotaxi unit economics — break-even fleet sizes, cost-per-mile projections
- Global race — Baidu, WeRide, European AV entrants
- Master scorecard — unified ten-dimension competitive comparison
- HD mapping and localization — localization architecture and the geographic expansion constraint
- Fleet operations and remote assistance — teleoperator ratios and the human-in-the-loop scaling constraint
- Software stack and OTA pipeline — this article
The improvement velocity question is not answerable from public data alone. Tesla’s training miles are estimated, not disclosed. Waymo’s simulation throughput is described qualitatively. The OTA cadence figures are approximate. What is clear is the structural logic: the architecture, fleet size, and update cadence that each company has built create very different improvement trajectories, and those trajectories compound.
Sources
- Tesla FSD end-to-end architecture (Tesla AI Day)
- Waymo Carcraft simulation (Waymo technology)
- Waymo Driver modular architecture (Waymo research)
- Baidu Apollo open platform (Apollo)