2026-06-18
Physical AI Software Stack — Waymo's Modular Pipeline vs Tesla's End-to-End Neural Network: The Most Consequential AV Architecture Debate
Waymo uses a modular pipeline with interpretable layers; Tesla bets on end-to-end neural nets trained on video from a 6M-vehicle fleet; both are converging toward hybrid architectures.
Article 136 in the Physical AI Benchmark Series
The biggest unresolved debate in autonomous vehicle engineering is not about sensors, maps, or cities; it is about architecture. Should you build a modular pipeline, where separate models handle perception, prediction, and planning with interpretable intermediate outputs at every stage? Or should you build an end-to-end neural network, where raw sensor data flows directly into a network that outputs steering, throttle, and braking commands, trained entirely on video from a real-world fleet? Waymo chose modular. Tesla chose end-to-end. This is not merely a technical preference: it determines safety philosophy, regulatory posture, debugging capability, and ultimately which company can scale faster and into which parts of the world.
All figures labeled “(est.)” are derived from public disclosures, research publications, industry analyst estimates, and reasonable inference rather than independently verified primary data.
Section 1 — Waymo’s Modular Stack
Waymo’s software architecture is a layered modular pipeline. Each layer receives the output of the layer below it, processes it using one or more specialized neural networks or rule-based systems, and passes a structured representation upward. The design philosophy is rooted in classical software engineering: separate concerns, test each component independently, and ensure that any failure can be diagnosed at the module level.
| Module | What it does | Technology | Key advantage |
|---|---|---|---|
| Perception | Takes raw sensor data (lidar + camera + radar) and produces a structured world representation: vehicles, pedestrians, cyclists, road markings, traffic signals | Multiple specialized neural networks (one per object class per sensor); sensor fusion combines outputs | Each perception model is individually testable, validatable, and updatable; safety engineers inspect intermediate outputs |
| Prediction | Takes the structured world model from Perception and predicts future trajectories for all agents (where will that pedestrian walk? what will that car do?) | MultiPath++ (Waymo’s published trajectory prediction model); outputs probability distributions over future states | Probabilistic outputs make uncertainty explicit; planners can be risk-aware |
| Planning | Takes predicted trajectories and produces a safe, comfortable driving plan for the Waymo vehicle | Learned planning models (Waymo’s published ChauffeurNet research used imitation learning, i.e. behavior cloning) + rule-based safety layers; multiple competing plans generated and scored | Rule-based safety layer = hard constraints the neural net cannot violate (e.g., never cross double yellow) |
| Control | Converts the planning output into precise steering, throttle, and braking commands | Traditional control theory (PID controllers); separable from planning | Predictable, certifiable, inspectable by regulators |
| HD Map | Provides prior knowledge of road structure, lane geometry, traffic signal locations | Waymo’s proprietary HD maps (updated continuously via fleet) | Reduces perceptual uncertainty; lidar can localize against map with centimeter precision |
| Simulation | Tests each module and the full stack in synthetic environments before deployment | Waymo’s Simulation City; NeRF-based scene reconstruction | 1 real mile generates 1,000+ simulated variations (est.) |
| Safety monitor | Independent watchdog that can override all other modules and bring vehicle to safe stop | Rule-based; not neural; designed to be provably correct | Ultimate safety backstop; key to regulatory confidence |
The modular design has a fundamental structural advantage: it creates natural audit points. When a Waymo vehicle makes an unexpected decision, engineers can inspect the perception layer output, verify that the correct objects were detected, then inspect the prediction layer to see what trajectories were forecast for each agent, then inspect the planning layer to understand which plan was selected and why. This is interpretability by architecture — not a feature added on top, but built into the system’s fundamental design.
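The audit-point idea can be illustrated with a minimal sketch. Everything here is hypothetical: the `AuditedPipeline` class, the toy `perceive`, `predict`, and `plan` functions, and the thresholds are stand-ins invented for illustration, not Waymo's code. The only point is that each stage's output is stored where an engineer can inspect it after the fact.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class AuditedPipeline:
    """Runs named stages in order and keeps every intermediate output."""
    stages: list[tuple[str, Callable[[Any], Any]]]
    trace: dict[str, Any] = field(default_factory=dict)

    def run(self, sensor_frame: Any) -> Any:
        self.trace.clear()
        data = sensor_frame
        for name, stage in self.stages:
            data = stage(data)
            self.trace[name] = data  # audit point: stored for later inspection
        return data


def perceive(frame):
    # Toy perception (assumption): keep only returns within 50 m.
    return [obj for obj in frame if obj["range_m"] < 50.0]


def predict(objects):
    # Toy prediction (assumption): constant closing speed, forecast 1 s ahead.
    return [{**o, "range_in_1s_m": o["range_m"] - o["closing_mps"] * 1.0} for o in objects]


def plan(forecasts):
    # Toy planning (assumption): brake if any agent is forecast inside 10 m.
    return "brake" if any(f["range_in_1s_m"] < 10.0 for f in forecasts) else "cruise"


pipeline = AuditedPipeline([("perception", perceive), ("prediction", predict), ("planning", plan)])
frame = [
    {"id": "ped-1", "range_m": 14.0, "closing_mps": 6.0},
    {"id": "car-7", "range_m": 80.0, "closing_mps": 0.0},
]
decision = pipeline.run(frame)
for stage_name, output in pipeline.trace.items():
    print(stage_name, "->", output)
```

If the vehicle braked unexpectedly, the trace shows whether the cause was a perception output, a forecast, or the planning rule, which is the diagnosis path the paragraph above describes.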
The safety monitor is perhaps the most consequential single component in Waymo’s stack. It is explicitly not a neural network. It is a rule-based system designed to be provably correct for a defined set of safety-critical conditions. The safety monitor can override every other module — including the planner and the controller — and bring the vehicle to a safe stop if any of its conditions are triggered. This separation of the safety-critical override from the performance-driving neural components is the engineering manifestation of a core principle: for certification and regulatory approval, some behaviors must be guaranteed, not merely probable.
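A rule-based override of this kind can be sketched in a few lines. This is an assumption-laden illustration, not Waymo's safety monitor: the `Command` type, the `SAFE_STOP` braking profile, and the single stopping-distance rule are all hypothetical. What it shows is the structure: a small, deterministic check that sits after the planner and can replace its output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    steer_rad: float
    accel_mps2: float


# Hypothetical safe-stop profile: hold the wheel straight, brake at 4 m/s^2.
SAFE_STOP_DECEL_MPS2 = 4.0
SAFE_STOP = Command(steer_rad=0.0, accel_mps2=-SAFE_STOP_DECEL_MPS2)


def safety_monitor(proposed: Command, speed_mps: float, gap_m: float,
                   margin_m: float = 2.0) -> Command:
    """Pass the planner's command through unless a hard rule is violated.

    Rule (assumed for illustration): if stopping at the safe-stop
    deceleration would consume the gap to the nearest obstacle minus a
    margin, override with SAFE_STOP. Pure arithmetic, no learned parts.
    """
    stopping_distance_m = speed_mps ** 2 / (2.0 * SAFE_STOP_DECEL_MPS2)
    if stopping_distance_m + margin_m >= gap_m:
        return SAFE_STOP
    return proposed


planner_cmd = Command(steer_rad=0.05, accel_mps2=0.5)
print(safety_monitor(planner_cmd, speed_mps=10.0, gap_m=50.0))  # planner command passes
print(safety_monitor(planner_cmd, speed_mps=20.0, gap_m=50.0))  # overridden
```

Because the check is a closed-form inequality, its behavior can be enumerated and tested exhaustively over a bounded input range, which is the property that makes such components attractive for certification.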
Waymo’s MultiPath++ trajectory prediction model, published in peer-reviewed research, is a good example of the depth of the modular approach. MultiPath++ does not predict a single future trajectory for each agent; it predicts a probability distribution over possible trajectories, quantifying the uncertainty in each prediction. The planner downstream can then reason explicitly about risk — choosing plans that are safe across the full distribution of predicted futures, not just the most likely one. This probabilistic risk-awareness is difficult to achieve in a purely end-to-end system.
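To make the risk-aware planning step concrete, here is a minimal sketch of how a planner might consume a multimodal prediction. It is not MultiPath++ or Waymo's planner: the two pedestrian modes, their probabilities, the candidate plans, and the 2 m safety gap are all invented for illustration.

```python
import math


def min_gap(ego_path, agent_path):
    """Smallest distance between ego and agent at matching timesteps."""
    return min(math.dist(e, a) for e, a in zip(ego_path, agent_path))


def collision_risk(ego_path, modes, safe_gap_m=2.0):
    """Total probability of the agent modes that come within safe_gap_m."""
    return sum(p for p, path in modes if min_gap(ego_path, path) < safe_gap_m)


# Hypothetical pedestrian forecast: (probability, [(x, y) per timestep]).
modes = [
    (0.7, [(10.0, 4.0), (10.0, 4.0), (10.0, 4.0)]),  # stays on the sidewalk
    (0.3, [(10.0, 4.0), (10.0, 2.0), (10.0, 0.0)]),  # steps into the lane
]

# Hypothetical ego plans along the lane centerline (y = 0).
plans = {
    "keep_speed": [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)],
    "slow_down": [(0.0, 0.0), (3.0, 0.0), (6.0, 0.0)],
}

risks = {name: collision_risk(path, modes) for name, path in plans.items()}
best = min(risks, key=risks.get)
print(risks, "->", best)

# A planner that looked only at the most likely mode would see no risk at all.
most_likely = max(modes, key=lambda m: m[0])
print(collision_risk(plans["keep_speed"], [most_likely]))
```

Against the full distribution, `keep_speed` carries the 30% mode's probability as risk while `slow_down` carries none; against the single most likely mode, both look equally safe. That gap is what a probabilistic predictor buys the planner.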
Section 2 — Tesla’s End-to-End Stack (FSD v12+)
Tesla’s Full Self-Driving (FSD) version 12 represented a fundamental architectural shift: from a modular system (which early FSD versions used) to an end-to-end neural network. In FSD v12 and beyond, raw video from Tesla’s 8 cameras flows into a neural network that directly outputs a driving plan — no explicit object detection, no explicit trajectory prediction, no hand-coded rules in the critical path. The neural network learns to drive by imitating human drivers from a dataset of billions of miles (est.) of intervention-logged video.
| Component | What it does | Technology | Key advantage |
|---|---|---|---|
| Video tokenizer | Converts 8-camera video feeds into tokens a neural net can process | Tesla’s custom video tokenizer; similar to Vision Transformer (ViT) concept | Processes spatial + temporal context simultaneously; no hand-coded object detection |
| End-to-end neural network | Takes tokenized video (past + present frames) and directly outputs a driving plan (trajectory + velocity profile) | Transformer architecture; trained on 6M+ vehicle fleet data; no intermediate structured representation | Learns driving behaviors engineers could not explicitly code; handles long-tail scenarios via training data scale |
| Occupancy Network | Predicts 3D occupancy of space around the vehicle (what volumes are occupied and likely to be occupied in the future) | Neural occupancy prediction; replaces traditional object detection + tracking | Handles objects that do not fit predefined categories (trash bags, unusual vehicles) |
| Auto-labeling pipeline | Automatically labels fleet video for training (avoids need for human annotators at scale) | Neural labeling models; human review for edge cases | Scales to billions of miles without proportional human annotation cost |
| No HD maps | FSD v12+ does not require pre-built HD maps of the road | Vision-based localization against real-time camera observations | Works in cities Waymo has not mapped; scales geographically without map maintenance cost |
| Dojo training cluster | Trains the end-to-end model at scale | Tesla’s custom D1 chips, ExaPOD clusters (1+ ExaFLOP est.), used alongside large NVIDIA GPU clusters | Potentially lower training cost per model update than rented H100 clusters (est.) |
| Intervention-based learning | Driver interventions (taking over from FSD) are logged as training signal for edge cases | Supervised learning on human corrections | 6M+ fleet generates enormous volume of intervention data |
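The imitation-learning objective behind this stack (learn the mapping from observation to action by copying what human drivers did) can be shown at toy scale. The sketch below is an assumption throughout: one input feature, a linear model, and five made-up demonstrations stand in for a transformer trained on fleet video. Only the training objective carries over.

```python
def fit_behavior_clone(demos, lr=0.1, epochs=500):
    """Fit steer = w * offset + b to human demonstrations.

    demos: list of (lateral_offset_m, human_steer) pairs.
    Plain gradient descent on mean squared error between the model's
    steering output and the human's recorded steering.
    """
    w, b = 0.0, 0.0
    n = len(demos)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in demos) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in demos) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical demonstrations: the human steers back toward lane center.
demos = [(-1.0, 0.5), (-0.5, 0.25), (0.0, 0.0), (0.5, -0.25), (1.0, -0.5)]
w, b = fit_behavior_clone(demos)
print(f"learned policy: steer = {w:.3f} * offset + {b:.3f}")
```

Nothing in the code states the rule "steer toward the lane center"; the policy recovers it from examples. The same property is what makes the learned behavior hard to inspect once the model has billions of parameters instead of two.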
The Occupancy Network deserves particular attention. Traditional modular perception systems work by detecting objects from a predefined taxonomy: cars, pedestrians, cyclists, trucks. Anything that does not fit the taxonomy risks being missed or misclassified. Tesla’s Occupancy Network sidesteps the classification step: rather than asking “what object is this?”, it asks “is this volume of space occupied, and is it likely to move?” A trash bag blowing across the road, a mattress that fell off a truck, an unusual construction vehicle: all of these can be handled by occupancy prediction without a specific category for them. This is a genuine capability advantage over purely taxonomy-based perception.
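A class-agnostic occupancy query can be sketched as follows. The set of occupied voxels is assumed to come from an upstream network; here it is hard-coded, and `voxel_of`, `path_is_blocked`, the 0.5 m cell size, and the vehicle height are hypothetical names and values chosen for illustration. Note that no object label appears anywhere in the check.

```python
def voxel_of(x, y, z, cell_m=0.5):
    """Index of the voxel containing the point (x, y, z), in meters."""
    return (int(x // cell_m), int(y // cell_m), int(z // cell_m))


def path_is_blocked(occupied, path_xy, cell_m=0.5, vehicle_height_m=1.5):
    """True if any voxel column the path passes through is occupied
    at or below the vehicle's height. Asks 'is it occupied?', never
    'what is it?'."""
    levels = range(int(vehicle_height_m // cell_m) + 1)
    for x, y in path_xy:
        ix, iy, _ = voxel_of(x, y, 0.0, cell_m)
        if any((ix, iy, iz) in occupied for iz in levels):
            return True
    return False


# Hypothetical occupancy output: an unlabeled blob (say, a mattress) near x = 8 m.
occupied = {voxel_of(8.1, 0.2, 0.3), voxel_of(8.4, -0.1, 0.3)}

lane_path = [(float(x), 0.0) for x in range(12)]
shifted_path = [(float(x), 3.0) for x in range(12)]
print(path_is_blocked(occupied, lane_path))     # path runs through the blob
print(path_is_blocked(occupied, shifted_path))  # path 3 m to the side
```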
The fleet data flywheel is the most consequential structural advantage in Tesla’s approach. With more than 6 million vehicles on the road, Tesla can draw on a very large and continuously growing supply of driving data, including the rare corner-case scenarios that are hardest to encounter with a small fleet. When a driver takes over from FSD at a difficult intersection at night in rain, that intervention is logged, the video is labeled (automatically, via the auto-labeling pipeline), and the correction becomes training data. A later model update can then incorporate that edge case. Waymo’s much smaller fleet (on the order of thousands of vehicles rather than millions, est.) cannot generate comparable edge-case coverage from real-world data volume alone, which is why Waymo invests heavily in simulation to cover scenarios its real-world fleet has not encountered.
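The first step of the flywheel, turning takeovers into training clips, might look like the sketch below. The log format (timestamp, engaged flag), the `intervention_clips` function, and the 5-second window are assumptions for illustration; Tesla's actual triggers and clip selection are not public in this detail, and a real system would also need to separate genuine corrections from drivers simply switching the feature off.

```python
def intervention_clips(log, window_s=5.0):
    """Return (start_s, end_s) windows around each driver takeover.

    log: list of (timestamp_s, system_engaged) samples in time order.
    A takeover is a transition from engaged to not engaged.
    """
    clips = []
    prev_engaged = False
    for t, engaged in log:
        if prev_engaged and not engaged:
            clips.append((max(0.0, t - window_s), t + window_s))
        prev_engaged = engaged
    return clips


# Hypothetical drive log with two takeovers, at t = 2 s and t = 10 s.
log = [
    (0.0, True), (1.0, True), (2.0, False), (3.0, False),
    (4.0, True), (9.0, True), (10.0, False),
]
clips = intervention_clips(log)
print(clips)
```

Each window would then be handed to labeling and added to the training set, so the volume of such clips scales with fleet size rather than with engineering headcount.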
Section 3 — Architecture Comparison: Modular vs End-to-End
| Dimension | Waymo (modular) | Tesla (end-to-end) | Verdict |
|---|---|---|---|
| Interpretability | High: each module has inspectable outputs; engineers diagnose failures precisely | Low: “why did it turn left?” is difficult to answer from the neural net’s internal state | Waymo advantage for debugging and regulatory explanation |
| Certifiability | High: rule-based safety layers, separable modules, formal verification possible for components | Low: certifying a black-box neural net is an open research problem | Waymo advantage for formal safety cases |
| Scalability (geographic) | Lower: requires an HD map per city (time + cost per city); sensor suite expensive per vehicle | Higher: camera-only, mapless FSD can in principle operate in any city with roads | Tesla advantage for geographic scale |
| Scalability (edge cases) | Lower: modular systems require explicit engineering for new edge-case categories | Higher: end-to-end learns new behaviors from training data; edge cases handled implicitly at scale | Tesla advantage if fleet data is sufficient |
| Development speed | Slower: changing one module requires validating interactions with all others | Faster: retrain the whole model on new data; no per-module interfaces to re-validate, though the retrained model still needs full regression testing | Tesla advantage for iteration speed |
| Failure modes | Predictable: each module has defined failure modes; safety monitor catches module failures | Less predictable: novel input distributions can cause unexpected outputs | Waymo advantage; critical for safety |
| Sensor cost | High: lidar + camera + radar per vehicle; $5,000-15,000+ sensor cost (est.) | Low: cameras only; minimal added hardware cost | Tesla cost advantage |
| Map maintenance cost | High: continuous map updates required per city | Near zero: no HD maps to maintain (standard navigation maps only) | Tesla advantage at scale |
| Current state of the art | Waymo’s modular system is the proven driverless commercial approach today | Tesla FSD v12/v13 end-to-end is the fastest-improving supervised driving system today | Both are state of the art in their respective deployment regimes |
The most important dimension in this comparison is failure modes. Waymo’s modular architecture produces predictable, diagnosable failures: if the perception module misclassifies an object, engineers can observe the misclassification in the intermediate output and fix the specific model responsible. Tesla’s end-to-end architecture produces less predictable failures: because there are no intermediate structured representations, a model encountering a novel input distribution may produce an output that is surprising in ways that are difficult to anticipate or diagnose without extensive testing. This is not a hypothetical concern — it is the central challenge of deploying neural systems in safety-critical applications, and it is the core reason that no end-to-end neural network has yet received regulatory approval for fully driverless commercial service.
The geographic scalability dimension is equally critical, but for commercial rather than safety reasons. Waymo’s requirement to build and maintain HD maps of every city it operates in is not merely a cost burden — it is a geographic bottleneck. Waymo has mapped a limited set of cities (San Francisco, Phoenix, Los Angeles, Austin, and a small number of others). Expanding to a new city requires mapping, validating the map, and often obtaining local regulatory approval before commercial operations can begin. Tesla’s mapless approach means FSD v12 can, in principle, operate in any city with roads without a city-specific preparation phase. For a global consumer vehicle company with vehicles in 50+ countries, this is a structural advantage with enormous commercial implications.
Section 4 — The Convergence Thesis
The most important insight from observing both architectures in 2025-2026 is that they are converging. Neither pure modular nor pure end-to-end appears to be the long-run answer. Both companies are adding elements of the opposite architecture to their own.
| Trend | Evidence | Implication |
|---|---|---|
| Industry convergence toward end-to-end | Waymo, Mobileye, and other modular-stack companies are adding neural end-to-end components to their modular pipelines (hybrid approach) | End-to-end may be the long-run winner; modular companies are hedging toward it |
| Tesla adding structured outputs | Tesla’s Occupancy Network and lane prediction add structure on top of the end-to-end output — partial convergence toward modular concepts | Hybrid architectures may outperform pure versions of either |
| Academic consensus shifting | Papers from major AV research groups increasingly use end-to-end architectures; Waymo’s own research papers show end-to-end experiments | Academic momentum is with end-to-end, which eventually flows into industry |
| LLM-based planning emerging | Wayve and research groups at major labs are experimenting with language-model-based planners (models that take scene inputs and output driving plans, often with natural-language explanations) | LLM planners could supersede both modular and traditional end-to-end; Waymo and Tesla both experimenting |
| Imitation vs reinforcement learning | Current end-to-end systems (including Tesla) are primarily imitation learning (copy human drivers); RL-trained systems can exceed human behavior | Tesla and Waymo both exploring RL; RL may be the next step change |
Waymo’s addition of end-to-end neural components to its modular pipeline is the most significant architectural signal of the past two years. Waymo has published research on end-to-end driving experiments and has described a hybrid approach in which end-to-end components handle the “interesting” parts of driving while rule-based components maintain the safety constraints. This is not an admission that modular was wrong — it is an acknowledgment that the long run may favor a hybrid that combines the interpretability and safety guarantees of modular with the learning efficiency and edge-case coverage of end-to-end.
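The hybrid pattern described here, learned components proposing and rule-based components constraining, can be sketched as a filter followed by a ranking. The candidate plans, the `learned_score` values, and the two rules are hypothetical; the double-yellow rule echoes the example constraint from the module table in Section 1.

```python
def no_double_yellow_crossing(plan):
    return not plan["crosses_double_yellow"]


def within_speed_limit(plan):
    return plan["peak_speed_mps"] <= plan["speed_limit_mps"]


HARD_RULES = [no_double_yellow_crossing, within_speed_limit]


def select_plan(candidates, hard_rules=HARD_RULES):
    """Pick the best-scoring learned candidate that breaks no hard rule.

    Returns None when every candidate is vetoed, so the caller can
    fall back to a safe stop.
    """
    allowed = [c for c in candidates if all(rule(c) for rule in hard_rules)]
    if not allowed:
        return None
    return max(allowed, key=lambda c: c["learned_score"])


# Hypothetical proposals from a learned planner, with its own scores.
candidates = [
    {"name": "overtake", "learned_score": 0.9, "crosses_double_yellow": True,
     "peak_speed_mps": 14.0, "speed_limit_mps": 15.0},
    {"name": "rush", "learned_score": 0.8, "crosses_double_yellow": False,
     "peak_speed_mps": 18.0, "speed_limit_mps": 15.0},
    {"name": "follow", "learned_score": 0.7, "crosses_double_yellow": False,
     "peak_speed_mps": 12.0, "speed_limit_mps": 15.0},
]
chosen = select_plan(candidates)
print(chosen["name"])
```

The learned scores decide among legal plans only; no score, however high, can buy back a vetoed plan. That ordering (filter first, rank second) is what lets the safety constraints be stated and checked independently of the network.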
Tesla’s addition of the Occupancy Network and structured lane predictions represents the mirror movement: adding structure to what was designed to be a pure end-to-end system. The Occupancy Network provides a partially interpretable intermediate representation — engineers can inspect the 3D occupancy map to understand what the model “sees” even without traditional object categories. This is a step toward the interpretability that modular systems provide by design.
The emergence of LLM-based planners is the wildcard that could render both the current modular and end-to-end paradigms obsolete. Systems like Wayve’s LINGO-2, a vision-language-action model, pair driving with language: the model takes camera input and produces both a driving action and a natural-language explanation of it. (Wayve’s GAIA-1, by contrast, is a generative world model that synthesizes driving video rather than a planner.) This approach potentially combines the interpretability of natural language (you can ask the model why it made a decision and receive a human-readable answer) with the generalization of foundation models trained on internet-scale data. Neither Waymo nor Tesla has yet deployed an LLM-based planner in a production commercial system, but both are actively researching the approach.
Section 5 — Software Stack Benchmark Scorecard
| Dimension | Waymo | Tesla | Edge |
|---|---|---|---|
| Current driverless reliability | Proven — 150,000+ rides per week, 10M+ driverless miles (est.) | Not yet driverless (supervised FSD only) | Waymo |
| Interpretability and debuggability | High (modular) | Low (end-to-end black box) | Waymo |
| Geographic scalability | Lower (HD maps required per city) | Higher (mapless FSD) | Tesla |
| Edge case learning speed | Slower (needs engineering + retraining) | Faster (collect fleet data, retrain, deploy) | Tesla |
| Regulatory certifiability | Higher (rule-based layers, inspectable modules) | Lower (neural net certification unsolved) | Waymo |
| Sensor cost per vehicle | High (~$5K-15K lidar+camera+radar est.) | Low (cameras only) | Tesla |
| Architecture trajectory | Converging toward hybrid (adding end-to-end components) | Converging toward hybrid (adding structured outputs) | Tie — both heading toward hybrid |
| Long-term winner | Uncertain — modular wins on safety explainability; end-to-end wins on scalability; hybrid may be the answer | — | Open question; the most important unresolved debate in Physical AI |
The scorecard reveals a fundamental tension that the AV industry has not yet resolved. Waymo is ahead on every dimension that matters most for safety certification and regulatory approval today — interpretability, certifiability, predictable failure modes, and proven commercial driverless operation. Tesla is ahead on every dimension that matters most for commercial scale at speed — geographic reach, edge-case learning, iteration velocity, and hardware cost per vehicle.
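The split is easy to verify by tallying the Edge column of the scorecard. The dictionary below simply transcribes that column; "Open" is shorthand used here for the unresolved long-term-winner row.

```python
from collections import Counter

# Edge column of the Section 5 scorecard, transcribed row by row.
scorecard_edges = {
    "Current driverless reliability": "Waymo",
    "Interpretability and debuggability": "Waymo",
    "Geographic scalability": "Tesla",
    "Edge case learning speed": "Tesla",
    "Regulatory certifiability": "Waymo",
    "Sensor cost per vehicle": "Tesla",
    "Architecture trajectory": "Tie",
    "Long-term winner": "Open",
}

tally = Counter(scorecard_edges.values())
for side, count in tally.items():
    print(f"{side}: {count}")
```

The count comes out even, three rows each, with the two remaining rows undecided. The tally treats every row as equally weighted, which the scorecard itself does not claim; the balance shifts depending on whether safety certification or commercial scale is weighted more heavily.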
The technology trajectory suggests that these advantages will converge: as hybrid architectures mature, end-to-end systems will become more interpretable through auxiliary structured outputs and better explanation tools; as neural certification methods advance, the regulatory advantage of rule-based safety layers may shrink. The open question is timing — whether interpretable end-to-end systems will be certifiable before Waymo’s modular system can match Tesla’s fleet-data learning efficiency.
One final observation: the entire framing of “Waymo vs Tesla” obscures the fact that both companies may be pointing toward the same destination via different routes. Waymo’s hybrid experiments and Tesla’s structured outputs are not divergence — they are convergence. The AV architecture debate of the 2020s may ultimately be remembered not as a battle between two irreconcilable paradigms, but as the decade in which the industry learned what a hybrid architecture needed to look like by building the extremes first and discovering what each was missing.
Note: All figures labeled “(est.)” are derived from public disclosures, research publications, analyst estimates, and industry reports as of mid-2026. This article does not constitute investment advice.
Sources
- Waymo MultiPath++ trajectory prediction — Waymo Research
- Tesla FSD v12 end-to-end architecture — Tesla AI Day
- Tesla Occupancy Network — Tesla AI
- Waymo simulation infrastructure — Waymo Research
- End-to-end autonomous driving survey — arXiv