2026-06-18

Physical AI Software Stack — Waymo's Modular Pipeline vs Tesla's End-to-End Neural Network: The Most Consequential AV Architecture Debate

Waymo uses a modular pipeline with interpretable layers; Tesla bets on end-to-end neural nets trained on video from a 6M-vehicle fleet; both are converging toward hybrid architectures.

Article 136 in the Physical AI Benchmark Series — Physical AI Software Stack Architecture: Waymo’s Modular Pipeline vs Tesla’s End-to-End Neural Network, and Why the Stack Choice Is the Most Consequential Technical Decision in AV History

The biggest unresolved debate in autonomous vehicle engineering is not about sensors, maps, or cities; it is about architecture. Should you build a modular pipeline, where separate models handle perception, prediction, and planning with interpretable intermediate outputs at every stage? Or should you build an end-to-end neural network, where raw sensor data flows directly into a network that outputs steering, throttle, and braking commands, trained entirely on video from a real-world fleet? Waymo chose modular. Tesla chose end-to-end. This is not merely a technical preference: it determines safety philosophy, regulatory posture, debugging capability, and ultimately which company can scale faster and into which parts of the world.

All figures labeled “(est.)” are derived from public disclosures, research publications, industry analyst estimates, and reasonable inference rather than independently verified primary data.

Section 1 — Waymo’s Modular Stack

Waymo’s software architecture is a layered modular pipeline. Each layer receives the output of the layer below it, processes it using one or more specialized neural networks or rule-based systems, and passes a structured representation upward. The design philosophy is rooted in classical software engineering: separate concerns, test each component independently, and ensure that any failure can be diagnosed at the module level.

Module	What it does	Technology	Key advantage
Perception	Takes raw sensor data (lidar + camera + radar) and produces a structured world representation: vehicles, pedestrians, cyclists, road markings, traffic signals	Multiple specialized neural networks (one per object class per sensor); sensor fusion combines outputs	Each perception model is individually testable, validatable, and updatable; safety engineers inspect intermediate outputs
Prediction	Takes the structured world model from Perception and predicts future trajectories for all agents (where will that pedestrian walk? what will that car do?)	MultiPath++ (Waymo’s published trajectory prediction model); outputs probability distributions over future states	Probabilistic outputs make uncertainty explicit; planners can be risk-aware
Planning	Takes predicted trajectories and produces a safe, comfortable driving plan for the Waymo vehicle	Learned planning models (e.g., behavior cloning) + rule-based safety layers (est.); multiple competing plans generated and scored	Rule-based safety layer = hard constraints the neural net cannot violate (e.g., never cross double yellow)
Control	Converts the planning output into precise steering, throttle, and braking commands	Traditional control theory (PID controllers); separable from planning	Predictable, certifiable, inspectable by regulators
HD Map	Provides prior knowledge of road structure, lane geometry, traffic signal locations	Waymo’s proprietary HD maps (updated continuously via fleet)	Reduces perceptual uncertainty; lidar can localize against map with centimeter precision
Simulation	Tests each module and the full stack in synthetic environments before deployment	Waymo’s Simulation City; NeRF-based scene reconstruction	1 real mile generates 1,000+ simulated variations (est.)
Safety monitor	Independent watchdog that can override all other modules and bring vehicle to safe stop	Rule-based; not neural; designed to be provably correct	Ultimate safety backstop; key to regulatory confidence
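The Control row in the table above names PID controllers. As a minimal sketch of that idea, the class below is a textbook PID loop driving a toy vehicle whose acceleration equals the controller output. The gains, the `track_speed` helper, and the vehicle model are all assumptions made for illustration; nothing here reflects Waymo's actual controller.

```python
class PID:
    """Textbook PID controller; gains used below are illustrative, not tuned values."""

    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error: float, dt: float) -> float:
        # Accumulate the integral term and estimate the derivative from the last error.
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def track_speed(target_mps: float, seconds: float = 60.0, dt: float = 0.1) -> float:
    """Simulate a toy vehicle whose acceleration equals the controller output."""
    pid = PID(kp=0.8, ki=0.1, kd=0.0)
    speed = 0.0
    for _ in range(int(seconds / dt)):
        accel = pid.step(target_mps - speed, dt)
        speed += accel * dt
    return speed


final_speed = track_speed(15.0)
print(round(final_speed, 2))
```

Because the controller is a few lines of arithmetic with fixed gains, its behavior can be analyzed and tested independently of the planner, which is the separability the table refers to.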

The modular design has a fundamental structural advantage: it creates natural audit points. When a Waymo vehicle makes an unexpected decision, engineers can inspect the perception layer output, verify that the correct objects were detected, then inspect the prediction layer to see what trajectories were forecast for each agent, then inspect the planning layer to understand which plan was selected and why. This is interpretability by architecture — not a feature added on top, but built into the system’s fundamental design.
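The audit-point idea can be sketched in a few lines. The example below is a toy, not Waymo's software: `AuditedPipeline`, `perceive`, `predict`, and `plan` are hypothetical names, and the stage logic is deliberately trivial. What it shows is the structural property described above, namely that every stage's output is stored and can be inspected after the fact.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class AuditedPipeline:
    """Runs stages in order and keeps every intermediate output for inspection."""
    stages: list[tuple[str, Callable[[Any], Any]]]
    trace: dict[str, Any] = field(default_factory=dict)

    def run(self, sensor_frame: Any) -> Any:
        self.trace = {"input": sensor_frame}
        data = sensor_frame
        for name, stage in self.stages:
            data = stage(data)
            self.trace[name] = data  # audit point: output of this module
        return data


# Hypothetical stand-ins for the real modules (toy logic only).
def perceive(frame):
    return [{"kind": kind, "x": x} for kind, x in frame["detections"]]


def predict(objects):
    # Assumption: every object moves 1 m closer to the ego vehicle per step.
    return [{**obj, "future_x": obj["x"] - 1.0} for obj in objects]


def plan(forecasts):
    too_close = any(f["future_x"] < 5.0 for f in forecasts)
    return "brake" if too_close else "cruise"


pipeline = AuditedPipeline(
    [("perception", perceive), ("prediction", predict), ("planning", plan)]
)
decision = pipeline.run({"detections": [("pedestrian", 5.5), ("car", 30.0)]})
print(decision)
print(pipeline.trace["prediction"])
```

If the decision looks wrong, an engineer can read `pipeline.trace` stage by stage to see whether the fault lies in detection, forecasting, or plan selection.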

The safety monitor is perhaps the most consequential single component in Waymo’s stack. It is explicitly not a neural network. It is a rule-based system designed to be provably correct for a defined set of safety-critical conditions. The safety monitor can override every other module — including the planner and the controller — and bring the vehicle to a safe stop if any of its conditions are triggered. This separation of the safety-critical override from the performance-driving neural components is the engineering manifestation of a core principle: for certification and regulatory approval, some behaviors must be guaranteed, not merely probable.
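A rule-based override of this kind might look like the sketch below. The state fields, thresholds, and function names are assumptions chosen for illustration and are not taken from any Waymo disclosure. The monitor is a small pure function with no learned parameters, which is what makes its behavior checkable case by case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VehicleState:
    speed_mps: float
    gap_to_obstacle_m: float
    sensors_healthy: bool


# Hypothetical thresholds chosen for illustration only.
MAX_DECEL_MPS2 = 6.0
MARGIN_M = 2.0


def stopping_distance(speed_mps: float) -> float:
    """Distance needed to stop at constant deceleration: v^2 / (2a)."""
    return speed_mps ** 2 / (2.0 * MAX_DECEL_MPS2)


def safety_monitor(state: VehicleState, planned_command: str) -> str:
    """Return the planner's command unless a hard rule forces a safe stop."""
    if not state.sensors_healthy:
        return "safe_stop"
    if state.gap_to_obstacle_m < stopping_distance(state.speed_mps) + MARGIN_M:
        return "safe_stop"
    return planned_command


clear_road = VehicleState(speed_mps=12.0, gap_to_obstacle_m=50.0, sensors_healthy=True)
tight_gap = VehicleState(speed_mps=12.0, gap_to_obstacle_m=10.0, sensors_healthy=True)
print(safety_monitor(clear_road, "cruise"))
print(safety_monitor(tight_gap, "cruise"))
```

At 12 m/s the assumed stopping distance is 12 m, so with the 2 m margin a 10 m gap triggers the override while a 50 m gap does not.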

Waymo’s MultiPath++ trajectory prediction model, published in peer-reviewed research, is a good example of the depth of the modular approach. MultiPath++ does not predict a single future trajectory for each agent; it predicts a probability distribution over possible trajectories, quantifying the uncertainty in each prediction. The planner downstream can then reason explicitly about risk — choosing plans that are safe across the full distribution of predicted futures, not just the most likely one. This probabilistic risk-awareness is difficult to achieve in a purely end-to-end system.
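To illustrate why a distribution over trajectories matters to the planner, here is a minimal sketch. It is not MultiPath++ and not Waymo's planner: the two pedestrian modes, their probabilities, the candidate plans, and the 1.5 m collision radius are all invented for the example. Each plan is scored by the total probability of the predicted modes it would collide with.

```python
import math

# Each predicted mode: (probability, [(x, y) per timestep]) for one pedestrian.
pedestrian_modes = [
    (0.7, [(10.0, 5.0), (10.0, 4.0), (10.0, 3.0)]),  # stays near the kerb
    (0.3, [(10.0, 3.0), (10.0, 1.0), (10.0, 0.0)]),  # steps into the lane
]

# Candidate ego plans along y = 0, sampled at the same three timesteps.
plans = {
    "keep_speed": [(4.0, 0.0), (8.0, 0.0), (10.0, 0.0)],
    "slow_down": [(3.0, 0.0), (5.0, 0.0), (6.0, 0.0)],
}


def collision_probability(plan, modes, radius=1.5):
    """Sum the probability of every mode that comes within `radius` of the plan."""
    risk = 0.0
    for prob, trajectory in modes:
        if any(math.dist(p, q) < radius for p, q in zip(plan, trajectory)):
            risk += prob
    return risk


risks = {name: collision_probability(p, pedestrian_modes) for name, p in plans.items()}
best = min(risks, key=risks.get)

# A planner that only looks at the single most likely mode sees no risk at all.
most_likely = max(pedestrian_modes, key=lambda mode: mode[0])
naive_risk = collision_probability(plans["keep_speed"], [most_likely])

print(risks, best, naive_risk)
```

In this toy case the most likely future says keeping speed is safe, while the full distribution assigns it a 0.3 collision probability and favors slowing down.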

Section 2 — Tesla’s End-to-End Stack (FSD v12+)

Tesla’s Full Self-Driving (FSD) version 12 represented a fundamental architectural shift: from a modular system (which early FSD versions used) to an end-to-end neural network. In FSD v12 and beyond, raw video from Tesla’s 8 cameras flows into a neural network that directly outputs a driving plan — no explicit object detection, no explicit trajectory prediction, no hand-coded rules in the critical path. The neural network learns to drive by imitating human drivers from a dataset of billions of miles (est.) of intervention-logged video.
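The imitation-learning idea can be reduced to a toy example. The sketch below fits a one-parameter linear policy to a handful of invented (lane offset, human steering) pairs by gradient descent on squared error. A production system would use a large network on video tokens rather than a single weight, so this is only the training objective in miniature; the data and the `train_policy` helper are assumptions.

```python
# Toy logged data: (lateral offset from lane centre in m, human steering command).
human_log = [(-1.0, 0.5), (-0.5, 0.25), (0.0, 0.0), (0.5, -0.25), (1.0, -0.5)]


def train_policy(log, lr=0.1, epochs=200):
    """Fit steer = w * offset by gradient descent on mean squared error."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in log) / len(log)
        w -= lr * grad
    return w


w = train_policy(human_log)


def policy(offset_m: float) -> float:
    """Learned policy: imitates how the logged human steered at this offset."""
    return w * offset_m


print(round(w, 3), round(policy(0.8), 3))
```

The policy never receives a rule such as "steer back toward the lane centre"; it recovers that behavior purely from the demonstrations, which is the property the end-to-end approach relies on at scale.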

Component	What it does	Technology	Key advantage
Video tokenizer	Converts 8-camera video feeds into tokens a neural net can process	Tesla’s custom video tokenizer; similar to Vision Transformer (ViT) concept	Processes spatial + temporal context simultaneously; no hand-coded object detection
End-to-end neural network	Takes tokenized video (past + present frames) and directly outputs a driving plan (trajectory + velocity profile)	Transformer architecture; trained on 6M+ vehicle fleet data; no intermediate structured representation	Learns driving behaviors engineers could not explicitly code; handles long-tail scenarios via training data scale
Occupancy Network	Predicts 3D occupancy of space around the vehicle (what volumes are occupied and likely to be occupied in the future)	Neural occupancy prediction; replaces traditional object detection + tracking	Handles objects that do not fit predefined categories (trash bags, unusual vehicles)
Auto-labeling pipeline	Automatically labels fleet video for training (avoids need for human annotators at scale)	Neural labeling models; human review for edge cases	Scales to billions of miles without proportional human annotation cost
No HD maps	FSD v12+ does not require pre-built HD maps of the road	Vision-based localization against real-time camera observations	Works in cities Waymo has not mapped; scales geographically without map maintenance cost
Dojo training cluster	Trains the end-to-end model at scale	Tesla’s custom D1 chips, ExaPOD clusters (1+ ExaFLOP est.)	Potentially lower training cost per model update than rented H100 clusters (est.)
Intervention-based learning	Driver interventions (taking over from FSD) are logged as training signal for edge cases	Supervised learning on human corrections	6M+ fleet generates enormous volume of intervention data

The Occupancy Network deserves particular attention. Traditional modular perception systems work by detecting objects from a predefined taxonomy: cars, pedestrians, cyclists, trucks. An object that does not fit the taxonomy can be missed or misclassified. Tesla’s Occupancy Network sidesteps this: rather than asking “what object is this?”, it asks “is this volume of space occupied and is it likely to move?” A trash bag blowing across the road, a mattress that fell off a truck, an unusual construction vehicle: all of these can be handled by occupancy prediction, even without a specific category for them. This is a genuine capability advantage over taxonomy-based perception.
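The difference between "what is it?" and "is this space occupied?" can be made concrete with a small sketch. The voxel size, the sparse-set representation, and the helper names below are assumptions for illustration and say nothing about how Tesla's network is implemented. The planner-side check only asks whether any cell along a path is occupied, with no object label involved.

```python
# Sparse voxel grid: a set of occupied (ix, iy, iz) cells, 0.5 m per side (assumed).
VOXEL_M = 0.5


def to_voxel(x: float, y: float, z: float) -> tuple[int, int, int]:
    """Map a point in metres to the integer index of the voxel containing it."""
    return (int(x // VOXEL_M), int(y // VOXEL_M), int(z // VOXEL_M))


# An unclassified object lying in the lane, e.g. debris with no category.
occupied = {to_voxel(12.2, 0.1, 0.3)}


def path_is_free(waypoints, occupied_cells, height_m=1.5):
    """True if no voxel along the path, up to vehicle height, is occupied."""
    for x, y in waypoints:
        z = 0.0
        while z < height_m:
            if to_voxel(x, y, z) in occupied_cells:
                return False
            z += VOXEL_M
    return True


straight = [(float(x), 0.0) for x in range(20)]
shifted = [(float(x), 2.0) for x in range(20)]
print(path_is_free(straight, occupied), path_is_free(shifted, occupied))
```

The straight path is rejected because it passes through the occupied cell, and the laterally shifted path is accepted, even though nothing in the code knows what the object is.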

The fleet data flywheel is the most consequential structural advantage in Tesla’s approach. With more than 6 million vehicles on the road generating continuous video, Tesla accumulates a very large supply of driving data, including the rare corner-case scenarios that are hardest to encounter with a small fleet. When a driver takes over from FSD at a difficult intersection at night in rain, that intervention is logged, the video is labeled (automatically via the auto-labeling pipeline), and the correction becomes training data. A later model update can then incorporate that edge case. Waymo’s much smaller fleet (thousands of vehicles rather than millions, est.) cannot generate comparable edge-case coverage from real-world data volume alone, which is why Waymo invests heavily in simulation to cover scenarios its real-world fleet has not encountered.
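The intervention-to-training-data step might be sketched as follows. The `DriveEvent` fields, the disagreement threshold, and `build_training_set` are hypothetical; the real pipeline works on video clips and auto-generated labels rather than a single steering value. The sketch only shows the filtering logic: keep the moments where a human took over and did something clearly different from the model.

```python
from dataclasses import dataclass


@dataclass
class DriveEvent:
    clip_id: str
    model_steer: float      # what the driving model commanded
    human_steer: float      # what the driver actually did
    driver_took_over: bool


def build_training_set(events, min_disagreement=0.1):
    """Keep takeover clips where the human's action clearly differed from the model's."""
    samples = []
    for event in events:
        disagreement = abs(event.human_steer - event.model_steer)
        if event.driver_took_over and disagreement >= min_disagreement:
            samples.append({"clip_id": event.clip_id, "target_steer": event.human_steer})
    return samples


events = [
    DriveEvent("clip-001", model_steer=0.00, human_steer=0.02, driver_took_over=False),
    DriveEvent("clip-002", model_steer=0.30, human_steer=-0.20, driver_took_over=True),
    DriveEvent("clip-003", model_steer=0.10, human_steer=0.12, driver_took_over=True),
]
training_set = build_training_set(events)
print(training_set)
```

Only `clip-002` survives the filter: `clip-001` had no takeover, and in `clip-003` the human's correction was too small to count as a disagreement under the assumed threshold.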

Section 3 — Architecture Comparison: Modular vs End-to-End

Dimension	Waymo (modular)	Tesla (end-to-end)	Verdict
Interpretability	High — each module has inspectable outputs; engineers diagnose failures precisely	Low — “why did it turn left?” is difficult to answer from the neural net’s internal state	Waymo advantage for debugging and regulatory explanation
Certifiability	High — rule-based safety layers, separable modules, formal verification possible for components	Low — certifying a black-box neural net is an open research problem	Waymo advantage for formal safety cases
Scalability (geographic)	Lower — requires HD map per city (time + cost per city); sensor suite expensive per vehicle	Higher — camera-only, mapless FSD works in any city with roads	Tesla advantage for geographic scale
Scalability (edge cases)	Lower — modular systems require explicit engineering for new edge-case categories	Higher — end-to-end learns new behaviors from training data; edge cases handled implicitly at scale	Tesla advantage if fleet data is sufficient
Development speed	Slower — changing one module requires validating interactions with all others	Faster — retrain the whole model; improvements show up automatically	Tesla advantage for iteration speed
Failure modes	Predictable — each module has defined failure modes; safety monitor catches module failures	Less predictable — novel input distributions can cause unexpected outputs	Waymo advantage; critical for safety
Sensor cost	High — lidar + camera + radar per vehicle; $5,000-15,000+ sensor cost (est.)	Low — cameras only; hardware cost minimal	Tesla cost advantage
Map maintenance cost	High — continuous map updates required per city	Zero — no maps to maintain	Tesla advantage at scale
Current state of the art	Waymo’s modular system is the proven driverless commercial approach today	Tesla FSD v12/v13 end-to-end is the fastest-improving supervised driving system today	Both are state of the art in their respective deployment regimes

The most important dimension in this comparison is failure modes. Waymo’s modular architecture produces predictable, diagnosable failures: if the perception module misclassifies an object, engineers can observe the misclassification in the intermediate output and fix the specific model responsible. Tesla’s end-to-end architecture produces less predictable failures: because there are no intermediate structured representations, a model encountering a novel input distribution may produce an output that is surprising in ways that are difficult to anticipate or diagnose without extensive testing. This is not a hypothetical concern — it is the central challenge of deploying neural systems in safety-critical applications, and it is the core reason that no end-to-end neural network has yet received regulatory approval for fully driverless commercial service.

The geographic scalability dimension is equally critical, but for commercial rather than safety reasons. Waymo’s requirement to build and maintain HD maps of every city it operates in is not merely a cost burden — it is a geographic bottleneck. Waymo has mapped a limited set of cities (San Francisco, Phoenix, Los Angeles, Austin, and a small number of others). Expanding to a new city requires mapping, validating the map, and often obtaining local regulatory approval before commercial operations can begin. Tesla’s mapless approach means FSD v12 can, in principle, operate in any city with roads without a city-specific preparation phase. For a global consumer vehicle company with vehicles in 50+ countries, this is a structural advantage with enormous commercial implications.

Section 4 — The Convergence Thesis

The most important insight from observing both architectures in 2025-2026 is that they are converging. Neither pure modular nor pure end-to-end appears to be the long-run answer. Both companies are adding elements of the opposite architecture to their own.

Trend	Evidence	Implication
Industry convergence toward end-to-end	Waymo, Mobileye, and other modular-stack companies are adding neural end-to-end components to their modular pipelines (hybrid approach)	End-to-end may be the long-run winner; modular companies are hedging toward it
Tesla adding structured outputs	Tesla’s Occupancy Network and lane prediction add structure on top of the end-to-end output — partial convergence toward modular concepts	Hybrid architectures may outperform pure versions of either
Academic consensus shifting	Papers from major AV research groups increasingly use end-to-end architectures; Waymo’s own research papers show end-to-end experiments	Academic momentum is with end-to-end, which eventually flows into industry
LLM-based planning emerging	Wayve and research groups at several major labs are experimenting with Large Language Models as planners (reading scene descriptions and outputting driving plans)	LLM planners could supersede both modular and traditional end-to-end stacks; Waymo and Tesla are both researching the approach
Imitation vs reinforcement learning	Current end-to-end systems (including Tesla) are primarily imitation learning (copy human drivers); RL-trained systems can exceed human behavior	Tesla and Waymo both exploring RL; RL may be the next step change

Waymo’s addition of end-to-end neural components to its modular pipeline is the most significant architectural signal of the past two years. Waymo has published research on end-to-end driving experiments and has described a hybrid approach in which end-to-end components handle the “interesting” parts of driving while rule-based components maintain the safety constraints. This is not an admission that modular was wrong — it is an acknowledgment that the long run may favor a hybrid that combines the interpretability and safety guarantees of modular with the learning efficiency and edge-case coverage of end-to-end.
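One way to picture such a hybrid is a learned component that proposes and ranks plans, with a rule-based layer that vetoes any proposal breaking a hard constraint. The sketch below is an assumed structure, not a description of Waymo's implementation: the `Proposal` fields, the 1.0 m gap threshold, and the fallback to a safe stop are illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    name: str
    learned_score: float          # higher = preferred by the learned planner
    crosses_double_yellow: bool
    min_gap_m: float              # closest predicted approach to any agent


def violates_hard_rules(p: Proposal) -> bool:
    """Rule-based layer: constraints the learned score cannot override (assumed thresholds)."""
    return p.crosses_double_yellow or p.min_gap_m < 1.0


def select(proposals) -> str:
    """Pick the best-scoring proposal among those that pass every hard rule."""
    allowed = [p for p in proposals if not violates_hard_rules(p)]
    if not allowed:
        return "safe_stop"
    return max(allowed, key=lambda p: p.learned_score).name


proposals = [
    Proposal("overtake_across_line", 0.9, crosses_double_yellow=True, min_gap_m=3.0),
    Proposal("squeeze_past", 0.7, crosses_double_yellow=False, min_gap_m=0.6),
    Proposal("wait_behind", 0.4, crosses_double_yellow=False, min_gap_m=4.0),
]
print(select(proposals))
```

The learned planner prefers the overtake, but the rule layer removes it and the tight squeeze, leaving the lowest-scoring yet compliant plan. The learned part supplies flexibility and the rules supply the guarantee.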

Tesla’s addition of the Occupancy Network and structured lane predictions represents the mirror movement: adding structure to what was designed to be a pure end-to-end system. The Occupancy Network provides a partially interpretable intermediate representation — engineers can inspect the 3D occupancy map to understand what the model “sees” even without traditional object categories. This is a step toward the interpretability that modular systems provide by design.

The emergence of LLM-based planners is the wildcard that could render both the current modular and end-to-end paradigms obsolete. Systems like Wayve’s LINGO-2, a vision-language-action driving model, couple driving with natural language: the model takes camera input and produces both a driving action and a natural-language commentary on what it is doing. This approach potentially combines the interpretability of natural language (you can ask the planner why it made a decision and receive a human-readable answer) with the generalization of foundation models trained on internet-scale data. Neither Waymo nor Tesla has yet deployed an LLM-based planner in a production commercial system, but both are actively researching the approach.

Section 5 — Software Stack Benchmark Scorecard

Dimension	Waymo	Tesla	Edge
Current driverless reliability	Proven — 150,000+ rides per week, 10M+ driverless miles (est.)	Not yet driverless (supervised FSD only)	Waymo
Interpretability and debuggability	High (modular)	Low (end-to-end black box)	Waymo
Geographic scalability	Lower (HD maps required per city)	Higher (mapless FSD)	Tesla
Edge case learning speed	Slower (needs engineering + retraining)	Faster (fleet data, then retrain, then deploy)	Tesla
Regulatory certifiability	Higher (rule-based layers, inspectable modules)	Lower (neural net certification unsolved)	Waymo
Sensor cost per vehicle	High (~$5K-15K lidar+camera+radar est.)	Low (cameras only)	Tesla
Architecture trajectory	Converging toward hybrid (adding end-to-end components)	Converging toward hybrid (adding structured outputs)	Tie — both heading toward hybrid
Long-term winner	Uncertain — modular wins on safety explainability; end-to-end wins on scalability; hybrid may be the answer	—	Open question; the most important unresolved debate in Physical AI

The scorecard reveals a fundamental tension that the AV industry has not yet resolved. Waymo is ahead on every dimension that matters most for safety certification and regulatory approval today — interpretability, certifiability, predictable failure modes, and proven commercial driverless operation. Tesla is ahead on every dimension that matters most for commercial scale at speed — geographic reach, edge-case learning, iteration velocity, and hardware cost per vehicle.
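The split can be checked by tallying the Edge column of the scorecard above. The snippet simply transcribes that column; the "Open" label for the final row is shorthand introduced here for "open question".

```python
from collections import Counter

# "Edge" column of the scorecard above, one entry per row.
edges = {
    "Current driverless reliability": "Waymo",
    "Interpretability and debuggability": "Waymo",
    "Geographic scalability": "Tesla",
    "Edge case learning speed": "Tesla",
    "Regulatory certifiability": "Waymo",
    "Sensor cost per vehicle": "Tesla",
    "Architecture trajectory": "Tie",
    "Long-term winner": "Open",
}

tally = Counter(edges.values())
print(tally)
```

The count comes out three rows each for Waymo and Tesla, with the remaining two unresolved. The two companies' wins fall on different kinds of dimension rather than one company leading overall.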

The technology trajectory suggests that these advantages will converge: as hybrid architectures mature, end-to-end systems will become more interpretable through auxiliary structured outputs and better explanation tools; as neural certification methods advance, the regulatory advantage of rule-based safety layers may shrink. The open question is timing — whether interpretable end-to-end systems will be certifiable before Waymo’s modular system can match Tesla’s fleet-data learning efficiency.

One final observation: the entire framing of “Waymo vs Tesla” obscures the fact that both companies may be pointing toward the same destination via different routes. Waymo’s hybrid experiments and Tesla’s structured outputs are not divergence — they are convergence. The AV architecture debate of the 2020s may ultimately be remembered not as a battle between two irreconcilable paradigms, but as the decade in which the industry learned what a hybrid architecture needed to look like by building the extremes first and discovering what each was missing.

Note: All figures labeled “(est.)” are derived from public disclosures, research publications, analyst estimates, and industry reports as of mid-2026. This article does not constitute investment advice.