AI-Daily-Builder

2026-06-18

Physical AI Software Stack — Waymo's Modular Pipeline vs Tesla's End-to-End Neural Network: The Most Consequential AV Architecture Debate

Waymo uses a modular pipeline with interpretable layers; Tesla bets on end-to-end neural nets trained on video from a 6M-vehicle fleet; both are converging toward hybrid architectures.

Article 136 in the Physical AI Benchmark Series — Physical AI Software Stack Architecture: Waymo’s Modular Pipeline vs Tesla’s End-to-End Neural Network, and Why the Stack Choice Is the Most Consequential Technical Decision in AV History

The biggest unresolved debate in autonomous vehicle engineering is not about sensors, maps, or cities; it is about architecture. Should you build a modular pipeline, where separate models handle perception, prediction, and planning with interpretable intermediate outputs at every stage? Or should you build an end-to-end neural network, where raw sensor data flows directly into a network that outputs steering, throttle, and braking commands, trained entirely on video from a real-world fleet? Waymo chose modular. Tesla chose end-to-end. The choice is more than a technical preference: it shapes each company’s safety philosophy, regulatory posture, and debugging capability, and ultimately how fast and how far geographically each can scale.

All figures labeled “(est.)” are derived from public disclosures, research publications, industry analyst estimates, and reasonable inference rather than independently verified primary data.


Section 1 — Waymo’s Modular Stack

Waymo’s software architecture is a layered modular pipeline. Each layer receives the output of the layer below it, processes it using one or more specialized neural networks or rule-based systems, and passes a structured representation upward. The design philosophy is rooted in classical software engineering: separate concerns, test each component independently, and ensure that any failure can be diagnosed at the module level.

| Module | What it does | Technology | Key advantage |
| --- | --- | --- | --- |
| Perception | Takes raw sensor data (lidar + camera + radar) and produces a structured world representation: vehicles, pedestrians, cyclists, road markings, traffic signals | Multiple specialized neural networks (one per object class per sensor); sensor fusion combines outputs | Each perception model is individually testable, validatable, and updatable; safety engineers inspect intermediate outputs |
| Prediction | Takes the structured world model from Perception and predicts future trajectories for all agents (where will that pedestrian walk? what will that car do?) | MultiPath++ (Waymo’s published trajectory prediction model); outputs probability distributions over future states | Probabilistic outputs make uncertainty explicit; planners can be risk-aware |
| Planning | Takes predicted trajectories and produces a safe, comfortable driving plan for the Waymo vehicle | MotionCNN + behavior cloning + rule-based safety layers; multiple competing plans generated and scored | Rule-based safety layer = hard constraints the neural net cannot violate (e.g., never cross double yellow) |
| Control | Converts the planning output into precise steering, throttle, and braking commands | Traditional control theory (PID controllers); separable from planning | Predictable, certifiable, inspectable by regulators |
| HD Map | Provides prior knowledge of road structure, lane geometry, traffic signal locations | Waymo’s proprietary HD maps (updated continuously via fleet) | Reduces perceptual uncertainty; lidar can localize against map with centimeter precision |
| Simulation | Tests each module and the full stack in synthetic environments before deployment | Waymo’s Simulation City; NeRF-based scene reconstruction | 1 real mile generates 1,000+ simulated variations (est.) |
| Safety monitor | Independent watchdog that can override all other modules and bring vehicle to safe stop | Rule-based; not neural; designed to be provably correct | Ultimate safety backstop; key to regulatory confidence |
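
The Control row above names PID controllers. As a rough illustration of what that means, here is a minimal textbook PID loop tracking a target speed. The gains, the time step, and the toy plant (acceleration equals the controller output) are all assumptions chosen for the example; nothing here reflects Waymo’s actual controller.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook discrete PID controller. Gains are illustrative, not tuned values."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, target: float, measured: float, dt: float) -> float:
        error = target - measured
        self.integral += error * dt                   # accumulated error (I term)
        derivative = (error - self.prev_error) / dt   # error rate of change (D term)
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate_speed_tracking(target_mps: float, steps: int = 300, dt: float = 0.1) -> float:
    """Track a target speed with a toy plant where acceleration equals the PID output."""
    pid = PID(kp=0.8, ki=0.3, kd=0.05)  # hypothetical gains
    speed = 0.0
    for _ in range(steps):
        accel = pid.step(target_mps, speed, dt)
        speed += accel * dt
    return speed
```

Calling `simulate_speed_tracking(15.0)` settles close to 15 m/s after 30 simulated seconds. The appeal for certification is that the whole controller is three terms whose behavior can be analyzed with standard control theory.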

The modular design has a fundamental structural advantage: it creates natural audit points. When a Waymo vehicle makes an unexpected decision, engineers can inspect the perception layer output, verify that the correct objects were detected, then inspect the prediction layer to see what trajectories were forecast for each agent, then inspect the planning layer to understand which plan was selected and why. This is interpretability by architecture — not a feature added on top, but built into the system’s fundamental design.
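
The audit-point idea can be sketched in a few lines. This is a toy pipeline, not Waymo’s software: the stage functions, the 2 m closing assumption, and the 10 m braking threshold are hypothetical stand-ins. What matters is that every stage writes its intermediate output to a log that can be inspected after the fact.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AuditLog:
    """Records each stage's intermediate output so a decision can be traced later."""
    records: List[Tuple[str, object]] = field(default_factory=list)

    def record(self, stage: str, output: object) -> None:
        self.records.append((stage, output))


def perceive(sensor_frame: Dict[str, float]) -> List[Dict[str, object]]:
    # Toy stand-in for perception: one pedestrian at the reported range.
    return [{"kind": "pedestrian", "distance_m": sensor_frame["range_m"]}]


def predict(objects: List[Dict[str, object]]) -> List[Dict[str, object]]:
    # Toy stand-in for prediction: assume each agent closes 2 m over the horizon.
    return [{**o, "future_distance_m": o["distance_m"] - 2.0} for o in objects]


def plan(forecasts: List[Dict[str, object]]) -> str:
    # Toy planning rule: brake if any agent is forecast within 10 m.
    too_close = any(f["future_distance_m"] < 10.0 for f in forecasts)
    return "brake" if too_close else "cruise"


def run_pipeline(sensor_frame: Dict[str, float], log: AuditLog) -> str:
    objects = perceive(sensor_frame)
    log.record("perception", objects)
    forecasts = predict(objects)
    log.record("prediction", forecasts)
    decision = plan(forecasts)
    log.record("planning", decision)
    return decision
```

After `run_pipeline({"range_m": 11.0}, log)` returns `"brake"`, the entries in `log.records` show which detection and which forecast led to that decision, one stage at a time.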

The safety monitor is perhaps the most consequential single component in Waymo’s stack. It is explicitly not a neural network. It is a rule-based system designed to be provably correct for a defined set of safety-critical conditions. The safety monitor can override every other module — including the planner and the controller — and bring the vehicle to a safe stop if any of its conditions are triggered. This separation of the safety-critical override from the performance-driving neural components is the engineering manifestation of a core principle: for certification and regulatory approval, some behaviors must be guaranteed, not merely probable.
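
A minimal sketch of a rule-based override, under stated assumptions: the two rules, the 6 m/s² deceleration, and the 2 m margin are hypothetical, and a real safe stop is a controlled maneuver, not simply full braking. The point is structural: the monitor is a short, deterministic function that sits after every other module and can be reasoned about exhaustively.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VehicleState:
    speed_mps: float
    nearest_obstacle_m: float
    sensors_healthy: bool


@dataclass(frozen=True)
class Command:
    throttle: float  # 0.0 to 1.0
    brake: float     # 0.0 to 1.0
    source: str


SAFE_STOP = Command(throttle=0.0, brake=1.0, source="safety_monitor")


def stopping_distance_m(speed_mps: float, decel_mps2: float = 6.0) -> float:
    """Kinematic stopping distance v^2 / (2a); the deceleration is an assumed value."""
    return speed_mps ** 2 / (2.0 * decel_mps2)


def safety_monitor(state: VehicleState, proposed: Command, margin_m: float = 2.0) -> Command:
    """Pass the planner's command through unless a hard rule is violated."""
    if not state.sensors_healthy:
        return SAFE_STOP
    if state.nearest_obstacle_m < stopping_distance_m(state.speed_mps) + margin_m:
        return SAFE_STOP
    return proposed
```

At 12 m/s the assumed stopping distance is 12 m, so with the 2 m margin any obstacle closer than 14 m triggers the override regardless of what the planner proposed.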

Waymo’s MultiPath++ trajectory prediction model, published in peer-reviewed research, is a good example of the depth of the modular approach. MultiPath++ does not predict a single future trajectory for each agent; it predicts a probability distribution over possible trajectories, quantifying the uncertainty in each prediction. The planner downstream can then reason explicitly about risk — choosing plans that are safe across the full distribution of predicted futures, not just the most likely one. This probabilistic risk-awareness is difficult to achieve in a purely end-to-end system.
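
To make the risk-aware step concrete, the sketch below scores candidate ego plans against a small set of weighted trajectory modes for one pedestrian. It does not implement MultiPath++; it only shows how a planner might consume a multimodal prediction. The mode probabilities, the 2 m gap, and the plan names are assumptions for illustration.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]            # one (x, y) position per timestep
Mode = Tuple[float, Trajectory]     # (probability, predicted trajectory)


def conflicts(ego: Trajectory, agent: Trajectory, min_gap_m: float = 2.0) -> bool:
    """True if ego and agent come closer than min_gap_m at the same timestep."""
    return any(math.dist(e, a) < min_gap_m for e, a in zip(ego, agent))


def collision_risk(ego: Trajectory, modes: List[Mode]) -> float:
    """Total probability mass of the predicted modes that conflict with the ego plan."""
    return sum(p for p, traj in modes if conflicts(ego, traj))


def pick_plan(plans: Dict[str, Trajectory], modes: List[Mode]) -> Tuple[str, Dict[str, float]]:
    """Choose the plan with the lowest risk across the whole predicted distribution."""
    risks = {name: collision_risk(traj, modes) for name, traj in plans.items()}
    return min(risks, key=risks.get), risks


# Pedestrian: 80% stays on the sidewalk, 20% steps into the lane (assumed numbers).
pedestrian_modes: List[Mode] = [
    (0.8, [(5.0, 3.0), (5.0, 3.0), (5.0, 3.0)]),
    (0.2, [(5.0, 3.0), (5.0, 1.5), (5.0, 0.0)]),
]
ego_plans: Dict[str, Trajectory] = {
    "proceed": [(0.0, 0.0), (2.5, 0.0), (5.0, 0.0)],
    "yield": [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
}
best_plan, plan_risks = pick_plan(ego_plans, pedestrian_modes)
```

A planner that looked only at the most likely mode would see no conflict for either plan. Scoring against the full distribution exposes the 0.2 risk on `"proceed"` and selects `"yield"`.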


Section 2 — Tesla’s End-to-End Stack (FSD v12+)

Tesla’s Full Self-Driving (FSD) version 12 represented a fundamental architectural shift: from a modular system (which early FSD versions used) to an end-to-end neural network. In FSD v12 and beyond, raw video from Tesla’s 8 cameras flows into a neural network that directly outputs a driving plan — no explicit object detection, no explicit trajectory prediction, no hand-coded rules in the critical path. The neural network learns to drive by imitating human drivers from a dataset of billions of miles (est.) of intervention-logged video.
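
The training principle described here, imitation of human drivers, is supervised learning on (observation, human action) pairs. The sketch below shrinks it to the smallest possible case: a one-feature linear policy fit by least squares. Tesla’s system is a large video network, not a line fit; the feature (lateral lane offset) and the demonstration values are hypothetical.

```python
from typing import List, Tuple


def fit_linear_policy(observations: List[float], actions: List[float]) -> Tuple[float, float]:
    """Least-squares fit of action = w * observation + b (one-feature behavior cloning)."""
    n = len(observations)
    mean_x = sum(observations) / n
    mean_y = sum(actions) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(observations, actions))
    var = sum((x - mean_x) ** 2 for x in observations)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b


def policy(lane_offset_m: float, w: float, b: float) -> float:
    """Cloned policy: predict the steering a human would have applied."""
    return w * lane_offset_m + b


# Hypothetical demonstrations: humans steer back toward the lane center.
lane_offsets_m = [-1.0, -0.5, 0.0, 0.5, 1.0]
human_steering = [0.5, 0.25, 0.0, -0.25, -0.5]
w, b = fit_linear_policy(lane_offsets_m, human_steering)
```

The fitted policy reproduces the demonstrated behavior, and nothing more: it has no notion of why the humans steered that way. That limit is what separates imitation from approaches that optimize an explicit objective.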

| Component | What it does | Technology | Key advantage |
| --- | --- | --- | --- |
| Video tokenizer | Converts 8-camera video feeds into tokens a neural net can process | Tesla’s custom video tokenizer; similar to Vision Transformer (ViT) concept | Processes spatial + temporal context simultaneously; no hand-coded object detection |
| End-to-end neural network | Takes tokenized video (past + present frames) and directly outputs a driving plan (trajectory + velocity profile) | Transformer architecture; trained on 6M+ vehicle fleet data; no intermediate structured representation | Learns driving behaviors engineers could not explicitly code; handles long-tail scenarios via training data scale |
| Occupancy Network | Predicts 3D occupancy of space around the vehicle (what volumes are occupied and likely to be occupied in the future) | Neural occupancy prediction; replaces traditional object detection + tracking | Handles objects that do not fit predefined categories (trash bags, unusual vehicles) |
| Auto-labeling pipeline | Automatically labels fleet video for training (avoids need for human annotators at scale) | Neural labeling models; human review for edge cases | Scales to billions of miles without proportional human annotation cost |
| No HD maps | FSD v12+ does not require pre-built HD maps of the road | Vision-based localization against real-time camera observations | Works in cities Waymo has not mapped; scales geographically without map maintenance cost |
| Dojo training cluster | Trains the end-to-end model at scale | Tesla’s custom D1 chips, ExaPOD clusters (1+ ExaFLOP est.) | Potentially lower training cost per model update than rented H100 clusters (est.) |
| Intervention-based learning | Driver interventions (taking over from FSD) are logged as training signal for edge cases | Supervised learning on human corrections | 6M+ fleet generates enormous volume of intervention data |

The Occupancy Network deserves particular attention. Traditional modular perception systems detect objects from a predefined taxonomy: cars, pedestrians, cyclists, trucks. Anything outside the taxonomy risks being missed or misclassified. Tesla’s Occupancy Network sidesteps the problem: rather than asking “what object is this?”, it asks “is this volume of space occupied, and is it likely to move?” A trash bag blowing across the road, a mattress that fell off a truck, an unusual construction vehicle: occupancy prediction can register all of these as occupied space even without a specific category for them. This is a genuine capability advantage over purely taxonomy-based perception.
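
The contrast between the two questions can be shown with a toy example. The sketch assumes 3D points are already available (Tesla’s network infers occupancy from camera video, which is not modeled here), and the class list, voxel size, and coordinates are hypothetical.

```python
import math
from typing import Dict, Iterable, List, Set, Tuple

Point3 = Tuple[float, float, float]
Voxel = Tuple[int, int, int]

KNOWN_CLASSES = {"car", "pedestrian", "cyclist", "truck"}  # assumed taxonomy


def taxonomy_obstacles(detections: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Taxonomy-style filter: keep only detections with a recognized label."""
    return [d for d in detections if d["label"] in KNOWN_CLASSES]


def voxelize(points: Iterable[Point3], voxel_size_m: float = 0.5) -> Set[Voxel]:
    """Map 3D points to the set of grid cells they fall into."""
    return {
        (math.floor(x / voxel_size_m), math.floor(y / voxel_size_m), math.floor(z / voxel_size_m))
        for x, y, z in points
    }


def path_is_blocked(path: Iterable[Point3], occupied: Set[Voxel], voxel_size_m: float = 0.5) -> bool:
    """Occupancy-style check: is any cell along the path occupied, whatever it is?"""
    return any(v in occupied for v in voxelize(path, voxel_size_m))


# A mattress on the road: it has geometry but no label in the taxonomy.
mattress_points: List[Point3] = [(10.1, 0.2, 0.1), (10.4, 0.3, 0.2)]
detections = [{"label": "unknown", "points": mattress_points}]
ego_path: List[Point3] = [(i * 0.5 + 0.25, 0.25, 0.1) for i in range(30)]

seen_by_taxonomy = taxonomy_obstacles(detections)
blocked = path_is_blocked(ego_path, voxelize(mattress_points))
```

The taxonomy filter returns nothing for the unlabeled object, while the occupancy check reports the path as blocked. The occupancy view trades away the semantic label, so it cannot by itself say whether the occupied volume is a mattress or a child.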

The fleet data flywheel is the most consequential structural advantage in Tesla’s approach. With more than 6 million vehicles on the road generating video, Tesla accumulates far more real-world driving data than any robotaxi fleet, including the rare corner-case scenarios that a small fleet seldom encounters. When a driver takes over from FSD at a difficult intersection at night in rain, that intervention is logged, the video is labeled automatically by the auto-labeling pipeline, and the correction becomes training data that a later model update can incorporate. Waymo’s much smaller fleet (on the order of thousands of vehicles rather than millions, est.) cannot generate comparable edge-case coverage from real-world data volume alone, which is why Waymo invests heavily in simulation to cover scenarios its real-world fleet has not encountered.
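
As a rough sketch of the intervention loop, the code below turns logged takeovers into prioritized training examples. The log schema and the ranking by model-versus-human disagreement are assumptions made for illustration; Tesla has not published its pipeline in this form.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DriveLogEntry:
    timestamp_s: float
    model_steering: float    # what the driving model commanded
    human_steering: float    # what the driver actually did
    driver_took_over: bool   # True while the human is overriding the model


def build_training_examples(log: List[DriveLogEntry]) -> List[Dict[str, float]]:
    """Keep takeover moments, label them with the human action, largest disagreement first."""
    examples = [
        {
            "timestamp_s": e.timestamp_s,
            "label_steering": e.human_steering,
            "disagreement": abs(e.human_steering - e.model_steering),
        }
        for e in log
        if e.driver_took_over
    ]
    return sorted(examples, key=lambda ex: ex["disagreement"], reverse=True)


drive_log = [
    DriveLogEntry(0.0, 0.00, 0.00, False),
    DriveLogEntry(1.0, 0.10, -0.30, True),
    DriveLogEntry(2.0, 0.05, -0.05, True),
]
training_examples = build_training_examples(drive_log)
```

Only the two takeover entries survive, and the one where the human disagreed most with the model comes first. At fleet scale, that ordering is one plausible way to decide which clips deserve labeling and training effort.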


Section 3 — Architecture Comparison: Modular vs End-to-End

| Dimension | Waymo (modular) | Tesla (end-to-end) | Verdict |
| --- | --- | --- | --- |
| Interpretability | High: each module has inspectable outputs; engineers diagnose failures precisely | Low: “why did it turn left?” is difficult to answer from the neural net’s internal state | Waymo advantage for debugging and regulatory explanation |
| Certifiability | High: rule-based safety layers, separable modules, formal verification possible for components | Low: certifying a black-box neural net is an open research problem | Waymo advantage for formal safety cases |
| Scalability (geographic) | Lower: requires HD map per city (time + cost per city); sensor suite expensive per vehicle | Higher: camera-only, mapless FSD works in any city with roads | Tesla advantage for geographic scale |
| Scalability (edge cases) | Lower: modular systems require explicit engineering for new edge-case categories | Higher: end-to-end learns new behaviors from training data; edge cases handled implicitly at scale | Tesla advantage if fleet data is sufficient |
| Development speed | Slower: changing one module requires validating interactions with all others | Faster: retrain the whole model; improvements show up automatically | Tesla advantage for iteration speed |
| Failure modes | Predictable: each module has defined failure modes; safety monitor catches module failures | Less predictable: novel input distributions can cause unexpected outputs | Waymo advantage; critical for safety |
| Sensor cost | High: lidar + camera + radar per vehicle; $5,000-15,000+ sensor cost (est.) | Low: cameras only; hardware cost minimal | Tesla cost advantage |
| Map maintenance cost | High: continuous map updates required per city | Zero: no maps to maintain | Tesla advantage at scale |
| Current state of art | Waymo’s modular system is the proven driverless commercial approach today | Tesla FSD v12/v13 end-to-end is the fastest-improving supervised driving system today | Both are state of art in their respective deployment regimes |

The most important dimension in this comparison is failure modes. Waymo’s modular architecture produces predictable, diagnosable failures: if the perception module misclassifies an object, engineers can observe the misclassification in the intermediate output and fix the specific model responsible. Tesla’s end-to-end architecture produces less predictable failures: because there are no intermediate structured representations, a model encountering a novel input distribution may produce an output that is surprising in ways that are difficult to anticipate or diagnose without extensive testing. This is not a hypothetical concern — it is the central challenge of deploying neural systems in safety-critical applications, and it is the core reason that no end-to-end neural network has yet received regulatory approval for fully driverless commercial service.

The geographic scalability dimension is equally critical, but for commercial rather than safety reasons. Waymo’s requirement to build and maintain HD maps of every city it operates in is not merely a cost burden — it is a geographic bottleneck. Waymo has mapped a limited set of cities (San Francisco, Phoenix, Los Angeles, Austin, and a small number of others). Expanding to a new city requires mapping, validating the map, and often obtaining local regulatory approval before commercial operations can begin. Tesla’s mapless approach means FSD v12 can, in principle, operate in any city with roads without a city-specific preparation phase. For a global consumer vehicle company with vehicles in 50+ countries, this is a structural advantage with enormous commercial implications.


Section 4 — The Convergence Thesis

The most important insight from observing both architectures in 2025-2026 is that they are converging. Neither pure modular nor pure end-to-end appears to be the long-run answer. Both companies are adding elements of the opposite architecture to their own.

| Trend | Evidence | Implication |
| --- | --- | --- |
| Industry convergence toward end-to-end | Waymo, Mobileye, and other modular-stack companies are adding neural end-to-end components to their modular pipelines (hybrid approach) | End-to-end may be the long-run winner; modular companies are hedging toward it |
| Tesla adding structured outputs | Tesla’s Occupancy Network and lane prediction add structure on top of the end-to-end output; partial convergence toward modular concepts | Hybrid architectures may outperform pure versions of either |
| Academic consensus shifting | Papers from major AV research groups increasingly use end-to-end architectures; Waymo’s own research papers show end-to-end experiments | Academic momentum is with end-to-end, which eventually flows into industry |
| LLM-based planning emerging | Companies like Wayve and early experiments at major labs are using Large Language Models as planners (reading scene descriptions and outputting driving plans) | LLM planners could supersede both modular and traditional end-to-end; Waymo and Tesla both experimenting |
| Imitation vs reinforcement learning | Current end-to-end systems (including Tesla) are primarily imitation learning (copy human drivers); RL-trained systems can exceed human behavior | Tesla and Waymo both exploring RL; RL may be the next step change |

Waymo’s addition of end-to-end neural components to its modular pipeline is the most significant architectural signal of the past two years. Waymo has published research on end-to-end driving experiments and has described a hybrid approach in which end-to-end components handle the “interesting” parts of driving while rule-based components maintain the safety constraints. This is not an admission that modular was wrong — it is an acknowledgment that the long run may favor a hybrid that combines the interpretability and safety guarantees of modular with the learning efficiency and edge-case coverage of end-to-end.
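
One way to picture such a hybrid: a learned component proposes and scores candidate plans, and a rule layer removes any candidate that violates a hard constraint before the best remaining one is chosen. The sketch below is an assumption about the general pattern, not a description of Waymo’s implementation; the plan fields, scores, and the 13.4 m/s limit are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Plan:
    name: str
    learned_score: float          # stand-in for a neural planner's preference
    crosses_double_yellow: bool
    max_speed_mps: float


Rule = Callable[[Plan], bool]     # returns True when the plan is allowed


def no_double_yellow(plan: Plan) -> bool:
    return not plan.crosses_double_yellow


def under_speed_limit(limit_mps: float) -> Rule:
    return lambda plan: plan.max_speed_mps <= limit_mps


def select_plan(candidates: List[Plan], rules: List[Rule]) -> Optional[Plan]:
    """Filter by hard rules first, then take the highest learned score among survivors."""
    allowed = [p for p in candidates if all(rule(p) for rule in rules)]
    if not allowed:
        return None  # caller must fall back, for example to a safe stop
    return max(allowed, key=lambda p: p.learned_score)


candidates = [
    Plan("overtake", learned_score=0.9, crosses_double_yellow=True, max_speed_mps=13.0),
    Plan("speed_up", learned_score=0.8, crosses_double_yellow=False, max_speed_mps=20.0),
    Plan("follow", learned_score=0.6, crosses_double_yellow=False, max_speed_mps=12.0),
]
rules = [no_double_yellow, under_speed_limit(13.4)]
chosen = select_plan(candidates, rules)
```

The learned scores prefer `"overtake"`, but the rules remove it and `"speed_up"`, so `"follow"` is chosen. The learned part can be retrained freely while the rule list stays small enough to review line by line.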

Tesla’s addition of the Occupancy Network and structured lane predictions represents the mirror movement: adding structure to what was designed to be a pure end-to-end system. The Occupancy Network provides a partially interpretable intermediate representation — engineers can inspect the 3D occupancy map to understand what the model “sees” even without traditional object categories. This is a step toward the interpretability that modular systems provide by design.

The emergence of LLM-based planners is the wildcard that could render both the current modular and end-to-end paradigms obsolete. Systems like Wayve’s LINGO-2, a vision-language-action model, tie driving behavior to natural language: the model takes camera input, produces a driving action, and can describe in plain language what it is doing and why. (Wayve’s GAIA-1, by contrast, is a generative world model used to synthesize driving video, not a planner.) This approach potentially combines the interpretability of natural language (you can ask the planner why it made a decision and receive a human-readable answer) with the generalization of foundation models trained on internet-scale data. Neither Waymo nor Tesla has yet deployed an LLM-based planner in a production commercial system, but both are actively researching the approach.
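
A language-model planner can be pictured as a function from a scene description to a structured action plus a human-readable reason. The sketch below uses a hard-coded stub in place of a model, so no real LLM or vendor API is involved; the prompt wording, the action vocabulary, and the throttle and brake values are all hypothetical. It also shows one defensive choice: any reply that cannot be parsed or names an unknown action falls back to stopping.

```python
import json
from typing import Callable, Dict

# Hypothetical action vocabulary and the control values each action maps to.
ACTION_TO_COMMAND: Dict[str, Dict[str, float]] = {
    "proceed": {"throttle": 0.3, "brake": 0.0},
    "slow_down": {"throttle": 0.0, "brake": 0.3},
    "stop": {"throttle": 0.0, "brake": 1.0},
}


def build_prompt(scene_description: str) -> str:
    actions = ", ".join(sorted(ACTION_TO_COMMAND))
    return (
        "You are a driving planner. Scene: " + scene_description + " "
        "Reply with a JSON object with keys 'action' (one of: " + actions + ") and 'reason'."
    )


def plan_with_language_model(scene_description: str, model: Callable[[str], str]) -> Dict[str, object]:
    """Ask the model for an action, validate it, and translate it to a control command."""
    reply = model(build_prompt(scene_description))
    try:
        parsed = json.loads(reply)
        action = parsed["action"]
        reason = str(parsed.get("reason", ""))
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        action, reason = "stop", "unparseable reply; defaulting to stop"
    if not isinstance(action, str) or action not in ACTION_TO_COMMAND:
        action, reason = "stop", "unknown action; defaulting to stop"
    return {"action": action, "reason": reason, **ACTION_TO_COMMAND[action]}


def stub_model(prompt: str) -> str:
    # Stand-in for a real language model: keyword matching only.
    if "pedestrian" in prompt:
        return '{"action": "slow_down", "reason": "pedestrian near the crosswalk"}'
    return '{"action": "proceed", "reason": "road ahead is clear"}'
```

With the stub, `plan_with_language_model("A pedestrian is waiting at the crosswalk.", stub_model)` yields the `slow_down` command together with its reason string. That reason is the interpretability benefit described above, although a stated reason is not proof of what actually drove the decision.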


Section 5 — Software Stack Benchmark Scorecard

| Dimension | Waymo | Tesla | Edge |
| --- | --- | --- | --- |
| Current driverless reliability | Proven: 150,000+ rides per week, 10M+ driverless miles (est.) | Not yet driverless (supervised FSD only) | Waymo |
| Interpretability and debuggability | High (modular) | Low (end-to-end black box) | Waymo |
| Geographic scalability | Lower (HD maps required per city) | Higher (mapless FSD) | Tesla |
| Edge case learning speed | Slower (needs engineering + retraining) | Faster (fleet data, then retrain, then deploy) | Tesla |
| Regulatory certifiability | Higher (rule-based layers, inspectable modules) | Lower (neural net certification unsolved) | Waymo |
| Sensor cost per vehicle | High (~$5K-15K lidar+camera+radar est.) | Low (cameras only) | Tesla |
| Architecture trajectory | Converging toward hybrid (adding end-to-end components) | Converging toward hybrid (adding structured outputs) | Tie: both heading toward hybrid |
| Long-term winner | Uncertain for both: modular wins on safety explainability; end-to-end wins on scalability; hybrid may be the answer | (same assessment as the Waymo column) | Open question; the most important unresolved debate in Physical AI |

The scorecard reveals a fundamental tension that the AV industry has not yet resolved. Waymo is ahead on every dimension that matters most for safety certification and regulatory approval today — interpretability, certifiability, predictable failure modes, and proven commercial driverless operation. Tesla is ahead on every dimension that matters most for commercial scale at speed — geographic reach, edge-case learning, iteration velocity, and hardware cost per vehicle.
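
The tension is visible in a simple tally of the scorecard’s Edge column. The dimension names and edges below are copied from the table; the weighting function is an added assumption, included only to show how a reader’s priorities shift the total.

```python
from collections import Counter
from typing import Dict

# Edge column of the scorecard, one entry per dimension.
SCORECARD: Dict[str, str] = {
    "Current driverless reliability": "Waymo",
    "Interpretability and debuggability": "Waymo",
    "Geographic scalability": "Tesla",
    "Edge case learning speed": "Tesla",
    "Regulatory certifiability": "Waymo",
    "Sensor cost per vehicle": "Tesla",
    "Architecture trajectory": "Tie",
    "Long-term winner": "Open",
}

tally = Counter(SCORECARD.values())


def weighted_totals(weights: Dict[str, float]) -> Dict[str, float]:
    """Sum per-company edges; dimensions missing from `weights` count as 1.0."""
    totals = {"Waymo": 0.0, "Tesla": 0.0}
    for dimension, edge in SCORECARD.items():
        if edge in totals:
            totals[edge] += weights.get(dimension, 1.0)
    return totals
```

Unweighted, the result is three edges each, with one tie and one open question. Tripling the weight of "Regulatory certifiability" moves the total to 5 versus 3 for Waymo, and tripling "Geographic scalability" does the same for Tesla, so the verdict depends on which constraint binds first.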

The technology trajectory suggests that these gaps will narrow: as hybrid architectures mature, end-to-end systems should become more interpretable through auxiliary structured outputs and better explanation tools; as neural certification methods advance, the regulatory advantage of rule-based safety layers may shrink. The open question is timing: whether interpretable end-to-end systems become certifiable before Waymo’s modular system can match Tesla’s fleet-data learning efficiency.

One final observation: the entire framing of “Waymo vs Tesla” obscures the fact that both companies may be pointing toward the same destination via different routes. Waymo’s hybrid experiments and Tesla’s structured outputs are not divergence — they are convergence. The AV architecture debate of the 2020s may ultimately be remembered not as a battle between two irreconcilable paradigms, but as the decade in which the industry learned what a hybrid architecture needed to look like by building the extremes first and discovering what each was missing.

Note: All figures labeled “(est.)” are derived from public disclosures, research publications, analyst estimates, and industry reports as of mid-2026. This article does not constitute investment advice.

