2026-06-18

Tesla FSD End-to-End Architecture — Inside v12's Neural Net and What Changed from Rules to Learning

Tesla FSD v12 replaced 300,000 lines of rules-based C++ with a single end-to-end neural network trained on billions of supervised driving miles.

Article 50 in the Physical AI Benchmark Series — Architecture Deep Dive

Software architecture defines the ceiling for what an autonomous driving system can become. Article 42 in this series documented Waymo’s modular six-layer stack, a system where perception, world modeling, prediction, planning, and control are explicitly separated, each with defined inputs and outputs. Tesla FSD v12 represents the opposite engineering bet: collapse every one of those layers into a single learned neural network, feed it camera video, and train it on billions of miles of human driving behavior until the network learns to drive by itself. This architectural shift, which Tesla deployed publicly in early 2024, is one of the most consequential engineering decisions in the autonomous vehicle industry since the DARPA Grand Challenge. Understanding it precisely (what changed, how the network works, how it is trained, and what v13 and v14 added) is a prerequisite to understanding where the driverless frontier sits today.

All figures marked (est.) are estimates based on publicly available disclosures, engineering analysis, and industry reporting. They have not been independently verified and should be treated as directional rather than precise.

Section 1 — The Architectural Shift: v11 to v12

FSD versions through v11 were modular systems. Perception detected objects and estimated their positions. Lane detection identified road geometry. Path planning computed a feasible trajectory through the scene. A control module converted that trajectory into steering, throttle, and brake commands. Each of these modules was written in C++ with hand-coded logic: rules that engineers specified to handle specific scenarios. Tesla’s v12 release notes describe the end-to-end network as replacing more than 300,000 lines of this explicit C++ code. The rules-based approach had a fundamental scaling problem: each new edge case required new rules, and edge cases are effectively unbounded on public roads.

FSD v12 replaced this entire pipeline with a single end-to-end neural network. Cameras in. Driving actions out. The table below maps every dimension of that change.

Dimension	FSD v11 and earlier	FSD v12 (end-to-end)
Core approach	Modular: perception, lane detection, path planning, control — separate modules with hand-coded rules	End-to-end: cameras to steering, throttle, and brake via a single learned policy
Lines of code	More than 300,000 lines of explicit C++ (Tesla v12 release notes)	Dramatically fewer; most behavior is learned, not coded (est.)
Training signal	Human labels at each module boundary — object bounding boxes, lane line annotations, etc.	Imitation learning from human driver videos — the policy copies what human drivers do
Generalization	Rules break at edge cases; unusual intersection geometry can fail the hand-coded logic	Neural net generalizes across geometries present in training data
Debugging	Per-module: identify which layer failed — perception, prediction, or planning	Black box: harder to isolate why a specific failure occurred
Improvement mechanism	Engineers write more rules; hard to scale beyond a bounded set of scenarios	More data produces a better policy; scales automatically with fleet size
Rollout	FSD v11 = single stack (highway and urban merged, still rules-based)	FSD v12 = end-to-end neural policy across all driving scenarios
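
As a rough illustration of the structural difference the table describes, the sketch below contrasts the two call shapes in Python. Every name in it (Controls, modular_drive, end_to_end_drive, and the toy stand-in functions) is hypothetical and chosen for illustration only; none of it reflects Tesla’s actual code.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = Sequence[float]  # stand-in for one camera image (hypothetical type)


@dataclass
class Controls:
    steering: float  # radians, positive = left (assumed convention)
    throttle: float  # 0..1
    brake: float     # 0..1


def modular_drive(
    frames: List[Frame],
    perceive: Callable[[List[Frame]], list],
    plan: Callable[[list], list],
    control: Callable[[list], Controls],
) -> Controls:
    """Modular shape: each stage has an explicit, inspectable interface."""
    objects = perceive(frames)    # intermediate result can be logged and checked
    trajectory = plan(objects)    # so can this one
    return control(trajectory)


def end_to_end_drive(
    frames: List[Frame],
    policy: Callable[[List[Frame]], Controls],
) -> Controls:
    """End-to-end shape: one learned function from pixels to controls."""
    return policy(frames)


# Toy stand-ins so the sketch runs; real stages would be far more complex.
def toy_perceive(frames): return [sum(f) / len(f) for f in frames]
def toy_plan(objects): return [o * 0.1 for o in objects]
def toy_control(traj): return Controls(steering=traj[0], throttle=0.2, brake=0.0)
def toy_policy(frames):
    return Controls(steering=0.1 * sum(frames[0]) / len(frames[0]), throttle=0.2, brake=0.0)


frames = [[1.0, 3.0], [2.0, 2.0]]
modular = modular_drive(frames, toy_perceive, toy_plan, toy_control)
learned = end_to_end_drive(frames, toy_policy)
```

The modular version exposes `objects` and `trajectory` as points where a failure can be isolated. The end-to-end version has no such intermediate values, which is the debugging trade-off listed in the table.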

The practical effect of this shift was immediate and visible. Drivers who had used FSD v11 reported that v12 behaved in a qualitatively different way: smoother, more human-like, and better at unprotected left turns and complex intersections. The difference came not from engineers adding new rules but from training the network on human drivers executing exactly those scenarios.

Section 2 — How the End-to-End Network Works

Tesla has disclosed the core architecture of FSD v12 at its AI Days and through engineering presentations. The following describes the published components; figures marked (est.) are inferred from public disclosures and engineering analysis.

Inputs

The FSD system uses eight cameras: front, front-left, front-right, rear, rear-left, rear-right, narrow forward, and wide forward. Each camera captures approximately 1.2 megapixels (est.). Critically, the network does not process single frames. It processes video streams, ingesting multiple frames per camera simultaneously to capture motion, depth-from-parallax, and temporal context that a single image cannot provide. Temporal context is structurally required in this architecture: the network must see how a scene is evolving, not just what it looks like at a single instant.
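
One way to picture the multi-frame input is a rolling buffer that holds the last few frames from each camera. The Python sketch below is a minimal illustration under stated assumptions: the VideoContext class and the window length of three frames are hypothetical, since the article does not say how many frames the network ingests, and the frames are placeholder strings rather than images.

```python
from collections import deque
from typing import Deque, Dict, List

# The eight views named in the article.
CAMERAS = [
    "front", "front_left", "front_right", "rear",
    "rear_left", "rear_right", "narrow_forward", "wide_forward",
]


class VideoContext:
    """Keeps the most recent `history` frames per camera (hypothetical helper)."""

    def __init__(self, history: int = 4) -> None:
        self.history = history
        self.buffers: Dict[str, Deque[object]] = {
            name: deque(maxlen=history) for name in CAMERAS
        }

    def push(self, frames: Dict[str, object]) -> None:
        """Append one synchronized set of frames, one per camera."""
        missing = set(CAMERAS) - set(frames)
        if missing:
            raise ValueError(f"missing cameras: {sorted(missing)}")
        for name in CAMERAS:
            self.buffers[name].append(frames[name])

    def ready(self) -> bool:
        """True once every camera has a full temporal window."""
        return all(len(buf) == self.history for buf in self.buffers.values())

    def window(self) -> Dict[str, List[object]]:
        """Oldest-to-newest frames per camera: the clip a network would ingest."""
        return {name: list(buf) for name, buf in self.buffers.items()}


ctx = VideoContext(history=3)
for t in range(5):
    ctx.push({name: f"{name}@t{t}" for name in CAMERAS})
```

After five pushes with a three-frame window, each camera’s buffer holds only the frames from t2 through t4. The oldest frames are dropped automatically because each `deque` is created with `maxlen`.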

Radar, present on older Tesla hardware, was de-emphasized as FSD moved toward camera primacy. Ultrasonic sensors were removed from new production vehicles in some markets. FSD v12 is effectively a camera-only system at inference time.

Architecture: Occupancy Network and Neural Planner

Component	Function
Video Encoder	Processes the multi-camera video stream and produces a spatial-temporal feature representation — the “occupancy network,” a 3D grid encoding which spaces are occupied and which are free
World Model	The occupancy network implicitly models 3D world geometry, other vehicles, pedestrians, and dynamic scene elements — not as labeled objects, but as learned spatial patterns
Neural Planner	Takes the encoded world representation and outputs a trajectory — a sequence of waypoints for the vehicle to follow
Controller	Converts waypoints into steering angle, throttle, and brake commands at the actuator level
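
To make the last row concrete, the sketch below shows one classical way to turn waypoints into a steering angle: the pure pursuit rule on a bicycle model. This is a textbook technique swapped in for illustration, not Tesla’s disclosed controller. The function name, the 8 m lookahead, and the 2.9 m wheelbase are all assumed values.

```python
import math
from typing import List, Tuple

Waypoint = Tuple[float, float]  # (x forward, y left) in metres, vehicle frame


def pure_pursuit_steering(
    waypoints: List[Waypoint],
    lookahead_m: float = 8.0,   # assumed lookahead distance
    wheelbase_m: float = 2.9,   # assumed wheelbase
) -> float:
    """Steering angle (rad, positive = left) toward the first waypoint
    at least `lookahead_m` away; falls back to the last waypoint."""
    if not waypoints:
        raise ValueError("need at least one waypoint")
    target = waypoints[-1]
    for wp in waypoints:
        if math.hypot(*wp) >= lookahead_m:
            target = wp
            break
    x, y = target
    dist_sq = x * x + y * y
    if dist_sq == 0.0:
        return 0.0
    # Curvature of the arc through the vehicle origin and the target,
    # tangent to the current heading.
    curvature = 2.0 * y / dist_sq
    return math.atan(wheelbase_m * curvature)


straight = pure_pursuit_steering([(5.0, 0.0), (10.0, 0.0)])
left = pure_pursuit_steering([(5.0, 0.5), (10.0, 2.0)])
right = pure_pursuit_steering([(5.0, -0.5), (10.0, -2.0)])
```

A straight path yields zero steering, and mirrored paths yield equal and opposite angles. Throttle and brake would need a separate longitudinal rule, which this sketch leaves out.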

The critical insight of v12 is that the boundary between world modeling and planning is not explicit. In Waymo’s six-layer stack, each boundary is a designed interface. In Tesla’s end-to-end network, the separation between “understanding the scene” and “deciding what to do” is implicit in the learned representation. The network decides what matters for driving by observing what human drivers attend to when they act. There is no semantic labeling requirement; the network finds its own scene representation through gradient descent on driving behavior.

Section 3 — Training: Imitation Learning at Fleet Scale

The architectural shift from rules to learning required a corresponding shift in how the system is trained. Supervised learning of individual modules needed labeled bounding boxes, lane annotations, and explicit semantic maps — all of which required human annotators reviewing video frame by frame. FSD v12’s end-to-end training does not require this. The training signal is human driving behavior: what steering angle, throttle level, and brake pressure the human driver applied at each moment.
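
The idea of using driver actions as labels can be shown with a deliberately tiny behavior-cloning example. The Python sketch below fits a two-weight linear policy to synthetic demonstrations by gradient descent on squared error. Everything in it is an assumption made for illustration: the two features (lane offset and heading error), the synthetic "human" gains, and the linear model stand in for video input and a deep network.

```python
from typing import List, Tuple

Sample = Tuple[List[float], float]  # (features, human steering command)


def predict(weights: List[float], features: List[float]) -> float:
    """Linear policy: steering is a weighted sum of the features."""
    return sum(w * f for w, f in zip(weights, features))


def behavior_clone(samples: List[Sample], lr: float = 0.1, epochs: int = 500) -> List[float]:
    """Fit the policy by full-batch gradient descent on mean squared error
    between predicted steering and the human's steering."""
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for features, human_action in samples:
            error = predict(weights, features) - human_action
            for i, f in enumerate(features):
                grad[i] += 2.0 * error * f / len(samples)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Synthetic "human" who steers against lane offset and heading error.
TRUE_GAINS = [-0.5, -1.2]
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
demos: List[Sample] = [
    ([off, head], TRUE_GAINS[0] * off + TRUE_GAINS[1] * head)
    for off in grid
    for head in grid
]
learned = behavior_clone(demos)
```

No one labels lanes or objects here. The only supervision is the action the demonstrator took, and the fitted weights recover the demonstrator’s gains.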

Training component	Detail
Data source	Video from Tesla’s fleet of more than 6 million vehicles, recorded while a human is driving; human driver actions are the supervision signal
Label type	Human driving actions — steering, throttle, brake — not object bounding boxes or lane line annotations
Scale	Billions of video frames; millions of driving clips (est.)
Data curation	Shadow mode runs the FSD policy in parallel with the human driver, without taking control, and identifies clips where the policy would have diverged from human behavior; these edge cases are prioritized in training
Compute	Dojo supercomputer plus NVIDIA H100 clusters; Tesla has not disclosed total training compute budget (est. billions of dollars in aggregate)
Validation	Real-world disengagement rate; simulation regression tests; closed-course testing
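
The shadow-mode curation step in the table can be sketched as a simple filter: compare what the policy would have done with what the driver did, and rank clips by disagreement. The Python below is a minimal sketch under assumptions. The Clip structure, the mean-absolute-steering metric, and the 0.05 threshold are hypothetical; Tesla has not published its selection criteria.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    clip_id: str
    human_steering: List[float]   # what the driver actually did, per frame
    policy_steering: List[float]  # what the shadow policy would have done


def divergence(clip: Clip) -> float:
    """Mean absolute steering disagreement over the clip (0.0 if empty)."""
    pairs = list(zip(clip.human_steering, clip.policy_steering))
    if not pairs:
        return 0.0
    return sum(abs(h - p) for h, p in pairs) / len(pairs)


def mine_clips(clips: List[Clip], threshold: float = 0.05) -> List[Clip]:
    """Keep clips whose divergence exceeds `threshold`, largest first."""
    flagged = [c for c in clips if divergence(c) > threshold]
    return sorted(flagged, key=divergence, reverse=True)


clips = [
    Clip("cruise", [0.00, 0.01, 0.00], [0.00, 0.01, 0.01]),
    Clip("left_turn", [0.20, 0.35, 0.30], [0.05, 0.10, 0.12]),
    Clip("merge", [0.10, 0.12, 0.08], [0.02, 0.03, 0.02]),
]
priority = [c.clip_id for c in mine_clips(clips)]
```

In this toy data the uneventful cruise clip is dropped and the left turn ranks ahead of the merge, so the training set is weighted toward the situations where the policy and the driver disagree most.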

The scaling advantage of this approach is structural. Every Tesla vehicle that drives with a human at the wheel and FSD in shadow mode generates training data automatically. There is no human annotator bottleneck. As Tesla’s fleet drives more miles, the training dataset grows proportionally, and the policy improves. This is the “data flywheel” that Tesla’s AI team has described as a central competitive moat: the more vehicles on the road, the more data; the more data, the better the policy; the better the policy, the more people use FSD; the more people use FSD, the more vehicles generate training data.

Section 4 — v13 and v14: What Changed After v12

FSD v12 established that end-to-end learning could work for supervised autonomous driving. Subsequent versions have refined specific weaknesses and extended the geographic envelope.

Version	Key improvement	When
v12.3	First public end-to-end release; significant quality improvement over v11 in urban driving scenarios; major reduction in phantom braking	Early 2024
v12.5	Improved intersection handling; further phantom braking reduction; highway merge improvements	Mid-2024
v13	Multi-trip memory — vehicle learns specific routes with repeated use; improved highway merge behavior; disengagement rate reduced approximately 30–50% versus v12 (est.)	Late 2024
v13.2	Expanded geographic coverage across additional US states; limited Canada deployment; pedestrian and cyclist handling improvements	Early 2025
v14 (est.)	Highway generalization improvements; continued urban quality gains; Europe limited rollout preparation	2025–2026 (est.)

The disengagement rate trend across FSD versions reflects the impact of the architectural shift. Estimates are based on Tesla public disclosures and California DMV autonomous vehicle report data; direct version-to-version comparison is complicated by changes in driver engagement requirements and reporting methodology.

Era	Est. critical disengagements per 1,000 miles	Notes
v11 era	Approximately 0.09 (est.)	Rules-based system; reported in CA DMV filings
v12 era	Approximately 0.05 (est.)	First end-to-end deployment; major reduction
v13 era	Approximately 0.03 (est.)	Continued improvement on end-to-end foundation
Human driver equivalent	Approximately 0.002 (est.)	Based on NHTSA data; not directly comparable to FSD metric

The gap between v13’s approximately 0.03 and human performance at approximately 0.002 is roughly 15x, a little more than one order of magnitude. This gap defines the core open question for the industry: does the end-to-end approach, continued at scale, close that gap entirely, or does it plateau before reaching the one-in-billion-mile reliability required for truly unsupervised robotaxi deployment?
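
The table’s rates are easier to compare when converted to miles between events. The short Python calculation below uses only the estimated figures from the table above; the variable and function names are illustrative.

```python
# Estimated critical disengagements per 1,000 miles, from the table above.
RATES_PER_1000_MILES = {
    "v11": 0.09,
    "v12": 0.05,
    "v13": 0.03,
    "human": 0.002,
}


def miles_per_event(rate_per_1000: float) -> float:
    """Convert a per-1,000-mile rate into average miles between events."""
    return 1000.0 / rate_per_1000


miles_between = {era: miles_per_event(r) for era, r in RATES_PER_1000_MILES.items()}
gap_v13_vs_human = RATES_PER_1000_MILES["v13"] / RATES_PER_1000_MILES["human"]
reduction_v12_to_v13 = 1.0 - RATES_PER_1000_MILES["v13"] / RATES_PER_1000_MILES["v12"]

for era, miles in miles_between.items():
    print(f"{era}: about {miles:,.0f} miles between events")
print(f"v13 vs human gap: {gap_v13_vs_human:.0f}x")
print(f"v12 to v13 reduction: {reduction_v12_to_v13:.0%}")
```

On these estimates, v13 averages roughly 33,000 miles between critical disengagements against roughly 500,000 for the human-equivalent figure, a gap of about 15x. The 40% drop from v12 to v13 sits inside the 30–50% range quoted for v13 in the version table.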

Section 5 — End-to-End vs. Modular: The Unresolved Debate

Tesla’s v12 architecture shows that end-to-end imitation learning can produce a capable supervised driving policy, and FSD’s estimated disengagement rate has fallen with each version since the transition. But the question of whether it scales to unsupervised driverless operation at verified safety levels is not yet resolved. The debate between Tesla’s approach and Waymo’s modular architecture is the central intellectual argument in autonomous vehicle engineering today.

Claim	Tesla’s bet	Waymo’s counter
Scale to safety	More supervised miles plus a better model will produce emergent safe behavior across all scenarios	Safety at the driverless level requires formal verification, not statistical improvement
Generalization	An end-to-end network trained on enough diverse scenarios generalizes to new environments	A modular system with HD maps and explicit constraints provides hard behavioral bounds that neural nets cannot
Interpretability	Interpretability is not required if the system demonstrably works at scale	Interpretability is required for regulatory certification, liability attribution, and systematic failure investigation
Data efficiency	Billions of supervised miles from the consumer fleet compensate for the absence of purpose-built robotaxi data	Quality driverless miles and targeted simulation are more efficient than uncurated consumer-fleet data

Neither position is obviously wrong. Tesla’s architecture has produced the faster improvement trajectory on supervised driving metrics. Waymo’s architecture has produced the demonstrated driverless commercial service with the stronger verified safety record. These are not yet directly comparable achievements — Tesla has not operated a fully driverless commercial service at scale, and Waymo has not demonstrated a consumer-facing supervised driving product approaching FSD’s usability.

What the comparison clarifies is the nature of the bet each company has made. Tesla is betting that scale, combined with a collapsed architecture, will converge on safety. Waymo is betting that explicit structure and verification are prerequisites to safety that scale alone cannot replace. By 2027 or 2028, at current development trajectories, there will be enough data on both sides of this bet to evaluate it empirically, which is a more interesting outcome than any prediction made today.

Sources: Tesla AI Day 2022 FSD architecture overview (tesla.com/AI); California DMV autonomous vehicle disengagement reports (dmv.ca.gov); Andrej Karpathy Tesla AI Day 2021 (youtu.be/j0z4FweCy4M); Tesla FSD version release notes (tesla.com/support/car-software-updates). All figures marked (est.) are estimates based on publicly available data, engineering analysis, and industry reporting; they have not been independently verified and may differ from primary source data.