2026-06-18
Tesla Dojo Supercomputer — Custom AI Silicon and the Compute Scaling Thesis Behind FSD
Tesla's Dojo D1 silicon powers FSD and Optimus training — the bet that faster throughput compounds into better autonomous driving.
Article 111 in the Physical AI Benchmark Series — Tesla Dojo Supercomputer Deep-Dive: Custom AI Training Silicon, the Compute Scaling Thesis Behind FSD, and How Training Throughput Determines How Fast Autonomous Driving Improves
Training compute is the hidden variable in the autonomous driving race. Everyone watches disengagement rates, robotaxi launches, and safety reports — the visible outputs of the race. But the engine underneath those outputs is training compute: how much data a company can process, how quickly it can run experiments, and how fast it can iterate on the neural network policies that actually drive the cars. Tesla’s Dojo supercomputer is the company’s bet that this variable is so decisive that it justifies building custom silicon from scratch rather than renting GPU time from NVIDIA or using Google’s TPUs.
This is a distinct strategic posture from every other company in the AV space. Waymo uses Google TPUs and NVIDIA GPU clusters — chips designed for general-purpose AI workloads, rented or purchased from established suppliers. Tesla decided that the FSD training workload is specific enough, and the competitive advantage of owning the training compute stack is large enough, that building custom silicon optimized specifically for video training is worth the cost, the engineering complexity, and the multi-year timeline. Understanding why Tesla made that bet, whether the bet is paying off, and what the observable signals are that will confirm or refute it is essential for anyone tracking the physical AI ramp.
Section 1 — Why Training Compute Matters for FSD
The connection between training compute and FSD performance is not intuitive on the surface. FSD runs on a chip inside each Tesla vehicle — the HW4 onboard computer — and that chip does all the real-time inference that steers the car. Dojo is not in the car. Dojo is in a data center. But what Dojo does — training the neural network weights that are eventually deployed to HW4 — determines the quality ceiling of every FSD version.
| Principle | Explanation | FSD implication |
|---|---|---|
| Scaling laws | Neural network performance improves predictably with more compute, more data, and larger models (Chinchilla scaling laws; OpenAI scaling paper) | If FSD follows scaling laws, more training compute = better driving policy — same principle as LLMs getting smarter with more compute |
| Video is compute-hungry | Training on raw camera video (1280x960 x 8 cameras x 36Hz per Tesla’s disclosed spec) generates enormous data volumes; video tokens are expensive to process | FSD v12+ is trained end-to-end on video; training one model iteration requires processing billions of frames |
| Iteration speed | Faster training compute = more experiments per unit time = faster improvement cycle | Teams that can run 10x more experiments find better model architectures faster |
| Data flywheel x compute flywheel | Tesla’s data advantage (6M+ vehicles) only compounds if compute can keep up with data ingestion rate | Without sufficient compute, the data flywheel slows — collected data sits unprocessed |
| Inference vs training | Dojo is for training (finding model weights); each Tesla vehicle uses its onboard HW4 chip for inference (running the model in real time) | Two separate compute problems: Dojo (massive, centralized training) vs HW4 (efficient, distributed inference) |
The scaling law argument is the core of the Dojo thesis. It is empirically established in the LLM world: GPT-4 is better than GPT-3 not because OpenAI found a radically different architecture, but because they trained a much larger model on much more data with much more compute. The question for Tesla is whether the same law holds for autonomous driving — whether more training compute on more video data reliably produces a better driving policy. Tesla’s leadership has explicitly stated this belief, and FSD v12’s end-to-end architecture is the implementation of it.
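To make "scaling law" concrete, the sketch below evaluates the parametric loss form fitted for language models in Hoffmann et al. 2022 (the Chinchilla paper). The constants are that paper's language-model fits. Whether a curve of the same shape holds for driving policies is exactly the open question, so the numbers are illustrative and are not Tesla figures.

```python
# Chinchilla-style parametric loss: L(N, D) = E + A / N**alpha + B / D**beta
# Constants are the language-model fits reported by Hoffmann et al. (2022).
# They are NOT fitted to driving data; using them here is illustrative only.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted training loss for n_params parameters and n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def training_flops(n_params: float, n_tokens: float) -> float:
    """Common approximation: about 6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens


if __name__ == "__main__":
    # Each row scales both model size and data; loss falls but never below E.
    for n, d in [(1e9, 20e9), (10e9, 200e9), (70e9, 1.4e12)]:
        print(f"N={n:.0e} D={d:.0e} FLOPs={training_flops(n, d):.2e} "
              f"loss={predicted_loss(n, d):.3f}")
```

The shape is the point: each extra order of magnitude of compute buys a smaller loss reduction than the last, and the curve flattens toward the irreducible term `E`.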
The video-compute demand is not an abstraction. A single Tesla vehicle with 8 cameras recording at 36Hz generates roughly 290 frames per second. Across a fleet of 6 million vehicles, the data collection rate is staggering. But collection is not the bottleneck — processing is. Running a gradient descent step on a batch of video clips, across an enormous neural network, for millions of iterations, requires compute infrastructure at a scale that commodity GPU rental cannot easily deliver at the cost structure Tesla needs to make the economics work.
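The per-vehicle arithmetic above can be checked in a few lines. The camera count, resolution, and frame rate are the figures quoted in this article. The bytes-per-pixel value and the fraction of the fleet driving at any moment are hypothetical placeholders, since Tesla does not publish them.

```python
# Figures quoted in the article.
CAMERAS = 8
WIDTH, HEIGHT = 1280, 960
FRAME_RATE_HZ = 36
FLEET_SIZE = 6_000_000

# Assumptions for illustration only (not from any Tesla disclosure).
BYTES_PER_PIXEL = 1.5     # hypothetical raw encoding, 12 bits per pixel
ACTIVE_FRACTION = 0.05    # hypothetical share of the fleet driving at once


def frames_per_second(cameras: int = CAMERAS, hz: int = FRAME_RATE_HZ) -> int:
    """Camera frames produced per second by one vehicle."""
    return cameras * hz


def raw_bytes_per_second(bytes_per_pixel: float = BYTES_PER_PIXEL) -> float:
    """Uncompressed sensor bytes per second for one vehicle."""
    return frames_per_second() * WIDTH * HEIGHT * bytes_per_pixel


def fleet_raw_bytes_per_second(active_fraction: float = ACTIVE_FRACTION) -> float:
    """Upper bound if every active vehicle streamed raw video (none do)."""
    return FLEET_SIZE * active_fraction * raw_bytes_per_second()


if __name__ == "__main__":
    print(f"frames/s per vehicle: {frames_per_second()}")
    print(f"raw GB/s per vehicle: {raw_bytes_per_second() / 1e9:.2f}")
    print(f"raw TB/s across active fleet: {fleet_raw_bytes_per_second() / 1e12:.1f}")
```

In practice only selected clips are uploaded, so the fleet figure is a ceiling on what could be collected, not an estimate of what is.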
Section 2 — Dojo D1 Chip Architecture
The Dojo D1 chip is the atom of Tesla’s custom silicon strategy. Tesla first disclosed the key specifications at AI Day 2021 and gave further system details at AI Day 2022. Understanding the architecture requires understanding what problem Tesla was optimizing for: not general-purpose AI computation, but specifically the video training workload.
| Specification | Value | Context |
|---|---|---|
| Process node | TSMC 7nm | Same node as some NVIDIA A100 production runs; not latest node but optimized for cost/density |
| Compute per chip | ~362 TFLOPS BF16 (disclosed) | Comparable to NVIDIA A100 (312 TFLOPS BF16); D1 optimized for bandwidth efficiency |
| On-chip memory | ~440MB SRAM per D1 (354 nodes x 1.25MB each, disclosed) | Roughly an order of magnitude more on-chip SRAM than the A100’s 40MB L2 cache; reduces memory bandwidth bottleneck for video training |
| Interconnect bandwidth | ~10 TB/s chip-to-chip interconnect within a training tile (disclosed) | Key differentiator: D1 chips connect to each other at extremely high bandwidth within a tile; eliminates NVLink-equivalent bottleneck |
| Tile structure | 25 D1 chips per training tile; 120 training tiles per ExaPOD (disclosed) | ExaPOD: 3,000 D1 chips, ~1.1 ExaFLOP BF16 compute |
| ExaPOD spec | ~1.1 ExaFLOP BF16 (disclosed target) | One ExaPOD = ~1 ExaFLOP; multiple ExaPODs in production (est.) |
| Key design philosophy | Eliminate the CPU-GPU memory hierarchy bottleneck; D1 is a unified compute fabric where chips communicate peer-to-peer at extremely high bandwidth | Traditional GPU training is bottlenecked by CPU→GPU data transfer and NVLink bandwidth; D1 bypasses this for video workloads |
The ~440MB on-chip SRAM figure deserves specific attention. Standard GPU architectures use off-chip DRAM (HBM on data-center parts) as the primary memory pool, backed by tens of megabytes of on-chip cache (40MB of L2 on the A100). That is fast enough for general AI workloads, but it imposes a bandwidth ceiling when training on large video clips where adjacent frames must be processed together. The D1’s much larger on-chip SRAM keeps more data close to the compute units, reducing the frequency of expensive off-chip memory accesses. For video training, where temporal coherence across frames is critical, this architectural choice translates directly into training efficiency.
The tile interconnect bandwidth (~10 TB/s chip-to-chip within a tile) is the second key differentiator. NVIDIA’s NVLink interconnect between A100s runs at roughly 600 GB/s bidirectional. Taken at face value, the D1’s intra-tile figure is roughly 17x that (10 TB/s / 0.6 TB/s ≈ 16.7). When training a neural network on video data, the communication between compute nodes during backpropagation is a major bottleneck. Higher interconnect bandwidth directly reduces the fraction of time compute units spend waiting for gradient synchronization.
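A rough way to see why interconnect bandwidth matters is to estimate how long one gradient synchronization takes. The sketch below uses the standard bandwidth-only cost model for a ring all-reduce and the two bandwidth figures quoted above. The gradient size and device count are hypothetical, latency is ignored, and nothing here is known to reflect how Dojo actually schedules communication.

```python
def ring_allreduce_seconds(payload_bytes: float, devices: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of one ring all-reduce (latency ignored).

    In a ring all-reduce each device sends 2 * (devices - 1) / devices
    times the payload over its link.
    """
    if devices < 2:
        return 0.0
    traffic_bytes = 2.0 * (devices - 1) / devices * payload_bytes
    return traffic_bytes / link_bytes_per_s


# Bandwidth figures quoted in the article, in bytes per second.
NVLINK_A100 = 600e9
D1_INTRA_TILE = 10e12

# Hypothetical workload: 1 billion parameters with BF16 gradients (2 bytes
# each), synchronized across 25 devices (the chip count of one tile).
GRADIENT_BYTES = 2e9
DEVICES = 25

if __name__ == "__main__":
    for name, bandwidth in [("NVLink (A100)", NVLINK_A100),
                            ("D1 intra-tile", D1_INTRA_TILE)]:
        ms = ring_allreduce_seconds(GRADIENT_BYTES, DEVICES, bandwidth) * 1e3
        print(f"{name}: {ms:.3f} ms per gradient sync")
```

Under this model the sync time scales inversely with link bandwidth, so the ratio of the two results is simply the ratio of the two quoted bandwidths. Real systems overlap communication with compute, which narrows the practical gap.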
The ExaPOD is the deployable unit: 25 D1 chips per tile, 120 tiles per ExaPOD, giving 3,000 D1 chips and approximately 1.1 ExaFLOP of BF16 compute per ExaPOD. For context, an ExaFLOP is 10^18 floating point operations per second — a scale that was the domain of national supercomputing facilities as recently as 2022.
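The ExaPOD arithmetic is simple enough to verify directly. The chip, tile, and per-chip throughput figures are the ones quoted above. The `utilization` parameter is a hypothetical knob added here to separate peak from sustained throughput, which Tesla has not published.

```python
# Figures quoted in the article.
CHIPS_PER_TILE = 25
TILES_PER_EXAPOD = 120
TFLOPS_PER_CHIP_BF16 = 362  # peak throughput per D1


def exapod_chips() -> int:
    """Number of D1 chips in one ExaPOD."""
    return CHIPS_PER_TILE * TILES_PER_EXAPOD


def exapod_exaflops(utilization: float = 1.0) -> float:
    """BF16 ExaFLOPS for one ExaPOD; utilization=1.0 gives the peak figure."""
    tflops = exapod_chips() * TFLOPS_PER_CHIP_BF16 * utilization
    return tflops / 1e6  # 1 ExaFLOP = 1e6 TFLOPS


if __name__ == "__main__":
    print(f"chips per ExaPOD: {exapod_chips()}")
    print(f"peak ExaFLOPS: {exapod_exaflops():.3f}")
    # Hypothetical sustained utilization, for illustration only.
    print(f"at 40% utilization: {exapod_exaflops(0.4):.3f}")
```

The peak works out to 1.086 ExaFLOPS, which is where the "~1.1 ExaFLOP" headline number comes from.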
Section 3 — Dojo vs NVIDIA GPU Cluster Comparison
The case for Dojo is not that D1 is a better chip than H100 in absolute terms. NVIDIA’s H100 is an exceptional piece of silicon with a mature software ecosystem and broad applicability. The case for Dojo is that owning a vertically integrated training stack — chip, software, training pipeline, all optimized for one workload — produces a strategic advantage that cannot be replicated by renting H100s, even if the per-FLOP compute specs are comparable.
| Dimension | Tesla Dojo (D1 / ExaPOD) | NVIDIA H100/H200 cluster |
|---|---|---|
| Hardware ownership | Custom silicon; Tesla owns the full stack (chip → software → training pipeline) | Third-party; pay per GPU or buy hardware; NVIDIA controls the roadmap |
| Video training efficiency | Optimized specifically for video (large SRAM, high chip-to-chip bandwidth); advantage for FSD workload (est.) | General purpose; excellent for transformer training; video training works but not specifically optimized |
| Software stack | Tesla-proprietary; no CUDA compatibility; requires custom ML framework | CUDA ecosystem; PyTorch / JAX / TF all have optimized CUDA backends; vast tooling |
| Capital cost | Very high upfront (custom chip design, TSMC fabrication and packaging, data-center infrastructure) | Rental or purchase; OpEx-friendly; H100 ~$30K-$40K/unit (est.) |
| Flexibility | Dojo optimized for Tesla’s specific workload; harder to repurpose | H100 cluster can run any workload; repurposable |
| Scale ceiling | Limited by Tesla’s own buildout pace; ExaPOD production rate | NVIDIA ships H100-class GPUs at far higher volume, though buyers remain subject to its allocation (est.) |
| Vendor risk | Tesla controls supply; no vendor dependency | Subject to NVIDIA pricing, allocation priorities, export controls |
| Current capacity | Multiple ExaPODs operational; exact capacity not disclosed; Tesla has stated Dojo is in production training use (est.) | Waymo uses Google TPUs (Alphabet-internal) + NVIDIA GPUs (est.) |
The software stack point is the most underappreciated element of this comparison. CUDA, first released in 2007, has a head start of well over a decade. Every major ML framework has optimized CUDA backends maintained by teams of experts. Nearly every paper that benchmarks a new training technique uses CUDA. Most researchers who join Tesla from a university have spent years on CUDA. Tesla’s decision to build custom silicon that is not CUDA-compatible means building and maintaining a parallel software stack, attracting engineers willing to work outside the CUDA ecosystem, and implementing every training optimization from scratch rather than inheriting them from the PyTorch community.
This is an enormous software cost that does not show up in chip specifications. The question is whether the hardware advantage is large enough to justify that software overhead. Tesla’s leadership has said yes — but the answer will only be definitively knowable when external observers can compare FSD improvement rates on Dojo-trained models versus what the same data and training budget would have produced on NVIDIA hardware.
Section 4 — HW4: Inference at the Edge
Dojo trains the models. HW4 runs them. The two compute problems are separated by the deployment pipeline: training produces model weights, weights are compressed and optimized for inference, and the resulting model is shipped to vehicles via over-the-air update. HW4 is what executes the model in real time while the car is moving.
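Tesla has not described how it compresses weights for HW4, so the following is only a sketch of one widely used technique, symmetric int8 quantization, to show what "compressed and optimized for inference" can mean in practice. The function names are hypothetical and the example uses plain Python lists rather than any real deployment toolchain.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to integers in [-127, 127].

    Returns the integer weights and the scale needed to recover floats.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Approximate reconstruction of the original float weights."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    original = [0.82, -1.0, 0.031, 0.0, -0.4]
    q, scale = quantize_int8(original)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(original, restored))
    print(f"int8 weights: {q}")
    print(f"worst-case reconstruction error: {worst:.5f}")
```

Storing one byte per weight instead of two or four shrinks the model and lets integer hardware do the arithmetic, at the cost of a small, bounded rounding error per weight.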
| Specification | HW4 (Tesla’s current onboard chip) | HW3 (predecessor) |
|---|---|---|
| TOPS (Tera Operations Per Second) | ~720 TOPS (disclosed) | 144 TOPS |
| Improvement | ~5x vs HW3 | — |
| Process node | TSMC 4nm (est.) | Samsung 14nm |
| Cameras supported | Up to 8 cameras at full resolution | 8 cameras (same) |
| Network bandwidth | Ethernet-based sensor network (vs CAN bus on older designs) | CAN bus |
| FSD version | HW4 required for FSD v12+ end-to-end (est.); HW3 runs older FSD versions | Runs FSD up to v11 (est.) |
| HW4 fleet penetration | All new Tesla vehicles since ~2023 have HW4; HW3 fleet still large (est.) | HW3 vehicles are an upgrade challenge — hardware retrofit required for full FSD v12+ benefit |
| Cost | Not disclosed separately; part of vehicle manufacturing cost | — |
The HW3-to-HW4 transition reveals a structural challenge in the AV industry that is not specific to Tesla: the onboard inference hardware determines what FSD versions a vehicle can run. HW3 vehicles cannot run FSD v12+ at full capability because the model is larger than HW3 can execute at real-time frame rates (est.). This means the entire HW3 fleet — every Tesla sold before approximately 2023 — is running an older, less capable FSD version regardless of how much Dojo-powered improvement happens in training.
The 5x TOPS improvement (144 to ~720 TOPS) is the headroom that FSD v12’s end-to-end architecture needs. The shift from a modular FSD architecture (where individual components handle specific tasks like lane detection, object classification, and path planning) to an end-to-end architecture (where a single large neural network takes in raw camera frames and outputs driving commands) demands significantly more inference compute. HW4 provides that compute; HW3 does not, at least not at the model sizes FSD v12 uses.
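One way to read the TOPS gap is as a per-frame compute budget. The sketch below divides the quoted peak TOPS by the camera frame rate used earlier in this article (8 cameras at 36Hz). It assumes peak throughput is fully usable, which real workloads never achieve, so treat the results as upper bounds.

```python
def tera_ops_per_frame(tops: float, cameras: int = 8, hz: float = 36.0) -> float:
    """Peak tera-operations available per camera frame at real-time rates."""
    return tops / (cameras * hz)


# TOPS figures quoted in the article.
HW3_TOPS = 144.0
HW4_TOPS = 720.0

if __name__ == "__main__":
    for name, tops in [("HW3", HW3_TOPS), ("HW4", HW4_TOPS)]:
        budget = tera_ops_per_frame(tops)
        print(f"{name}: {budget:.1f} tera-ops per frame (peak)")
```

A model that needs more operations per frame than the HW3 budget allows cannot keep up with the cameras on that hardware, no matter how well it was trained.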
Section 5 — Dojo as a Benchmark Signal
For the Physical AI benchmark series, Dojo is not just a chip — it is a set of observable signals that reveal whether Tesla’s compute scaling thesis is working. The signals are specific, trackable, and accumulate over time.
| Signal | What to watch | Why it matters |
|---|---|---|
| ExaPOD count | How many ExaPODs are operational and training FSD | Direct proxy for training compute available; more ExaPODs = faster model iteration |
| Training run frequency | How often Tesla ships a new FSD version | FSD update cadence (weekly/monthly/quarterly) reflects training throughput |
| Disengagement rate trend | Critical disengagement rate per 1,000 miles over time | If Dojo scaling law thesis is correct, disengagement rate should continue declining as compute scales |
| Dojo vs cloud cost | Whether Dojo delivers better cost/FLOP than renting NVIDIA H100s | If Dojo is more expensive than cloud at scale, the custom silicon bet fails economically |
| HW4 fleet penetration | % of Tesla FSD fleet on HW4 | HW4 vehicles get the most capable FSD; HW3 vehicles are compute-constrained at inference |
| Optimus training integration | Whether Dojo is also training Optimus policies (generalist robot) | If Dojo trains both FSD and Optimus, compute allocation becomes a strategic variable |
The most actionable of these signals is FSD update cadence. If Dojo is producing training throughput at the scale Tesla claims, the frequency of FSD model updates should be measurable. Weekly updates would indicate a functioning high-throughput training pipeline. Quarterly updates would suggest either that the training pipeline is the bottleneck, or that deployment cycles are gated by something other than compute.
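Tracking cadence needs nothing more than a list of release dates. The sketch below computes the median gap between releases and buckets it. The dates are made up for illustration and the bucket thresholds are arbitrary cut-offs chosen here, not an established convention.

```python
from datetime import date
from statistics import median
from typing import List


def median_gap_days(release_dates: List[date]) -> float:
    """Median number of days between consecutive releases."""
    ordered = sorted(release_dates)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two release dates")
    return float(median(gaps))


def classify_cadence(gap_days: float) -> str:
    """Bucket a release gap; the thresholds are arbitrary for this sketch."""
    if gap_days <= 10:
        return "weekly"
    if gap_days <= 45:
        return "monthly"
    return "quarterly or slower"


if __name__ == "__main__":
    # Hypothetical release dates, not actual FSD release history.
    releases = [date(2026, 1, 5), date(2026, 2, 9), date(2026, 3, 16), date(2026, 4, 13)]
    gap = median_gap_days(releases)
    print(f"median gap: {gap:.0f} days -> {classify_cadence(gap)}")
```

The median is used rather than the mean so that one long holiday pause or one hotfix does not distort the picture.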
The Optimus integration signal is the longest-horizon one but potentially the most revealing strategically. Tesla has publicly stated that Dojo is being used to train both FSD and Optimus policies. If that is correct, Dojo is not just an AV training system — it is the foundation for Tesla’s entire physical AI ambition. The compute budget allocation between FSD and Optimus training, and how that allocation shifts over time, will reveal Tesla’s prioritization between its two large physical AI bets.
The disengagement rate trajectory is the lagging indicator that ultimately validates or refutes the thesis. The leading indicators — ExaPOD count, FSD update frequency, HW4 fleet penetration — tell you whether the compute inputs are in place. The disengagement rate tells you whether those inputs are translating into a better driving policy at the pace the scaling thesis predicts.
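The lagging indicator can be tracked with two small calculations: a normalized rate and its period-over-period change. The sketch below uses invented numbers. Reported disengagement data also varies in how events are defined, so comparisons across companies need more care than this.

```python
from typing import List


def disengagements_per_1000_miles(events: int, miles: float) -> float:
    """Normalize a raw disengagement count to a per-1,000-mile rate."""
    if miles <= 0:
        raise ValueError("miles must be positive")
    return events * 1000.0 / miles


def period_over_period_change(rates: List[float]) -> List[float]:
    """Fractional change between consecutive periods (negative = improving)."""
    if any(rate <= 0 for rate in rates[:-1]):
        raise ValueError("earlier rates must be positive")
    return [(later - earlier) / earlier for earlier, later in zip(rates, rates[1:])]


if __name__ == "__main__":
    # Hypothetical (events, miles) per quarter, for illustration only.
    quarters = [(400, 200_000), (330, 220_000), (300, 250_000)]
    rates = [disengagements_per_1000_miles(e, m) for e, m in quarters]
    for rate in rates:
        print(f"{rate:.2f} per 1,000 miles")
    print([f"{change:+.0%}" for change in period_over_period_change(rates)])
```

A scaling thesis predicts the changes stay negative as compute grows; a run of values near zero is what a plateau looks like in this data.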
Section 6 — Strategic Context: What Dojo Means for the AV Competitive Landscape
The Dojo investment cannot be evaluated in isolation. It is a strategic choice that reveals how Tesla thinks about the AV race relative to competitors — and that thinking has implications for every company in the physical AI space.
The fundamental bet is that autonomous driving is a training compute problem more than it is a data collection problem, a sensor problem, or a mapping problem. Waymo has excellent maps, excellent sensor fusion, and access to Google’s compute resources. But on Tesla’s view, Waymo’s training loop is slower because its data collection scale (thousands of vehicles versus millions) is fundamentally smaller. If training compute and data volume are the primary determinants of driving-policy quality, Waymo’s sensor advantage is insufficient to close the gap.
Tesla’s alternative hypothesis, that camera-only sensing is sufficient for AV if trained on enough data with enough compute, is the architectural expression of this belief. Lidar sensors provide high-fidelity 3D point clouds, but they are expensive, lidar-based stacks typically rely on HD maps, and they generate data that is structurally different from the data Tesla’s neural network is trained on. If camera-only, compute-scaled training produces a driving policy that outperforms lidar-assisted systems trained on less data, the Dojo investment is vindicated.
The timing of that vindication matters. Custom silicon takes years to develop and deploy. Tesla broke ground on Dojo when NVIDIA’s A100 was the state of the art; by the time Dojo reached production scale, NVIDIA had shipped H100 and was previewing H200 and Blackwell. The question is not whether D1 is better than H100 in isolation — it probably is not, on general metrics. The question is whether the D1-based system, with its video-specific architecture and Tesla’s custom software stack running on it, delivers better training throughput per dollar for Tesla’s specific workload than renting H100s would. That is an empirical question that Tesla’s training results over the next 12-24 months will answer.
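"Training throughput per dollar" can be made precise with a small comparison. The sketch below is one possible framing: cost per useful PFLOP-hour, where useful means peak throughput multiplied by the utilization achieved on the real workload. Every number in the example is a placeholder, since neither Tesla's Dojo costs nor its achieved utilization are public.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    hourly_cost_usd: float  # amortized capex plus power and staff, or rental price
    peak_pflops: float      # peak BF16 throughput in PFLOPS
    utilization: float      # fraction of peak achieved on the actual workload

    def usd_per_useful_pflop_hour(self) -> float:
        """Cost of one hour of one PFLOPS of achieved (not peak) throughput."""
        if not 0.0 < self.utilization <= 1.0:
            raise ValueError("utilization must be in (0, 1]")
        return self.hourly_cost_usd / (self.peak_pflops * self.utilization)


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    candidates = [
        Cluster("custom silicon", hourly_cost_usd=9_000.0, peak_pflops=1_000.0, utilization=0.55),
        Cluster("rented GPUs", hourly_cost_usd=12_000.0, peak_pflops=1_000.0, utilization=0.40),
    ]
    for cluster in sorted(candidates, key=Cluster.usd_per_useful_pflop_hour):
        print(f"{cluster.name}: ${cluster.usd_per_useful_pflop_hour():.2f} per useful PFLOP-hour")
```

The design choice worth noting is that utilization sits in the denominator: a chip with lower peak specs can still win if its architecture and software keep it busier on the video workload.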
Section 7 — What to Watch in 2026 and Beyond
The observable signals that will reveal whether Dojo is delivering on its thesis are accumulating now. The Physical AI benchmark series will track these signals as they develop.
| Signal | Timing | What it reveals |
|---|---|---|
| ExaPOD count disclosures | Quarterly earnings context (Tesla IR) | Whether Tesla is expanding Dojo capacity at the pace the thesis requires |
| FSD v12+ monthly release rate | Ongoing | Training throughput proxy; more frequent releases = more Dojo cycles per unit time |
| HW4 fleet percentage | Vehicle delivery reports (quarterly) | What fraction of FSD subscribers can actually run the latest end-to-end model |
| Disengagement rate trajectory | CA DMV annual report (est. year-end) + Tesla voluntary data | The lagging indicator that validates or refutes the scaling thesis |
| Dojo ExaFLOP capacity | Tesla AI/product events (est.) | Total Dojo training capacity; compare to Alphabet and Waymo compute disclosures |
| Optimus policy training confirmation | Tesla events; earnings calls | Whether Dojo compute is split between FSD and Optimus, and how |
| NVIDIA exposure reduction | Tesla capex disclosures | Whether Dojo is genuinely replacing NVIDIA GPU rental or supplementing it |
The 2026 signal environment is particularly rich because Tesla has publicly committed to significant Dojo expansion. If ExaPOD count increases and FSD update cadence does not accelerate, the bottleneck is not compute but something else — data pipeline, model architecture, or the scaling law itself not applying cleanly to driving. If compute expands and FSD cadence increases but disengagement rates plateau, the thesis that compute directly translates to driving quality is weaker than claimed.
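The reasoning in the paragraph above amounts to a small decision table, sketched below. The three boolean inputs are simplifications introduced for this example. In practice each would be a judgment call over several quarters of noisy data.

```python
def interpret_signals(compute_expanding: bool,
                      cadence_accelerating: bool,
                      disengagements_falling: bool) -> str:
    """Map the three tracked signals to the reading suggested in the text."""
    if not compute_expanding:
        return "inconclusive: compute has not scaled, so the thesis is untested"
    if not cadence_accelerating:
        return ("bottleneck is not compute: look at data pipeline, "
                "model architecture, or the scaling law itself")
    if not disengagements_falling:
        return "thesis weakened: more compute and releases, but quality has plateaued"
    return "consistent with the compute scaling thesis"


if __name__ == "__main__":
    for signals in [(True, False, False), (True, True, False), (True, True, True)]:
        print(signals, "->", interpret_signals(*signals))
```

The ordering of the checks mirrors the causal chain the thesis assumes: compute first, then iteration speed, then driving quality.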
Each of these outcomes teaches something specific about the physical AI ramp — not just for Tesla, but for every company in the space. The Dojo experiment is running at scale, with real vehicles, in real traffic, with observable outputs. That makes it one of the most informative experiments in the history of autonomous vehicles.
Note: Figures labeled “(est.)” are directional estimates based on publicly available information as of mid-2026. Dojo capacity, ExaPOD counts, and training compute details are not fully publicly disclosed by Tesla. This article does not constitute investment advice.
Sources
- Tesla AI Day 2022, Dojo D1 chip presentation (Tesla)
- Tesla HW4 autopilot computer (Tesla)
- Chinchilla scaling laws, Hoffmann et al. 2022 (arXiv)
- NVIDIA H100 specifications (NVIDIA)
- Tesla Q1 2026 earnings, compute infrastructure disclosures (Tesla IR)