2026-06-18
Tesla Dojo vs. Cloud Compute — The Build-vs-Buy Decision Behind FSD and Optimus Training
How Tesla's custom Dojo cluster compares to renting H100/B200 cloud compute — architecture, economics, and strategic implications for FSD and Optimus.
Article 34 in the Physical AI Benchmark Series — AI Training Infrastructure Analysis
Tesla is building one of the most ambitious custom AI training clusters in the world. Dojo — Tesla’s purpose-built supercomputer — represents a fundamental bet that owning compute infrastructure at scale is cheaper, faster, and more strategically defensible than renting it from Amazon, Google, or Microsoft. This article examines that bet in detail: what Dojo is, how it compares to renting NVIDIA H100 or B200 clusters from the major cloud providers, the build-vs-buy economics, and what Dojo means for Tesla’s long-term AI training cost structure for both FSD and Optimus.
Section 1 — Dojo Architecture Overview
Dojo is a ground-up Tesla-designed training system, not a derivative of any existing vendor architecture. The unit of composition starts at the chip level and scales through tiles, ExaPODs, and eventually multi-ExaPOD clusters.
| Component | Specification |
|---|---|
| Custom chip | D1 (Dojo 1): TSMC 7nm process, approximately 362 TFLOPS peak BF16/CFP8 performance, approximately 400 W TDP per chip, 900 GB/s memory bandwidth |
| Training tile | 25 D1 chips on one tile; approximately 9 PFLOPS (BF16/CFP8) and roughly 15 kW per tile |
| ExaPOD | 120 training tiles (3,000 D1 chips) per ExaPOD; approximately 1.1 EFLOPS (BF16/CFP8) total; spans 10 cabinets |
| Target cluster scale | Multiple ExaPODs; Tesla targeting 1 EFLOP (exaFLOP) or more of training capacity by end-2025 / 2026 (est.) |
| Interconnect | Custom high-bandwidth D1-to-D1 direct links — avoids the PCIe bandwidth bottleneck that limits GPU-to-GPU communication in conventional clusters |
| Primary use case | End-to-end FSD neural network training; Optimus robot policy training. NOT used for inference (inference runs on Tesla’s FSD Hardware in vehicles). |
| Hybrid approach | Tesla also rents NVIDIA A100/H100 clusters from cloud providers for burst training workloads alongside Dojo |
Why the interconnect matters: Standard GPU clusters communicate chip-to-chip over PCIe or NVLink, creating bandwidth bottlenecks that limit how tightly training jobs can be parallelized. Dojo’s D1-to-D1 links are designed around the specific communication patterns of Tesla’s training workloads — primarily large video data batches for FSD perception models. The architecture trades general-purpose flexibility for optimized throughput on these specific workload types.
Scale context: One EFLOP (exaFLOP) is 10^18 floating-point operations per second. For perspective, the US Department of Energy’s Frontier supercomputer, the world’s first publicly confirmed exascale computer, crossed 1 EFLOP on the FP64 Linpack benchmark in 2022. Dojo’s headline figures are quoted at lower, BF16-class precision, so the two numbers are not directly comparable. Tesla is attempting to reach exascale training capacity using custom silicon rather than off-the-shelf hardware.
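The chip-to-tile-to-ExaPOD roll-up can be checked with a few lines of arithmetic. The sketch below is illustrative only: the class and method names are hypothetical, the 25 chips per tile and 120 tiles per ExaPOD come from the table, and the 362 TFLOPS per-chip figure is an assumption taken from Tesla’s AI Day 2021 presentation (peak BF16/CFP8). Substitute another per-chip value to see how the totals move.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DojoHierarchy:
    """Peak-throughput roll-up from chip to tile to ExaPOD (hypothetical helper)."""

    chip_tflops: float            # assumed peak per-chip throughput, in TFLOPS
    chips_per_tile: int = 25      # from the architecture table
    tiles_per_exapod: int = 120   # from the architecture table

    def tile_pflops(self) -> float:
        # 1 PFLOPS = 1,000 TFLOPS
        return self.chip_tflops * self.chips_per_tile / 1_000

    def exapod_pflops(self) -> float:
        return self.tile_pflops() * self.tiles_per_exapod

    def exapods_for(self, target_eflops: float) -> float:
        # 1 EFLOPS = 1,000 PFLOPS
        return target_eflops * 1_000 / self.exapod_pflops()


# Assumption: 362 TFLOPS peak BF16/CFP8 per D1 chip.
dojo = DojoHierarchy(chip_tflops=362.0)
print(f"Tile:   {dojo.tile_pflops():.2f} PFLOPS")
print(f"ExaPOD: {dojo.exapod_pflops():.0f} PFLOPS")
print(f"ExaPODs needed for 1 EFLOP: {dojo.exapods_for(1.0):.2f}")
```

These are peak figures; sustained training throughput depends on interconnect and software efficiency, which this sketch does not model.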
Section 2 — Build vs. Buy Economics
The financial comparison between Dojo and cloud NVIDIA compute is not straightforward. The outcome depends heavily on utilization rates, time horizon, and which cost components are included. All figures below are estimates based on publicly available data and industry analysis.
| Metric | Tesla Dojo (build) | Cloud NVIDIA H100 (buy) | Notes |
|---|---|---|---|
| Capital cost per ExaPOD (est.) | $300M–$500M+ (est.) | $0 upfront | Dojo requires massive capex investment; cloud converts capex to opex |
| Operational cost per PFLOP-hour (est.) | $0.05–$0.15 (est., at scale) | $0.50–$2.00 (cloud spot/on-demand, est.) | Dojo cost advantage materializes only at high utilization |
| Break-even utilization (est.) | 60–80% (est.) | N/A | Below this threshold, cloud is cheaper per unit of compute delivered |
| Flexibility | Low — fixed architecture, difficult to upgrade mid-generation | High — rent latest NVIDIA silicon (B200/Blackwell) within days of availability | |
| Latency to new hardware | 3–5 years per chip generation | Days — cloud adds newest NVIDIA silicon as it becomes available | |
| Data security | High — Tesla training data never leaves Tesla-controlled infrastructure | Medium — cloud providers offer contractual protections, but data traverses shared infrastructure | |
| Vendor risk | Tesla-controlled — no dependency on NVIDIA pricing or availability for training capacity | Exposed to NVIDIA pricing power and hardware availability cycles | |
| Break-even point (est.) | 4–6 years of heavy utilization (est.) | N/A: pay-as-you-go model with no fixed payback period | |
Reading the economics: The key variable is utilization. At 80%+ utilization sustained over four or more years, Dojo’s per-PFLOP cost falls well below cloud rates. At 40% utilization or below, the amortized capex cost per unit of compute delivered likely exceeds what Tesla would pay renting H100 clusters on demand. This makes Dojo’s economics inherently tied to Tesla’s ability to generate training workloads at scale — which is itself tied to FSD rollout speed, Optimus production volume, and the continued growth of Tesla’s labeled driving data corpus.
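The utilization argument can be made concrete with a minimal cost model. The sketch below is an illustration under stated assumptions, not Tesla’s actual cost structure: the function names are hypothetical, and all inputs are normalized so that the cloud price per unit of compute equals 1.0. The capex and opex rates (0.6 and 0.1) are invented for illustration and chosen only so that the break-even lands inside the 60–80% band quoted in the table.

```python
from typing import Optional


def owned_cost_per_unit(utilization: float, capex_rate: float, opex_rate: float) -> float:
    """Cost per *delivered* unit of compute on owned hardware.

    capex_rate: amortized capex per unit of installed capacity (paid whether used or not)
    opex_rate:  running cost per delivered unit (power, cooling, staff)
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    # Idle capacity still carries capex, so the capex share grows as utilization falls.
    return capex_rate / utilization + opex_rate


def break_even_utilization(capex_rate: float, opex_rate: float,
                           cloud_rate: float) -> Optional[float]:
    """Utilization at which owned cost equals the cloud price, or None if unreachable."""
    margin = cloud_rate - opex_rate
    if margin <= 0:
        return None  # running costs alone already exceed the cloud price
    utilization = capex_rate / margin
    return utilization if utilization <= 1.0 else None


# Hypothetical normalized inputs: cloud price = 1.0 per unit of compute.
CLOUD, CAPEX, OPEX = 1.0, 0.6, 0.1
print(f"break-even utilization: {break_even_utilization(CAPEX, OPEX, CLOUD):.0%}")
for u in (0.4, 0.8):
    print(f"utilization {u:.0%}: owned cost = {owned_cost_per_unit(u, CAPEX, OPEX):.2f} vs cloud {CLOUD:.2f}")
```

With these inputs the owned cluster is cheaper than cloud at 80% utilization and more expensive at 40%, which is the shape of the argument above. The actual threshold depends entirely on the real capex, opex, and cloud prices.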
The cloud pricing comparison above reflects H100 on-demand and spot rates observed in 2024–2025. NVIDIA Blackwell (B200) cloud availability is expanding through 2026 and may shift the comparison further toward cloud, since B200 is reported to deliver substantially more performance per dollar than H100 on current benchmarks (est.).
Section 3 — The Strategic Case for Dojo
Tesla’s stated rationale for Dojo goes beyond raw cost economics. Five strategic arguments are particularly compelling.
1. Data security and IP protection
Tesla’s FSD training data, billions of miles of labeled driving video from the global Tesla fleet, is among the most competitively sensitive proprietary datasets in the technology sector. Routing this data through cloud providers introduces IP and competitive-intelligence risk, even under contractual NDAs. Training entirely on owned infrastructure removes that exposure. For a company whose AI moat is fundamentally a data moat, this is not a trivial concern.
2. Custom silicon optimization
NVIDIA GPUs are designed to be general-purpose accelerators across a wide range of workloads. Dojo’s D1 chip is designed specifically for Tesla’s training workload profile: high-throughput video data ingestion, end-to-end neural network training on camera inputs, and large-scale data-parallel training jobs. Custom silicon optimized for a specific workload type can achieve 2–5x better performance-per-watt compared to general-purpose accelerators on that targeted workload (est.) — though this advantage is narrow and does not generalize beyond the intended use case.
3. Vendor independence and supply security
The NVIDIA H100 shortage of 2023–2024 demonstrated the risk of depending on a single vendor for critical AI infrastructure. During the shortage period, cloud spot pricing for H100 instances surged 3–5x (est.) relative to pre-shortage baselines. Companies with prior access agreements maintained compute access; those without faced training delays. Dojo gives Tesla training capacity that does not depend on NVIDIA’s pricing or supply allocation decisions, although D1 production still depends on TSMC foundry capacity.
4. Optimus data flywheel lock-in
As Optimus scales from prototype to mass production, it generates an entirely new category of training data: humanoid robot interaction data, manipulation task demonstrations, and policy feedback signals. Training increasingly capable humanoid policies requires continuous compute at scale. If Optimus reaches 50,000+ units deployed, the data generation rate and associated training compute demand could exceed what FSD training currently requires. Owning the compute layer means Optimus training costs are a function of Tesla’s own silicon economics, not an external vendor’s pricing structure.
5. Potential external revenue stream
Tesla has publicly referenced the possibility of offering Dojo compute capacity as a service to external AI companies. If Dojo reaches exaFLOP-scale and Tesla’s own utilization leaves headroom, selling access to spare capacity represents a new revenue stream in a market where compute scarcity is persistent. This option has no value if Dojo remains underutilized — but at high utilization with overflow demand, it becomes a real business.
Section 4 — The Case Against Dojo (Bear Thesis)
The strategic arguments for Dojo are real, but so are the counterarguments. Four bear-case concerns are worth taking seriously.
1. Opportunity cost of capex
Every dollar deployed in Dojo capex ($300M–$500M+ per ExaPOD, est.) could instead fund access to an estimated 5–10x more NVIDIA H100 or B200 compute in the short term. Cloud converts capex to opex, and cloud providers achieve economies of scale in hardware procurement that Tesla cannot match at comparable volume. If training velocity (iterations per unit of time) matters more than long-run cost efficiency, cloud may generate faster FSD improvement even at a higher cost per PFLOP.
2. Architecture obsolescence
Dojo D1 is fabricated on TSMC’s 7nm node. NVIDIA’s Blackwell B200 is fabricated on a TSMC 4nm-class node (4NP), with reported performance improvements of up to approximately 5x over H100 on selected benchmarks (est.). A design cycle for a custom accelerator typically takes 3–5 years from initial design to production deployment. By the time a Dojo D2 or next-generation custom chip reaches production, NVIDIA may have already shipped two further generations. The risk is that Dojo invests several years of capex and R&D to arrive at a performance level the commercial GPU market has already surpassed.
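The generational-lag risk can be sketched as simple arithmetic. The model below is a rough illustration with hypothetical function names and two assumptions that are not from this article: that the commercial vendor ships a new generation roughly every two years, and that each generation brings a fixed 2x gain. Neither number is a forecast.

```python
def generations_shipped(design_cycle_years: float, vendor_cadence_years: float) -> int:
    """Whole vendor generations released while one custom chip is in development."""
    if design_cycle_years < 0 or vendor_cadence_years <= 0:
        raise ValueError("design cycle must be >= 0 and cadence must be > 0")
    return int(design_cycle_years // vendor_cadence_years)


def vendor_gain_during_cycle(design_cycle_years: float,
                             vendor_cadence_years: float,
                             gain_per_generation: float) -> float:
    """Cumulative vendor performance gain over one custom design cycle."""
    n = generations_shipped(design_cycle_years, vendor_cadence_years)
    return gain_per_generation ** n


# Assumptions (hypothetical): 2-year vendor cadence, 2x gain per generation.
CADENCE_YEARS, GAIN = 2.0, 2.0
for cycle in (3, 4, 5):  # the 3-5 year custom design cycle quoted above
    n = generations_shipped(cycle, CADENCE_YEARS)
    gain = vendor_gain_during_cycle(cycle, CADENCE_YEARS, GAIN)
    print(f"{cycle}-year cycle: {n} vendor generation(s), {gain:.0f}x cumulative vendor gain")
```

Under these assumptions, a custom chip has to beat the commercial baseline it was designed against by the cumulative factor shown just to reach parity at launch.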
3. Software ecosystem immaturity
NVIDIA’s CUDA ecosystem has more than 15 years of library development, third-party framework support, and engineering talent depth. PyTorch, TensorFlow, JAX, and virtually every major ML research framework target CUDA as the primary execution backend. Dojo requires a Tesla-custom software stack — proprietary compilers, custom libraries, and bespoke training frameworks. This creates a talent sourcing disadvantage (fewer engineers know the stack), a tooling disadvantage (fewer open-source optimizations available), and a debugging disadvantage (less community knowledge to draw from). These are solvable problems with sufficient engineering investment, but they represent real friction costs.
4. Utilization risk
The economic case for Dojo depends on sustained high utilization over a multi-year payback period. Two scenarios could push utilization below the break-even threshold. First, if FSD training needs plateau because the model reaches a performance level that is “good enough” for commercial deployment across the majority of use cases, training compute demand will not grow at the rate needed to keep Dojo fully utilized. Second, if Optimus production ramps more slowly than projected, the anticipated surge in humanoid policy training demand arrives later, leaving Dojo under-loaded in the intervening period. Cloud compute scales down gracefully to near-zero when not needed; Dojo does not.
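How a utilization plateau stretches the payback period can be illustrated with a year-by-year tally. This is a minimal sketch with hypothetical names and invented, normalized inputs: renting the full installed capacity from cloud for one year costs 1.0, running costs are 0.1 per unit delivered, and capex equals three years of full-capacity cloud rental. None of these values come from Tesla.

```python
from typing import Optional, Sequence


def payback_year(capex: float,
                 utilization_by_year: Sequence[float],
                 cloud_rate: float = 1.0,
                 opex_rate: float = 0.1) -> Optional[int]:
    """First year in which cumulative savings versus cloud cover the capex.

    Each year the owned cluster saves (cloud_rate - opex_rate) per unit of
    compute it actually delivers. Returns None if capex is never recovered
    within the supplied horizon.
    """
    cumulative_saving = 0.0
    for year, utilization in enumerate(utilization_by_year, start=1):
        if not 0.0 <= utilization <= 1.0:
            raise ValueError("utilization must be in [0, 1]")
        cumulative_saving += utilization * (cloud_rate - opex_rate)
        if cumulative_saving >= capex:
            return year
    return None


# Hypothetical 10-year utilization paths.
steady = [0.8] * 10                     # demand stays high throughout
plateau = [0.8, 0.8, 0.5] + [0.4] * 7   # training demand flattens after year 2

print("steady demand payback year: ", payback_year(3.0, steady))
print("plateau demand payback year:", payback_year(3.0, plateau))
```

With these illustrative inputs, steady 80% utilization recovers the capex in year 5, while the plateau path pushes recovery out to year 7. An equivalent cloud deployment would simply have been scaled down in the low-demand years.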
Section 5 — Dojo Implications for FSD and Optimus Timelines
The practical question for investors and observers is not whether Dojo is theoretically optimal, but whether it meaningfully changes the timelines and cost structures for Tesla’s two most important AI products.
| Milestone | Dojo contribution (est.) | Without Dojo (cloud only) |
|---|---|---|
| FSD v14 to v15 generalization leap | Enables continuous retraining on the full labeled dataset without cloud cost constraints (est.) | Technically possible but estimated 2–3x more expensive at equivalent training scale (est.) |
| Optimus task generalization (10 to 50 tasks) | Dojo capacity supports large-scale humanoid policy training at the data volumes Optimus deployment generates | Bottlenecked by cloud H100 availability and per-hour cost at the required training scale |
| Optimus 50,000-unit training support | Requires approximately 5–10 ExaPODs of continuous training capacity (est.) | Would cost an estimated $500M+ per year on cloud at equivalent compute (est.) |
| Dojo as external compute product | 2027–2028 potential window if utilization permits and capacity is available (est.) | N/A — cloud model does not create this revenue option |
FSD interpretation: The most concrete near-term benefit of Dojo for FSD is removing the cost ceiling on training data utilization. When training on cloud infrastructure, compute cost is a direct function of training hours — which creates financial pressure to limit training runs, reduce batch sizes, or sample the training data rather than training on the full corpus. At Dojo scale with fully amortized capex, the marginal cost of additional training compute approaches zero, potentially enabling more frequent model iterations and more exhaustive use of the available labeled data.
Optimus interpretation: The humanoid training implication is more speculative but potentially larger in magnitude. If Optimus achieves mass production at 50,000–100,000 units per year, each robot generates continuous interaction data that must be incorporated into policy updates. The required training compute scales with both the number of tasks being learned and the number of robots providing feedback data. At that scale, cloud economics could become prohibitive (est.), which would make Dojo’s fixed-cost structure the most plausible path to sustaining the Optimus data flywheel at the rate Tesla’s production ambitions imply.
Section 6 — About This Series
This is article 34 in the Physical AI Benchmark Series. Previous articles have covered the ramp index, the humanoid race, unit economics, global competition, HD mapping, fleet operations, software and OTA, insurance and liability, consumer demand, partnerships, competitive moats, Cybercab versus Model Y, safety data, Waymo Gen 6, Optimus manufacturing, scorecard snapshots, the 2030 forecast scenarios, the investor framework, Waymo’s city expansion pipeline, Tesla’s state approval map, AV weather and climate constraints, the talent war, the regulatory calendar, robotaxi fare pricing, the AV data flywheel comparison, the humanoid deployment tracker, the supply chain analysis, the consumer adoption demand index, and the Waymo standalone valuation and IPO analysis.
This article adds the AI training infrastructure dimension: the build-vs-buy decision at the core of Tesla’s compute strategy, the architecture and economics of Dojo versus cloud NVIDIA clusters, and the implications for FSD and Optimus training capacity over the next three to five years. The build-vs-buy question will become increasingly consequential as AI model training costs continue to scale and as humanoid robot deployment creates new categories of training data demand.
Reminder: All cost estimates, performance figures, and timeline projections in this article are estimates based on publicly available information, analyst commentary, and technical presentations. They are not investment recommendations. Conduct your own due diligence and consult a licensed financial adviser before making any investment decisions.
Sources
- Tesla Dojo supercomputer (Tesla AI)
- Tesla Dojo D1 chip architecture (Hot Chips 2021 Tesla presentation)
- NVIDIA H100/B200 cloud pricing (AWS/GCP)
- AI compute cost trends (Epoch AI research)