AI-Daily-Builder

2026-06-18

Tesla Dojo vs. Cloud Compute — The Build-vs-Buy Decision Behind FSD and Optimus Training

How Tesla's custom Dojo cluster compares to renting H100/B200 cloud compute — architecture, economics, and strategic implications for FSD and Optimus.

Article 34 in the Physical AI Benchmark Series — AI Training Infrastructure Analysis

Tesla is building one of the most ambitious custom AI training clusters in the world. Dojo — Tesla’s purpose-built supercomputer — represents a fundamental bet that owning compute infrastructure at scale is cheaper, faster, and more strategically defensible than renting it from Amazon, Google, or Microsoft. This article examines that bet in detail: what Dojo is, how it compares to renting NVIDIA H100 or B200 clusters from the major cloud providers, the build-vs-buy economics, and what Dojo means for Tesla’s long-term AI training cost structure for both FSD and Optimus.


Section 1 — Dojo Architecture Overview

Dojo is a ground-up Tesla-designed training system, not a derivative of any existing vendor architecture. The unit of composition starts at the chip level and scales through tiles, ExaPODs, and eventually multi-ExaPOD clusters.

| Component | Specification |
| --- | --- |
| Custom chip | D1 (Dojo 1): TSMC 7nm process; roughly 362 TFLOPS peak BF16/CFP8 and about 400 W per chip, per Tesla's AI Day 2021 presentation; 900 GB/s memory bandwidth (est.) |
| Training tile | 25 D1 chips on one tile; approximately 9 PFLOPS peak BF16/CFP8 and about 15 kW per tile |
| ExaPOD | 120 training tiles (3,000 D1 chips) per ExaPOD; approximately 1.1 EFLOPS peak BF16/CFP8 total; spans ten cabinets |
| Target cluster scale | Multiple ExaPODs; Tesla targeting exaFLOP-scale training capacity (1 EFLOP and above) by end-2025 / 2026 (est.) |
| Interconnect | Custom high-bandwidth D1-to-D1 direct links, which avoid the PCIe bandwidth bottleneck that limits GPU-to-GPU communication in conventional clusters |
| Primary use case | End-to-end FSD neural network training; Optimus robot policy training. NOT used for inference (inference runs on Tesla's FSD Hardware in vehicles). |
| Hybrid approach | Tesla also rents NVIDIA A100/H100 clusters from cloud providers for burst training workloads alongside Dojo |

Why the interconnect matters: Standard GPU clusters communicate chip-to-chip over PCIe or NVLink, creating bandwidth bottlenecks that limit how tightly training jobs can be parallelized. Dojo’s D1-to-D1 links are designed around the specific communication patterns of Tesla’s training workloads — primarily large video data batches for FSD perception models. The architecture trades general-purpose flexibility for optimized throughput on these specific workload types.

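Scale context: One EFLOP (exaFLOP) is 10^18 floating-point operations per second. For perspective, Frontier, the US Department of Energy supercomputer at Oak Ridge and the world's first publicly confirmed exascale computer, operates at this scale. The comparison is loose, however: Frontier's exascale figure is measured in 64-bit (FP64) arithmetic, while Dojo's figures refer to lower-precision training formats such as BF16, so the two numbers are not directly comparable. Tesla is attempting to reach exascale training capacity using custom silicon rather than off-the-shelf hardware.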
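
The chip-to-tile-to-ExaPOD composition described in this section reduces to simple multiplication. The sketch below is a minimal illustration of that roll-up, not Tesla's own sizing method. The per-chip throughput is an input; the example value of 362 TFLOPS is an assumption taken from the peak BF16/CFP8 figure Tesla presented at AI Day 2021. The function and constant names are hypothetical, and peak figures overstate sustained training throughput.

```python
# Peak-throughput roll-up for a chip -> tile -> ExaPOD hierarchy.
# Counts come from the architecture table; the per-chip figure is an assumption.

CHIPS_PER_TILE = 25      # D1 chips on one training tile
TILES_PER_EXAPOD = 120   # training tiles in one ExaPOD


def exapod_peak_pflops(chip_tflops: float) -> float:
    """Theoretical peak PFLOPS of one ExaPOD for a given per-chip TFLOPS."""
    chips = CHIPS_PER_TILE * TILES_PER_EXAPOD
    return chips * chip_tflops / 1_000  # 1 PFLOPS = 1,000 TFLOPS


def exapods_for_target(target_eflops: float, chip_tflops: float) -> float:
    """ExaPODs needed to reach a peak target, ignoring utilization losses."""
    return target_eflops * 1_000 / exapod_peak_pflops(chip_tflops)


# Assumption: peak BF16/CFP8 per D1 chip as presented at AI Day 2021.
ASSUMED_CHIP_TFLOPS = 362.0

chips_per_exapod = CHIPS_PER_TILE * TILES_PER_EXAPOD
peak = exapod_peak_pflops(ASSUMED_CHIP_TFLOPS)
print(f"chips per ExaPOD: {chips_per_exapod}")
print(f"peak per ExaPOD:  {peak:.0f} PFLOPS")
print(f"ExaPODs for 1 EFLOP peak: {exapods_for_target(1.0, ASSUMED_CHIP_TFLOPS):.2f}")
```

Swapping in a different per-chip figure shows how sensitive the cluster-level number is to the precision format and to whether peak or sustained throughput is quoted.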


Section 2 — Build vs. Buy Economics

The financial comparison between Dojo and cloud NVIDIA compute is not straightforward. The outcome depends heavily on utilization rates, time horizon, and which cost components are included. All figures below are estimates based on publicly available data and industry analysis.

| Metric | Tesla Dojo (build) | Cloud NVIDIA H100 (buy) | Notes |
| --- | --- | --- | --- |
| Capital cost per ExaPOD (est.) | $300M–$500M+ (est.) | $0 upfront | Dojo requires massive capex investment; cloud converts capex to opex |
| Operational cost per PFLOP-day (est.) | $0.05–$0.15 (est., at scale) | $0.50–$2.00 (cloud spot/on-demand, est.) | Dojo cost advantage materializes only at high utilization |
| Break-even utilization (est.) | 60–80% (est.) | N/A | Below this threshold, cloud is cheaper per unit of compute delivered |
| Flexibility | Low: fixed architecture, difficult to upgrade mid-generation | High: rent latest NVIDIA silicon (B200/Blackwell) within days of availability | |
| Latency to new hardware | 3–5 years per chip generation | Days: cloud adds newest NVIDIA silicon as it becomes available | |
| Data security | High: Tesla training data never leaves Tesla-controlled infrastructure | Medium: cloud providers offer contractual protections, but data traverses shared infrastructure | |
| Vendor risk | Tesla-controlled: no dependency on NVIDIA pricing or availability for training capacity | Exposed to NVIDIA pricing power and hardware availability cycles | |
| Break-even point (est.) | 4–6 years of heavy utilization (est.) | N/A: pay-as-you-go model with no fixed payback period | |

Reading the economics: The key variable is utilization. At 80%+ utilization sustained over four or more years, Dojo’s per-PFLOP cost falls well below cloud rates. At 40% utilization or below, the amortized capex cost per unit of compute delivered likely exceeds what Tesla would pay renting H100 clusters on demand. This makes Dojo’s economics inherently tied to Tesla’s ability to generate training workloads at scale — which is itself tied to FSD rollout speed, Optimus production volume, and the continued growth of Tesla’s labeled driving data corpus.
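
The utilization argument above can be sketched as a small model. This is a minimal illustration under stated assumptions, not a reconstruction of Tesla's actual costs: every number in the example is hypothetical, the compute unit is an arbitrary "unit of delivered compute", and the function names are invented for this sketch. It shows why the owned-hardware cost per delivered unit scales inversely with utilization while the cloud price per unit stays flat.

```python
# Break-even utilization for owned versus rented compute.
# All inputs are hypothetical placeholders chosen only to show the shape.


def annual_owned_cost(capex: float, lifetime_years: float, annual_opex: float) -> float:
    """Straight-line amortized capex plus yearly operating cost."""
    return capex / lifetime_years + annual_opex


def owned_cost_per_unit(capex, lifetime_years, annual_opex,
                        capacity_units_per_year, utilization):
    """Cost per unit of compute actually delivered at a given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    yearly = annual_owned_cost(capex, lifetime_years, annual_opex)
    return yearly / (capacity_units_per_year * utilization)


def break_even_utilization(capex, lifetime_years, annual_opex,
                           capacity_units_per_year, cloud_price_per_unit):
    """Utilization at which owned cost per unit equals the cloud price."""
    yearly = annual_owned_cost(capex, lifetime_years, annual_opex)
    return yearly / (capacity_units_per_year * cloud_price_per_unit)


# Hypothetical scenario: $400M capex over 5 years, $20M/year opex,
# 1,000,000 compute units per year at full load, cloud price $150 per unit.
args = dict(capex=400e6, lifetime_years=5, annual_opex=20e6,
            capacity_units_per_year=1_000_000)
CLOUD_PRICE = 150.0

be = break_even_utilization(**args, cloud_price_per_unit=CLOUD_PRICE)
print(f"break-even utilization: {be:.1%}")
for u in (0.4, 0.8):
    cost = owned_cost_per_unit(**args, utilization=u)
    verdict = "cheaper" if cost < CLOUD_PRICE else "more expensive"
    print(f"utilization {u:.0%}: owned ${cost:.0f}/unit, {verdict} than cloud")
```

With these placeholder inputs the break-even lands inside the 60–80% band quoted in the table, but that is a consequence of the chosen numbers rather than evidence for the band.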

The cloud pricing comparison above reflects H100 on-demand and spot rates observed in 2024–2025. NVIDIA Blackwell (B200) cloud availability is expanding through 2026 and may shift the comparison further, since B200 performance-per-dollar significantly exceeds H100 on current benchmarks.


Section 3 — The Strategic Case for Dojo

Tesla's stated rationale for Dojo goes beyond raw cost economics. Five strategic arguments are particularly compelling.

1. Data security and IP protection

Tesla’s FSD training data — billions of miles of labeled driving video from the global Tesla fleet — is among the most competitively sensitive proprietary datasets in the technology sector. Routing this data through cloud providers introduces IP and competitive intelligence risk, even under contractual NDAs. Training entirely on owned infrastructure eliminates this surface. For a company whose AI moat is fundamentally a data moat, this is not a trivial concern.

2. Custom silicon optimization

NVIDIA GPUs are designed to be general-purpose accelerators across a wide range of workloads. Dojo’s D1 chip is designed specifically for Tesla’s training workload profile: high-throughput video data ingestion, end-to-end neural network training on camera inputs, and large-scale data-parallel training jobs. Custom silicon optimized for a specific workload type can achieve 2–5x better performance-per-watt compared to general-purpose accelerators on that targeted workload (est.) — though this advantage is narrow and does not generalize beyond the intended use case.

3. Vendor independence and supply security

The NVIDIA H100 shortage of 2023–2024 demonstrated the risk of depending on a single-vendor supply chain for critical AI infrastructure. During the shortage period, cloud spot pricing for H100 instances surged 3–5x (est.) relative to pre-shortage baselines. Companies with prior access agreements maintained compute access; those without faced training delays. Dojo provides Tesla with guaranteed compute capacity that scales with Tesla’s own production capacity rather than NVIDIA’s supply allocation decisions.

4. Optimus data flywheel lock-in

As Optimus scales from prototype to mass production, it generates an entirely new category of training data: humanoid robot interaction data, manipulation task demonstrations, and policy feedback signals. Training increasingly capable humanoid policies requires continuous compute at scale. If Optimus reaches 50,000+ units deployed, the data generation rate and associated training compute demand could exceed what FSD training currently requires. Owning the compute layer means Optimus training costs are a function of Tesla’s own silicon economics, not an external vendor’s pricing structure.

5. Potential external revenue stream

Tesla has publicly referenced the possibility of offering Dojo compute capacity as a service to external AI companies. If Dojo reaches exaFLOP-scale and Tesla’s own utilization leaves headroom, selling access to spare capacity represents a new revenue stream in a market where compute scarcity is persistent. This option has no value if Dojo remains underutilized — but at high utilization with overflow demand, it becomes a real business.


Section 4 — The Case Against Dojo (Bear Thesis)

The strategic arguments for Dojo are real, but so are the counterarguments. Four bear-case concerns are worth taking seriously.

1. Opportunity cost of capex

Every dollar deployed in Dojo capex ($300M–$500M+ per ExaPOD, est.) could alternatively fund access to 5–10x more NVIDIA H100 or B200 compute in the short term, because cloud converts capex to opex and the cloud providers achieve economies of scale in hardware procurement that Tesla cannot match at comparable volume. If training velocity — iterations per unit of time — matters more than long-run cost efficiency, cloud may generate faster FSD improvement even at a higher cost-per-PFLOP.

2. Architecture obsolescence

Dojo D1 is fabricated on TSMC's 7nm node. NVIDIA's Blackwell B200 is fabricated on a TSMC 4nm-class node (4NP), with NVIDIA-claimed performance gains of up to roughly 5x over H100 on selected workloads (est.). A design cycle for a custom accelerator typically takes 3–5 years from architecture definition to production deployment. By the time a Dojo D2 or next-generation custom chip reaches production, NVIDIA may have already shipped two further generations. The risk is that Dojo invests several years of capex and R&D to arrive at a performance level the commercial GPU market has already surpassed.
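
The obsolescence risk is a compounding effect, which a few lines of arithmetic make visible. The sketch below assumes a fixed vendor release cadence and a fixed per-generation speedup; both are illustrative assumptions, not figures from this article or from NVIDIA, and the function names are hypothetical.

```python
# How far a commercial vendor can move during one custom-chip design cycle.
# Cadence and per-generation gain are illustrative assumptions.
import math


def generations_shipped(design_cycle_years: float, vendor_cadence_years: float) -> int:
    """Whole vendor generations released within one custom design cycle."""
    return math.floor(design_cycle_years / vendor_cadence_years)


def vendor_gain_during_cycle(design_cycle_years: float,
                             vendor_cadence_years: float,
                             gain_per_generation: float) -> float:
    """Compounded vendor speedup accumulated over the design cycle."""
    n = generations_shipped(design_cycle_years, vendor_cadence_years)
    return gain_per_generation ** n


# Assumed: 4-year custom cycle, new vendor generation every 2 years,
# each generation 2x faster than the last.
n_gens = generations_shipped(4, 2)
gain = vendor_gain_during_cycle(4, 2, 2.0)
print(f"vendor generations shipped: {n_gens}")
print(f"compounded vendor gain:     {gain:.1f}x")
```

Under these assumptions a custom chip must be specified to beat a target roughly 4x ahead of the commercial part available when its design starts, just to reach parity at launch.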

3. Software ecosystem immaturity

NVIDIA’s CUDA ecosystem has more than 15 years of library development, third-party framework support, and engineering talent depth. PyTorch, TensorFlow, JAX, and virtually every major ML research framework target CUDA as the primary execution backend. Dojo requires a Tesla-custom software stack — proprietary compilers, custom libraries, and bespoke training frameworks. This creates a talent sourcing disadvantage (fewer engineers know the stack), a tooling disadvantage (fewer open-source optimizations available), and a debugging disadvantage (less community knowledge to draw from). These are solvable problems with sufficient engineering investment, but they represent real friction costs.

4. Utilization risk

The economic case for Dojo depends on sustained high utilization over a multi-year payback period. Two scenarios could compress utilization below the break-even threshold: First, if FSD training needs plateau because the model reaches a performance level that is “good enough” for commercial deployment across the majority of use cases, the training compute demand does not grow at the rate needed to keep Dojo fully utilized. Second, if Optimus production ramps more slowly than projected, the anticipated surge in humanoid policy training demand arrives later, leaving Dojo under-loaded in the intervening period. Cloud compute scales down gracefully to near-zero when not needed; Dojo does not.
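
The two utilization scenarios can be compared by tracking cumulative spend year by year. The following is a minimal sketch with hypothetical inputs: owned hardware pays its capex up front plus a fixed yearly opex regardless of load, while cloud spend tracks the utilization actually needed. The ramp profiles, dollar figures, and function names are all assumptions made for illustration.

```python
# Cumulative owned versus cloud spend under different demand ramps.
# All dollar figures and ramp profiles are hypothetical.


def cumulative_costs(capex, annual_opex, cloud_cost_at_full_load, utilization_by_year):
    """Return (year, owned_total, cloud_total) tuples for each year of the ramp."""
    owned, cloud = float(capex), 0.0
    rows = []
    for year, utilization in enumerate(utilization_by_year, start=1):
        owned += annual_opex                            # fixed, load-independent
        cloud += cloud_cost_at_full_load * utilization  # scales with demand
        rows.append((year, owned, cloud))
    return rows


def crossover_year(rows):
    """First year in which cumulative cloud spend reaches owned spend, else None."""
    for year, owned, cloud in rows:
        if cloud >= owned:
            return year
    return None


CAPEX, OPEX, CLOUD_FULL = 400e6, 20e6, 150e6  # hypothetical

fast_ramp = [0.5, 0.8, 0.9, 0.9, 0.9, 0.9]  # demand arrives quickly
slow_ramp = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]  # demand arrives late

fast = crossover_year(cumulative_costs(CAPEX, OPEX, CLOUD_FULL, fast_ramp))
slow = crossover_year(cumulative_costs(CAPEX, OPEX, CLOUD_FULL, slow_ramp))
print(f"fast ramp pays back in year: {fast}")
print(f"slow ramp pays back in year: {slow}")
```

In this toy scenario the fast ramp pays back within the 4–6 year window, while the slow ramp never recovers the capex inside six years, which is the asymmetry the bear case points to.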


Section 5 — Dojo Implications for FSD and Optimus Timelines

The practical question for investors and observers is not whether Dojo is theoretically optimal, but whether it meaningfully changes the timelines and cost structures for Tesla’s two most important AI products.

| Milestone | Dojo contribution (est.) | Without Dojo (cloud only) |
| --- | --- | --- |
| FSD v14 to v15 generalization leap | Enables continuous retraining on the full labeled dataset without cloud cost constraints (est.) | Technically possible but estimated 2–3x more expensive at equivalent training scale (est.) |
| Optimus task generalization (10 to 50 tasks) | Dojo capacity supports large-scale humanoid policy training at the data volumes Optimus deployment generates | Bottlenecked by cloud H100 availability and per-hour cost at the required training scale |
| Optimus 50,000-unit training support | Requires approximately 5–10 ExaPODs of continuous training capacity (est.) | Would cost an estimated $500M+ per year on cloud at equivalent compute (est.) |
| Dojo as external compute product | 2027–2028 potential window if utilization permits and capacity is available (est.) | N/A: cloud model does not create this revenue option |

FSD interpretation: The most concrete near-term benefit of Dojo for FSD is removing the cost ceiling on training data utilization. On cloud infrastructure, compute cost is a direct function of training hours, which creates financial pressure to limit training runs, reduce batch sizes, or sample the training data rather than train on the full corpus. On owned hardware the capex is already committed, so the marginal cost of additional training compute falls to roughly the cost of power, cooling, and operations, potentially enabling more frequent model iterations and more exhaustive use of the available labeled data.
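
The marginal-cost argument above can be made concrete with a rough comparison of what one extra training run costs in each model. This is a sketch under loud assumptions: the power draw, electricity price, PUE, accelerator count, and hourly rate are all hypothetical, the two setups are not claimed to deliver equal compute, and the function names are invented here.

```python
# Marginal cost of one additional training run: owned hardware versus cloud.
# Every input value below is a hypothetical placeholder.


def owned_marginal_cost(power_kw: float, hours: float,
                        price_per_kwh: float, pue: float = 1.0) -> float:
    """Energy cost of a run on already-purchased hardware.

    pue (power usage effectiveness) scales IT power up to facility power.
    """
    return power_kw * pue * hours * price_per_kwh


def cloud_marginal_cost(accelerators: int, hours: float,
                        rate_per_accelerator_hour: float) -> float:
    """Rental cost of the same wall-clock run on cloud accelerators."""
    return accelerators * hours * rate_per_accelerator_hour


# Hypothetical 24-hour run.
owned = owned_marginal_cost(power_kw=1_500, hours=24, price_per_kwh=0.08, pue=1.2)
cloud = cloud_marginal_cost(accelerators=1_000, hours=24, rate_per_accelerator_hour=3.0)
print(f"owned marginal cost: ${owned:,.0f}")
print(f"cloud marginal cost: ${cloud:,.0f}")
```

The owned figure is small but not zero, and it excludes the amortized capex, which is exactly why the comparison only favors ownership once the hardware is kept busy.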

Optimus interpretation: The humanoid training implication is more speculative but potentially larger in magnitude. If Optimus achieves mass production at 50,000–100,000 units per year, each robot generates continuous interaction data that must be incorporated into policy updates. The required training compute scales with both the number of tasks being learned and the number of robots providing feedback data. At that scale, cloud economics become genuinely prohibitive — which makes Dojo’s fixed-cost structure the only viable path to sustaining the Optimus data flywheel at the rate Tesla’s production ambitions imply.


Section 6 — About This Series

This is article 34 in the Physical AI Benchmark Series. Previous articles have covered the ramp index, the humanoid race, unit economics, global competition, HD mapping, fleet operations, software and OTA, insurance and liability, consumer demand, partnerships, competitive moats, Cybercab versus Model Y, safety data, Waymo Gen 6, Optimus manufacturing, scorecard snapshots, the 2030 forecast scenarios, the investor framework, Waymo’s city expansion pipeline, Tesla’s state approval map, AV weather and climate constraints, the talent war, the regulatory calendar, robotaxi fare pricing, the AV data flywheel comparison, the humanoid deployment tracker, the supply chain analysis, the consumer adoption demand index, and the Waymo standalone valuation and IPO analysis.

This article adds the AI training infrastructure dimension: the build-vs-buy decision at the core of Tesla’s compute strategy, the architecture and economics of Dojo versus cloud NVIDIA clusters, and the implications for FSD and Optimus training capacity over the next three to five years. The build-vs-buy question will become increasingly consequential as AI model training costs continue to scale and as humanoid robot deployment creates new categories of training data demand.

Reminder: All cost estimates, performance figures, and timeline projections in this article are estimates based on publicly available information, analyst commentary, and technical presentations. They are not investment recommendations. Conduct your own due diligence and consult a licensed financial adviser before making any investment decisions.

