GPU Failure Rates and the Vocabulary Problem

March 21, 2026


The number is always wrong in the same way. “20% of GPUs fail.” “I heard it’s 30%.” The number is always attached to the same two words: “burn-in” and “failure rate.” And the number is usually not wrong: it’s just been detached from the stage it actually describes and reattached to the only vocabulary the person has.

Burn-in failure rates are 3–8%. Total manufacturing attrition is 15–25%. In-service failure runs ~9% per year. Those are three different numbers measuring three different things. But if the only terms you know are “burn-in” and “failure rate,” they all collapse into one sentence: “I heard 20% of GPUs fail during burn-in.”

Here is the actual waterfall.


There Are Four Stages, Not One

A datacenter GPU passes through four sequential manufacturing stages before it ships. Losses are multiplicative: a module must survive every stage.

Stage              | What happens                                    | Typical loss
-------------------|-------------------------------------------------|-------------
Die yield          | Wafer fabrication (TSMC N4/N5)                  | 5–15%
Packaging yield    | CoWoS-L assembly (chiplets + interposer + HBM)  | 5–15%
Burn-in            | Thermal/voltage stress screening                | 2–8%
System-level test  | Full module functional validation               | 1–3%

The “typical loss” ranges above are industry-wide figures for complex multi-chip modules, drawn from semiconductor industry reporting and test house disclosures. They are not NVIDIA-specific published numbers. Published per-stage yield data for Blackwell does not exist. NVIDIA acknowledged shipping “low-yielding Blackwell material” in its Q3 FY2025 earnings commentary, and SemiAnalysis has reported on CTE mismatch issues during the CoWoS-L ramp, but neither source gives a clean per-stage breakdown.

What we can do is show what the math looks like at different points in the range:

  • Optimistic (mature production): 90% × 90% × 95% × 98% = 75.4% yield (24.6% total attrition)
  • Pessimistic (early ramp): 85% × 82% × 92% × 96% = 61.6% yield (38.4% total attrition)

These are illustrative, not measured. The point is structural: even with moderate per-stage losses, the compound effect produces total attrition in the 20–40% range. That’s where the “20% failure rate” comes from. It’s total attrition across the full waterfall, not burn-in alone. Burn-in accounts for roughly 3–8 percentage points of that total. The person who told your colleague “20% of GPUs fail” was right about the number and wrong about which stage they were describing.
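
A few lines of Python make the compounding explicit. The per-stage numbers are the illustrative scenarios above, not published NVIDIA figures:

```python
# Illustrative yield waterfall for a multi-chip GPU module.
# Per-stage yields are hypothetical scenario values, not measured data.
from math import prod

scenarios = {
    "optimistic (mature production)": [0.90, 0.90, 0.95, 0.98],  # die, packaging, burn-in, system test
    "pessimistic (early ramp)":       [0.85, 0.82, 0.92, 0.96],
}

for name, stage_yields in scenarios.items():
    total_yield = prod(stage_yields)  # losses compound multiplicatively
    print(f"{name}: yield {total_yield:.1%}, attrition {1 - total_yield:.1%}")

# optimistic (mature production): yield 75.4%, attrition 24.6%
# pessimistic (early ramp): yield 61.6%, attrition 38.4%
```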


Blackwell Should Be Structurally Harder

Three factors compound relative to prior generations:

Power density. Blackwell GPUs draw 700–1,000W per device. Burn-in chambers must deliver and dissipate this power at elevated junction temperatures (typically 125°C+). KYEC, which handles over 90% of NVIDIA’s AI chip testing, had to upgrade its burn-in ovens from 600W to 1,000W capacity just to handle Blackwell.

Dual-die architecture. The B200 uses two compute dies connected via NVLink-C2C on a CoWoS-L interposer. A failure in either die, the silicon bridge, or the interposer kills the entire module. The failure surface is larger than that of a monolithic chip.

HBM stacking complexity. Each B200 module includes 8 stacks of HBM3e memory, with 8–12 DRAM dies per stack. A single defective die can kill an entire stack, and a failed stack can render the module non-functional.
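
A toy probability model shows why the stacking math bites. The per-die survival rates below are assumptions chosen for illustration, not vendor data, and real HBM stacks include repair and redundancy mechanisms that soften this:

```python
# Toy model: module survival with two compute dies and 8 HBM3e stacks.
# All probabilities are illustrative assumptions, not vendor data, and
# the model ignores the repair/redundancy real HBM stacks have.
p_die_ok = 0.999        # assumed survival probability of one DRAM die
dies_per_stack = 12     # upper end of the 8-12 dies per stack cited above
stacks = 8
p_compute_ok = 0.99     # assumed survival probability of one compute die
compute_dies = 2

p_stack_ok = p_die_ok ** dies_per_stack    # one bad die kills the stack
p_memory_ok = p_stack_ok ** stacks         # one bad stack kills the module
p_module_ok = (p_compute_ok ** compute_dies) * p_memory_ok

print(f"P(stack survives)  = {p_stack_ok:.3f}")   # ~0.988
print(f"P(memory survives) = {p_memory_ok:.3f}")  # ~0.908
print(f"P(module survives) = {p_module_ok:.3f}")  # ~0.890
```

Even 99.9% per-die reliability erodes to roughly 89% module survival once you multiply across 96+ DRAM dies and two compute dies, which is exactly why redundancy and repair exist at every layer.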

According to SemiAnalysis reporting (not NVIDIA disclosure), a CTE mismatch between the GPU chiplets, silicon bridges, interposer, and substrate caused warping and failures under thermal cycling at 1,000W during the initial Blackwell ramp. NVIDIA addressed this by redesigning the top routing metal layers and adjusting the bump geometry. The B200 chip yield is now estimated at 90–95%, though not yet at TSMC’s internal targets.


The Other Number: In-Service Failure

Then there’s the separate question of what happens after the GPU ships and goes into production:

Metric                               | Source                                  | Value
-------------------------------------|-----------------------------------------|-----------
Annualized GPU failure rate          | Meta (16K H100 cluster, 54-day window)  | ~9%
Cluster MTTF (16K GPUs)              | Meta                                    | 1.8 hours
Cluster MTTF (131K GPUs)             | Meta                                    | 14 minutes
GPU share of unforeseen disruptions  | Meta (Llama 3 405B training)            | 30.1%
HBM memory share of disruptions      | Meta (Llama 3 405B training)            | 17.2%

The Meta numbers come from the Llama 3 technical report, Table 5: a 54-day snapshot of 16,384 H100 GPUs with 419 unexpected disruptions.
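
As a back-of-envelope check, the ~9% figure is roughly recoverable from those counts, if you assume (my assumption, not Meta’s framing) that the GPU and HBM shares apply uniformly to all 419 disruptions:

```python
# Back-of-envelope: annualizing Meta's 54-day disruption data.
# Assumes the GPU (30.1%) and HBM (17.2%) shares apply uniformly to
# all 419 unexpected disruptions; Meta does not publish it this way.
gpus = 16_384
window_days = 54
disruptions = 419
gpu_share = 0.301 + 0.172               # GPU silicon + HBM memory

gpu_failures = disruptions * gpu_share  # ~198 in the 54-day window
annualized = gpu_failures * (365 / window_days) / gpus
print(f"annualized per-GPU failure rate ~ {annualized:.1%}")  # ~8.2%
```

That lands at ~8.2%, in the same ballpark as the ~9% in the table; the residual depends on exactly which disruption categories you count as a GPU failure.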

At a 9% annualized failure rate, datacenter GPUs behave more like consumables than traditional capital equipment. Burn-in screens out infant mortality so that surviving units enter the flat portion of the reliability bathtub curve. It does not extend the operational lifespan of units that pass. These are separate cost layers.

This is where someone might hear “30% of GPUs fail” and not be wrong, just confused about what they’re adding together. They may be compounding in-service failure over multiple years, or they may have taken total manufacturing attrition and added the in-service failure rate on top of it. Stack manufacturing attrition on top of a few years at ~9% annualized, and the loss from wafer start to end of service climbs past 40%.
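
A minimal sketch of that stacking, assuming independent annual failures and a four-year service life (both assumptions for illustration, not sourced figures):

```python
# Illustrative wafer-to-end-of-service survival, combining the
# manufacturing yield and in-service failure numbers used above.
manufacturing_yield = 0.75   # mid-range total manufacturing yield
annual_failure_rate = 0.09   # ~9% annualized in-service failure
service_years = 4            # assumed service life

in_service_survival = (1 - annual_failure_rate) ** service_years
lifetime_survival = manufacturing_yield * in_service_survival
print(f"in-service survival after {service_years}y: {in_service_survival:.1%}")  # 68.6%
print(f"wafer-to-end-of-service survival: {lifetime_survival:.1%}")              # 51.4%
# i.e. lifetime loss ~ 48.6%, consistent with the 40-50% figure below
```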


Who Pays for the Dead GPUs

Manufacturing attrition is not absorbed by NVIDIA: it is priced into every unit sold. If total yield is Y, the effective cost per shippable GPU is manufacturing cost divided by Y. NVIDIA prices to maintain ~76% gross margins after yield losses. Every GPU that fails burn-in, packaging, or wafer test raises the floor price of every GPU that ships. The buyer pays for the dead units. They just never see them.

At current B200 pricing (~$40,000–50,000 per GPU, based on OEM and analyst estimates as of early 2026):

Total yield | Cost multiplier | Yield cost embedded per GPU
------------|-----------------|----------------------------
90%         | 1.11×           | $4,400–5,600
80%         | 1.25×           | $10,000–12,500
75%         | 1.33×           | $13,300–16,700
70%         | 1.43×           | $17,100–21,400

At 75% total yield, roughly $13,000–17,000 of every B200’s price is paying for units that failed manufacturing. For a 10,000-GPU cluster at $45,000 per unit, that is $130–170 million in embedded yield cost.
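
The table reduces to one line of arithmetic. A minimal sketch, treating the quoted price range as the cost basis (the same simplification the table makes):

```python
# Reproducing the embedded-yield-cost table: cost multiplier is 1/Y,
# and embedded cost = price * (1/Y - 1). Treating price as the cost
# basis is a simplification; real margin structure is more complicated.
price_low, price_high = 40_000, 50_000

for y in (0.90, 0.80, 0.75, 0.70):
    multiplier = 1 / y
    embedded_low = price_low * (multiplier - 1)
    embedded_high = price_high * (multiplier - 1)
    print(f"yield {y:.0%}: {multiplier:.2f}x, "
          f"embedded ${embedded_low:,.0f}-${embedded_high:,.0f}")

# yield 90%: 1.11x, embedded $4,444-$5,556   (the table rounds these)
# yield 75%: 1.33x, embedded $13,333-$16,667
# Improving yield from 75% to 90% cuts the cost basis by
# 1 - 0.75/0.90 ~ 16.7%, the ~17% cited below.
```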

B200 pricing sources: Northflank (2025), gpu.fm (2026), Epoch AI cost breakdown (2025). NVIDIA does not publish retail GPU pricing. These are OEM and analyst estimates.

If yield improves from 75% to 90%, the manufacturing cost basis drops by ~17%. Whether NVIDIA passes that through as lower prices or captures it as margin expansion depends on competitive dynamics and demand.


The Short Version

When someone says “20% of GPUs fail,” ask them: which stage?

  • Manufacturing attrition (wafer to shippable module): 15–25% mature, 25–40% early ramp. This is the number they heard.
  • Burn-in specifically: 3–6% mature, 5–10% early ramp. Real, but much smaller than the total.
  • In-service failure: ~9% annualized. A separate cost layer.
  • Lifetime loss (manufacturing + multi-year operation): can exceed 40–50% from wafer start to end of service.

The numbers are all real. The confusion is which number applies to which stage. Now you know.

One response to “GPU Failure Rates and the Vocabulary Problem”

  1. Good post.

Check my whitepapers on the breakdown of problems at large scale and on estimating the MTBF for cluster runs.

https://system-stack.com/news/whitepapers (“taxonomy of errors”)

I don’t remember seeing the 131K cluster run with an MTBF of 14 min, yet I am not surprised. And that is really worrisome.

    See the other whitepaper “Handling the scalability wall”.

If a cluster run fails every 15 min and your recovery takes more than 15 min (drain, resume from checkpoint, reach the same iteration/epoch where it failed), then your goodput is below 50%. In other words, the cluster is only useful half the time. Then compute efficiency kicks in at 70–80%.
So your overall return is about 35% (50% goodput × 70% parallel efficiency).
At close to GW scale, the energy waste is pretty insane.
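
The comment’s arithmetic in a few lines, using a simple useful-time over total-time model with the comment’s own illustrative numbers:

```python
# Goodput under the comment's simple failure/recovery model.
# All numbers are the comment's illustrative ones, not measured values.
mttf_minutes = 15           # cluster run fails every ~15 minutes
recovery_minutes = 15       # drain + restore checkpoint + re-reach iteration
parallel_efficiency = 0.70  # low end of the 70-80% cited

goodput = mttf_minutes / (mttf_minutes + recovery_minutes)  # 0.50
overall_return = goodput * parallel_efficiency              # 0.35
print(f"goodput {goodput:.0%}, overall return {overall_return:.0%}")
```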
