FOMO is why enterprises pay for GPUs they don’t use — and why prices keep climbing

Enterprises can’t fix their GPU waste problem because the forces behind the waste also block the fix. Releasing idle capacity would improve utilization, but the same shortage driving GPU prices up is exactly why no team will give capacity back. So the fleet sits at roughly 5% utilization, billed by the hour, and the cycle tightens.

That pressure — repeated across thousands of enterprises over the past two years — is the reason most companies are now running their GPU fleets at roughly 5% utilization, according to Cast AI’s 2026 State of Kubernetes Optimization Report, which measured actual production clusters rather than surveying them. It’s also the reason nobody releases the idle capacity. Cast AI co-founder and President Laurent Gil has been tracking the dynamic for two years. “Many of the neoclouds are not cloud,” he told VentureBeat. “They are neo-real estate.”

Five percent is about six times worse than a no-effort baseline. Gil puts that baseline at around 30%: the level a reasonably run, human-managed fleet reaches once you factor in day cycles, weekends and normal business patterns. Five percent means enterprises are running their most expensive infrastructure line at a fraction of what doing nothing intentional would yield. And it lands at the same moment cloud compute pricing has broken its 20-year pattern.

AWS quietly raised its reserved H200 GPU prices by roughly 15% on a Saturday in January, with no formal announcement. Memory suppliers pushed HBM3e prices up 20% for 2026. It is the first time since AWS launched EC2 in 2006 that a hyperscaler has meaningfully raised reserved GPU pricing rather than cut it. For now, the assumption under most enterprise AI budgets — that cloud compute gets cheaper every year — no longer holds at the top of the stack.

The cloud market has split in two

The pricing move matters less for what it is than for what it signals about where the shortage actually bites. Cloud compute has split into two layers. At the commodity layer, the old deflation still works. H100 on-demand pricing has fallen from roughly $7.57 per GPU-hour in September 2025 to around $3.93 today, with Lambda Labs and RunPod listing H100s under $3 and older A100s around $1.92. Nvidia T4 chips, once impossible to find on the spot market, now survive a 24-hour run there with better than 90% probability in several AWS regions.

At the frontier layer, it’s reversed. Nvidia received orders for 2 million H200 chips for 2026 against 700,000 in inventory. TSMC’s advanced packaging, which gates every HBM-equipped GPU, is booked through at least mid-2027. AMD has warned of its own 2026 price hikes, citing the same crunch. Even A100 pricing, expected to soften as three-year reservations from 2023 expired, has started creeping back up. Gil’s read: FOMO is now spilling into older generations. Which layer an enterprise’s workloads sit on determines exposure.

Why 5%? Part one: the procurement loop

How does fleet utilization get to 5% when GPUs are this expensive? Gil’s account of enterprise GPU procurement is the clearest explanation I have heard.

An enterprise needs GPUs. It joins a hyperscaler waitlist. Nothing happens for weeks, sometimes months. Then a phone call: “You asked for 48, I have 36. Yours if you want them, but only on a one-year or three-year commitment, and three years is cheaper. If you don’t want them, five other companies on the list will take them.” The fear of losing allocation is acute. The commitment gets signed. Whether the workloads will consume that many GPUs, or whether that chip generation fits what will run on them, is not the operative question at the moment. The operative question is whether to say yes or lose the slot.

Once secured, those GPUs become too painful to release. Reacquiring them would take months, and nobody wants to be the team that gave capacity back and couldn’t get it. So the fleet sits, billed by the hour, whether it is used or not. Gil described enterprises paying on-demand rates, roughly three times more expensive than one-year reservations, because even the premium felt safer than risking release.

This is the paradox at the center of the 5% number. The obvious way to improve utilization is to release the GPUs you are not using. But the very shortage that makes those GPUs expensive is also the reason nobody releases them. So the fleet stays over-provisioned, the shortage persists, prices rise, and the FOMO that started the cycle gets reinforced. Every turn of the loop makes the next exit harder.

Forrester’s data corroborates the dynamic from a different angle. Principal analyst Tracy Woo found practitioners self-estimating Kubernetes waste at around 60%, close to what Cast AI measures directly. A widely observed pattern in Kubernetes practice explains the dynamic: engineers routinely request five to ten times the resources they actually use, because the cost of under-provisioning is visible (a pager goes off) and the cost of over-provisioning is invisible (one line on a cloud bill no engineer sees).

Why 5%? Part two: the architecture loop

Fixing procurement alone would not get the number to a good place, because the GPUs enterprises already hold are also wasteful on the inside. And the architecture half of the story is being diagnosed independently by teams that compete with Cast AI.

Anyscale, the company behind the Ray framework, published its own analysis on January 21 arguing that modern AI workloads routinely sit below 50% GPU utilization even when fleet size is exactly right, because of how the workloads are containerized. A single AI job moves through CPU-heavy stages (loading data, preprocessing), GPU-heavy stages (training or inference), and back to CPU. When all of that runs in one container, the GPU is allocated for the entire lifecycle but doing useful work for a fraction of it.
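
A minimal sketch shows the alternative Anyscale is arguing for, written with Ray since that is the framework in question. The function bodies and resource numbers are placeholders, not taken from the analysis; the point is only that the GPU is reserved per stage, not per job.

```python
# Sketch of stage-level disaggregation: CPU-bound stages run as ordinary Ray tasks,
# so the GPU is only held for the stage that actually needs it.
# Function bodies and resource numbers are illustrative, not from Anyscale's report.
import ray

ray.init()

@ray.remote  # CPU-only task: no GPU is reserved while data is loaded and preprocessed
def preprocess(shard: list[str]) -> list[str]:
    return [s.strip().lower() for s in shard]

@ray.remote(num_gpus=1)  # the GPU is allocated only for the duration of this task
def gpu_inference(batch: list[str]) -> list[int]:
    # placeholder for the model call; running this requires a GPU node in the cluster
    return [len(s) for s in batch]

shards = [["  Hello", "World  "], ["GPU ", " idle no more"]]
prepped = [preprocess.remote(s) for s in shards]                # scales on CPU nodes
results = ray.get([gpu_inference.remote(p) for p in prepped])   # GPU held per batch, not per job
print(results)
```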

Gartner reaches the same conclusion independently. In a November 2025 research note on on-premises AI infrastructure, it recommends combining shared GPU usage across siloed projects with disaggregated inference, where prompt-processing and token-generation run on different hardware. Nvidia’s own Dynamo inference framework, unveiled for MLPerf Inference v6.0 last month, is built on the same principle.

Two vendors and an independent analyst firm (Cast AI, Anyscale, Gartner) converging on the same diagnosis is a stronger signal than any single vendor’s story, especially when two of them compete with each other. The two types of waste compound. A fleet over-committed at procurement time, running workloads whose containers leave GPUs idle waiting for CPU preprocessing, is how an enterprise ends up at 5%. Fix one without fixing the other and most of the potential savings stay on the table.

What 40% utilization actually takes

If releasing GPUs is blocked by FOMO and procurement contracts are already signed, the only remaining lever is doing more useful work on the GPUs already committed. That is what “improve utilization” actually means in practice, and none of it requires buying a vendor’s product.

The simplest existence proof is the oldest technique in the book: GPU sharing across time zones. A bank with a credit decision engine serving Asian and US customers can run one pool of GPUs that serves both markets at different times. Nvidia published MIG (Multi-Instance GPU) and time-slicing primitives years ago. Most enterprises do not do it by hand because it is operationally boring and carries coordination overhead no one wants to own. An automated scheduler does it without getting tired.
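
A toy calculation makes the time-zone point concrete. The demand curves below are invented, not drawn from Cast AI’s data; the only thing that matters is that the two markets’ peaks sit roughly twelve hours apart.

```python
# Illustrative arithmetic only (invented demand curves): why one GPU pool can serve
# two markets whose peaks sit roughly twelve hours apart.
PEAK_GPUS = 40  # hypothetical peak demand of each market, in GPUs

def demand(hour: int, peak_hour: int) -> int:
    """Toy demand curve: full load at the market's peak hour, tapering toward zero."""
    distance = min(abs(hour - peak_hour), 24 - abs(hour - peak_hour))
    return max(0, PEAK_GPUS - 5 * distance)

us_peak, asia_peak = 16, 4  # peak hours in UTC, ~12 hours apart
combined = [demand(h, us_peak) + demand(h, asia_peak) for h in range(24)]

dedicated = 2 * PEAK_GPUS   # two dedicated pools, each sized for its own peak
shared = max(combined)      # one shared pool, sized for the worst combined hour
print(f"dedicated pools: {dedicated} GPUs, shared pool: {shared} GPUs")
# With offset peaks the shared pool needs roughly half the GPUs; MIG or time-slicing
# then lets several small workloads share each physical GPU inside that pool.
```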

Canva, the Australian design platform running over 100 production AI models, told Anyscale that it runs close to 100% GPU utilization during distributed training runs with roughly 50% cloud-cost reductions versus its previous setup. Inside Cast AI’s own data, a cluster of 136 H200 GPUs sustains 49% average utilization after applying GPU sharing, bin-packing (placing multiple workloads onto fewer, right-sized nodes), and a spot/on-demand mix. That is roughly ten times the fleet average and still short of saturation, which is honest: most real enterprise fleets with mixed dev, staging, and production workloads probably sustain 40% to 70% at full optimization, not 100%. Even that is an order of magnitude better than 5%.

One caveat: the report’s 5% figure explicitly excludes AI labs running dedicated training. Organizations that look more like frontier labs than mixed enterprise fleets likely see much higher utilization already.

The procurement paths have stopped being interchangeable

What should enterprises actually do differently in 2026? The paths available in the market are no longer interchangeable, and each makes a different bet on where supply and demand land.

| Procurement path | Typical H100-class price | Availability | Interruption risk | Commitment | Best fit |
| --- | --- | --- | --- | --- | --- |
| Hyperscaler on-demand | $3.00 to $6.98 per GPU-hour | Limited for H100/H200 | None | None | Unpredictable workloads, short runs |
| Hyperscaler Capacity Blocks | $4.33 to $4.97 per GPU-hour (H200 after Jan 2026) | Pre-book up to 8 weeks; 6-month window | None in window | Medium-term | Scheduled training with known windows |
| Hyperscaler spot | Up to 90% discount | Variable; H100/H200 thin | High (minutes of warning) | None | Fault-tolerant inference, checkpointed training |
| Specialized GPU clouds (CoreWeave, Lambda, RunPod, GMI) | $1.99 to $3.99 per GPU-hour for H100 | Broader for newer generations | Low to medium | Per-run or short reservation | Price-sensitive teams, flexible deployment |
| On-premise or colocation | Break-even around 12 to 18 months at sustained >60% utilization | 3 to 9 month lead times | None | 3+ year capex | High-utilization sustained workloads, strict compliance |
| Decentralized marketplaces (Vast.ai, io.net, Aethir) | Often under $1.00 per GPU-hour | Highly variable quality | High | None | Experimental or batch, non-production |

The pattern that no longer works is picking one path and locking in for a multi-year plan. A more defensible 2026 default is mixing paths against the split: commodity providers for workloads that can live there, hyperscaler Capacity Blocks only for workloads that need the guaranteed window.

Five levers worth pulling

None of the following requires giving back capacity that’s already been committed.

  1. Continuous rightsizing, not one-time configuration. Resource requests set at deployment are almost always wrong six months later. Karpenter, OpenCost, and Kubecost are open-source options; Cast AI, ScaleOps, nOps, and PerfectScale automate the rightsizing itself. Cast AI reports its continuous rightsizing cuts provisioned CPU by roughly 50% on average across its customer base. A minimal sketch of the loop follows this list.

  2. Regional spot placement, especially for T4-class inference. Cast AI’s survival-curve data shows T4 spot interruption risk ranging from about 10% over 24 hours in eu-west-3 to 80% in eu-central-1 and us-east-1. Region selection is a reliability decision, not just a latency one.

  3. GPU sharing through MIG and time-slicing. Nvidia’s MIG feature partitions A100, H100, and H200 chips into isolated instances with dedicated compute and memory. vLLM and Dynamo implement continuous batching and disaggregated inference. Open primitives, no vendor contract required.

  4. Disaggregated runtime. Ray lets CPU-bound data prep scale independently from GPU-bound training or inference. 

  5. Commitment rebalancing. Reserved Instances and Savings Plans drift as workloads change. Cast AI, nOps, and Vantage track utilization against committed capacity and adjust the split automatically.
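
The sketch referenced under lever 1 is below. It reduces continuous rightsizing to its core move: compare observed usage to the request, and reset the request to a high percentile of usage plus headroom. The sample numbers, the percentile and the headroom factor are all invented for illustration; the tools named above feed this from live metrics and apply it on a schedule.

```python
# Sketch of the rightsizing loop behind lever 1 (invented numbers; real tools pull
# usage from the metrics pipeline rather than a hard-coded list).
import math

observed_cpu_usage = [0.4, 0.6, 0.5, 0.9, 0.7, 0.55]  # cores used over recent samples
current_request = 4.0                                  # cores requested at deployment

def rightsized_request(samples: list[float], headroom: float = 1.3) -> float:
    """Set the new request to the 95th-percentile of observed usage plus a safety margin."""
    ranked = sorted(samples)
    p95 = ranked[min(len(ranked) - 1, math.ceil(0.95 * len(ranked)) - 1)]
    return round(p95 * headroom, 2)

new_request = rightsized_request(observed_cpu_usage)
print(f"request: {current_request} -> {new_request} cores "
      f"({100 * (1 - new_request / current_request):.0f}% less provisioned)")
# "Continuous" means this runs on a schedule, not once: as usage drifts,
# the requests drift with it instead of staying at day-one guesses.
```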

The bottom line

The single most practical question most enterprises have not asked this year: do they actually need an H200 at all?

H200 is designed for very large models (70B+ parameters) with very long contexts (128k+ tokens), where its 141 GB of memory (nearly double the H100’s 80 GB) is what lets the chip handle the load without slowing down. For smaller models, fine-tuned derivatives, quantized inference, and most production AI that actually ships to customers, an H100 does the same job at roughly 40% less per GPU-hour, according to Cast AI. An A100 often works, too, at roughly 60% less. The era of a single general-purpose GPU as the default answer is ending. Chip selection is becoming a routing decision, workload by workload, rather than a generational procurement decision.
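
As a thought experiment, that routing decision fits in a few lines. The thresholds below are hypothetical cut-offs that echo the criteria above, not a sizing guide, and real routing would also weigh batch size, latency targets and quantization.

```python
# Hypothetical routing sketch: the 70B-parameter and 128k-context thresholds mirror the
# paragraph above; the 13B cut-off is an invented placeholder, not a recommendation.
def pick_gpu(params_b: float, context_tokens: int, quantized: bool = False) -> str:
    if params_b >= 70 or context_tokens >= 128_000:
        return "H200"   # 141 GB of HBM is the differentiator, not raw compute
    if params_b >= 13 and not quantized:
        return "H100"   # roughly 40% cheaper per GPU-hour than H200, per Cast AI
    return "A100"       # roughly 60% cheaper; fine for small, fine-tuned or quantized models

print(pick_gpu(params_b=70, context_tokens=200_000))                 # -> H200
print(pick_gpu(params_b=8, context_tokens=8_000, quantized=True))    # -> A100
```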

Gil’s own observation sharpens this. At 80% utilization, a B200 genuinely delivers better unit cost per token than an A100: its throughput advantage outweighs its price premium. At 5% utilization, the math inverts. The premium chip compounds the waste. Buying the newest chip while underusing it is the most expensive possible version of the FOMO loop.
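
One way to see the inversion is to hold the workload fixed and ask what each chip costs per token actually served. The numbers below are illustrative only: the A100 rate echoes the on-demand figure cited earlier, while the B200 price and the 10x throughput multiple are placeholders, not quotes.

```python
# Illustrative only: prices and throughput figures are assumptions chosen to show the
# shape of the argument, not market data. The workload (token demand) is held fixed.
import math

A100_PRICE, B200_PRICE = 1.92, 12.00                   # $/GPU-hour; B200 is a placeholder
A100_CAPACITY, B200_CAPACITY = 1_000_000, 10_000_000   # hypothetical tokens/hour at full throttle

def cost_per_million(price: float, capacity: float, demand: float) -> float:
    """You pay for whole GPUs by the hour, so cost follows chips provisioned, not tokens used."""
    chips_needed = max(1, math.ceil(demand / capacity))
    return price * chips_needed / demand * 1_000_000

heavy = 8_000_000   # tokens/hour: one B200 running near 80% utilization
light = 500_000     # tokens/hour: roughly 5% of a B200; either chip is mostly idle

for label, demand in (("heavy", heavy), ("light", light)):
    a100 = cost_per_million(A100_PRICE, A100_CAPACITY, demand)
    b200 = cost_per_million(B200_PRICE, B200_CAPACITY, demand)
    print(f"{label} demand: A100 ${a100:.2f}/M tokens vs B200 ${b200:.2f}/M tokens")
# Under heavy demand the B200's throughput advantage wins; under light demand the
# cheaper chip wins, because the premium is paid for hours that sit idle either way.
```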

The first step is free, and it is a workload audit rather than a software purchase. No GPU needs to be released to run it. Every GPU-backed workload in production is worth reviewing against one question: is the chip it runs on actually matched to what it does? A surprising number of H200 purchases in 2026 will turn out to have been made because the allocation came through, not because the workload required it. Then fix runtime architecture before spending on more reserved capacity. Mix commodity and reserved tiers against the split instead of picking one.

Whether the broader GPU market eventually rebalances is a separate question, and not one worth betting a 2026 budget on. Supply could catch up. The memory crunch could ease. Specialized inference silicon could pull demand off the H200 tier. All of that is possible. None of it is certain. What is certain is that procurement and runtime are the same problem seen from two sides: FOMO drives over-commitment at the front end, and container architecture leaves the over-committed fleet idle at the back. Enterprises that treat them as one loop can break it. Enterprises that keep treating them as two separate budget items will keep paying to run their most expensive infrastructure at 5%.