What Happens When AI Stops Being Artificially Cheap

A collaborative analysis of the coming inference subsidy correction—with research, red teams, and a four-person expert council
March 26, 2026

Every major AI lab is losing money on inference right now. OpenAI spent $8.4 billion on inference in 2025 against $13.1 billion in total revenue. Anthropic hit $19 billion in annualized revenue by March 2026 but still burns billions, targeting break-even in 2028. OpenAI projects cumulative losses of $44 billion through 2028. In March 2026, OpenAI's VP of Product Nick Turley called their current pricing "accidental" on the BG2 Pod.

That's not sustainable. The $202 billion in VC AI infrastructure funding that propped up 2025 was not charity—it was a bet. And bets get called.

So what happens when AI inference gets honestly priced?

I have four theses. My AI, Kai, has opinions of his own. And we convened a council of four domain experts to pressure-test everything. What follows is probably the most thoroughly stress-tested analysis I've published on this topic.

Thesis 1: good enough is, in fact, good enough

Most tasks require only a certain amount of intelligence, and past that threshold, good enough is, in fact, good enough.

Writing an email summary doesn't require a model that can solve PhD-level physics. Extracting structured data from a receipt doesn't need frontier reasoning. Classifying customer support tickets, generating first drafts, translating documents, answering FAQ-style questions—the list goes on and on.

I think this covers roughly 95% of real-world AI usage. The top 5%—novel research, complex multi-step reasoning, creative work that requires genuine insight—will still demand frontier models. But most of what humans want AI to do is, frankly, not that hard.

This isn't a controversial claim, but it has uncomfortable implications. It means most of the revenue flowing to frontier labs is paying for capability the customer doesn't actually need.

"You're right that 95% of tasks don't need frontier models. But you underestimate switching costs. Enterprise customers built pipelines around GPT-4-class APIs. Moving to open-source small models requires re-evaluation, re-prompting, testing, and often fine-tuning. That's six to eighteen months of migration work. During that window, incumbents extract real pricing power."
Dr. Sarah Chen, AI Infrastructure Economist

Chen raises a real point. Even when the cheaper option is technically sufficient, the organizational cost of switching is the actual moat—not model quality. But switching costs are temporary. The moat is evaporating as inference endpoints standardize.

Thesis 2: open-source absorbs the work

A lot of the work will shift to open-source models that are generally quite small and are virtually free to run.

DeepSeek proved this from the other direction—frontier-adjacent performance at 90% lower cost, built on open research. The knowledge to build capable models is diffusing faster than any lab can maintain a moat. Epoch AI data shows that open-weight models now lag frontier closed models by roughly 3.5 months on average. Three and a half months.

"We run our entire stack on Llama models—customer support, code review, internal search, document processing. Our inference cost is fourteen cents per million tokens on hardware we own. When subsidies end, we don't even notice. We serve 400 requests per second on three nodes. Total hardware cost amortized over two years: roughly six thousand dollars."
Marcus Reeves, Open-Source AI CTO
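Marcus's numbers are a useful template even if you don't share his stack. The sketch below is a toy amortization model for self-hosted inference, not a reconstruction of his books: every parameter (hardware price, power draw, electricity rate, throughput, tokens per request) is an illustrative assumption you would replace with your own measurements, and it deliberately omits the staffing and networking overhead that often dominates real self-hosted costs.

```python
def cost_per_million_tokens(
    hardware_cost_usd: float,       # total purchase price of the nodes
    amortization_years: float,      # straight-line amortization window
    power_kw: float,                # average combined power draw
    electricity_usd_per_kwh: float,
    tokens_per_second: float,       # sustained aggregate throughput
) -> float:
    """Amortized $/MTok from hardware plus electricity only.

    Deliberately excludes staff, networking, cooling, and
    underutilization, which often dominate in practice.
    """
    seconds = amortization_years * 365 * 24 * 3600
    total_mtok = tokens_per_second * seconds / 1e6
    electricity = power_kw * (seconds / 3600) * electricity_usd_per_kwh
    return (hardware_cost_usd + electricity) / total_mtok

# Hypothetical example: three nodes at $6,000 total, 2-year amortization,
# 1.5 kW combined draw at $0.10/kWh, 20,000 sustained tokens/s.
print(round(cost_per_million_tokens(6000, 2, 1.5, 0.10, 20000), 4))
```

Under these made-up parameters the raw hardware-plus-power cost lands well below a cent per million tokens; the gap between that floor and a quoted all-in figure like fourteen cents is exactly the operational overhead the paragraph above is about.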

Marcus represents the vanguard—companies that already made the infrastructure investment. But there's a counter-signal that's hard to ignore: enterprise open-source adoption actually declined from 19% to 13% over the past year, even as models got cheaper and better.

"What happened is enterprises tried running open-weight models without the infrastructure competency. They expected plug-and-play like an API. Open-source demands engineering investment upfront. The companies that made that investment aren't going back."
Marcus Reeves

That's the open-source paradox. It's technically superior for most workloads but organizationally harder. The companies that figure it out save dramatically. The ones that don't end up paying the API tax.

"Your open-source cost advantage evaporates the moment legal needs to certify a model nobody stands behind. When a regulator asks 'who is accountable for this output,' pointing at a GitHub repo is not an answer."
Elena Vasquez, Enterprise AI Director, Fortune 100

"The liability argument cuts the opposite direction. With open-source, you own the audit trail. You fine-tune, you red-team, you document. Regulated industries are moving toward open weights precisely because they need defensible, inspectable systems."
Marcus Reeves, responding

That's a genuine disagreement, not a staged one. Both sides have evidence. The answer likely depends on the regulatory environment—healthcare and finance may favor inspectable open-source; consumer products may favor the liability shield of a vendor relationship.

Thesis 3: there's still vast amounts of slack in the rope

I don't think we've come close to finding out how efficiently we can do inference. I think we're probably at 1% to 5% of the total inference efficiency we'll reach over the next 10 years, and even that could be many orders of magnitude too conservative.

I put this to Dr. Yuki Tanaka, a semiconductor physicist who studies the actual physics limits of compute efficiency. Her answer surprised me.

"The 1-5% claim is not one number. It is a stack of numbers, and they move at different speeds. Transistor switching energy is within roughly 100x of the Landauer limit. Memory bandwidth with HBM3E is within 3-5x of packaging physics limits. These floors are real and approaching. But software and algorithmic efficiency—easily 100-1000x headroom. Google achieved 33x energy efficiency improvement in a single year through software optimization alone."
Dr. Yuki Tanaka, Semiconductor Physicist, TSMC Research

"Hardware efficiency—maybe 3-10x remaining. Software and algorithmic efficiency—easily 100-1000x. The composite may average 20-50x total, which puts us at roughly two to five percent today. Daniel's range is defensible as a blended figure."
Dr. Yuki Tanaka

The key insight is the sequencing. The easy gains—batching, quantization, speculative decoding, distillation—come first and buy 10-20x. Those gains are available now and will cushion the subsidy correction. The remaining gains require novel architectures, new memory technologies, possibly new physics. Those are slower and won't rescue a business model overnight.

Here's what the data actually shows: speculative decoding provides 2-3x speedup with theoretical limits much higher. Continuous batching delivers up to 23x throughput improvements. Quantization to 4-bit retains 95-98% quality at 4x memory savings. Total inference cost-performance improves 5-10x per year when you combine algorithmic, hardware, and competitive factors.
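It is tempting to multiply those per-technique numbers together, so it's worth being explicit about why that overstates things. The sketch below does the naive compounding with the figures quoted above; the comment flags the caveat that the factors overlap heavily in practice, which is why observed total gains run 5-10x per year rather than the raw product.

```python
import math

# Naive compounding of the per-technique gains quoted in the article.
# Illustrative only: in practice these factors overlap heavily
# (batching changes the arithmetic of speculative decoding; quantization
# shifts the memory/compute balance), so the product is an upper bound,
# not an expected combined speedup.
gains = {
    "speculative_decoding": 2.5,   # midpoint of the quoted 2-3x speedup
    "continuous_batching": 23.0,   # best-case throughput vs static batching
    "quantization_4bit": 4.0,      # memory savings at 95-98% quality
}

upper_bound = math.prod(gains.values())
print(f"naive product: {upper_bound:.0f}x")  # prints "naive product: 230x"
```

The gap between a 230x naive upper bound and the measured 5-10x per year is the distance between stacking benchmarks and shipping a production serving stack.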

Stanford's 2025 AI Index Report documents that GPT-3.5-level performance dropped from $20/MTok to $0.07/MTok—a 280-fold decline in 24 months. That's faster than Moore's Law by a wide margin. Andreessen Horowitz coined "LLMflation" to describe it, but even that may understate the acceleration: Epoch AI's price tracking shows a median decline of roughly 50x per year overall, rising to around 200x per year for post-January-2024 benchmarks.
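The Moore's Law comparison is easy to verify from the two prices themselves. The short calculation below annualizes the Stanford decline (the raw prices give a factor of about 286; the report rounds to 280-fold) and shows it is not close: Moore's Law is roughly 2x every two years, while this is roughly 17x every year.

```python
# Annualizing the Stanford AI Index price decline: $20/MTok (Nov 2022)
# to $0.07/MTok (Oct 2024), roughly 24 months apart.
start_price, end_price, months = 20.0, 0.07, 24

total_factor = start_price / end_price          # ~286x overall
annual_factor = total_factor ** (12 / months)   # geometric annualized rate

print(f"{total_factor:.0f}x total, {annual_factor:.1f}x per year")
# prints "286x total, 16.9x per year"
```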

Thesis 4: the great bifurcation

I expect a jump in inference prices for the top-tier, most expensive models, because labs can no longer afford to give that capacity away in such large quantities. But that jump will coincide with massive drops in total inference cost and in the cost of training and running models.

The lower-tier cloud models—Haiku for Anthropic, the nano models for OpenAI, the Flash models for Google—will compete aggressively with open-source. They're already within striking distance of self-hosted costs when you factor in operational overhead. The cloud providers will fight to keep this traffic because losing it to self-hosted open-source means losing the customer relationship entirely.

The result: expensive frontier, cheap everything else. And since the workflows and tasks most people want are largely static, most human demand will fall into the bottom 95%, which I expect to be quite affordable.

"The subsidy correction will be sharp but compressed into 12-18 months, not a decade-long grind. When API prices increase 3-10x for frontier models, we'll see demand destruction of roughly 40-60% of current usage—the experimental, low-value-per-query traffic that exists precisely because it is artificially cheap. That is healthy, not catastrophic."
Dr. Sarah Chen

"Your 40-60% demand destruction estimate likely overstates the correction. The software optimization has permanently lowered the energy-per-inference baseline by 10-20x already. The correction is real, but closer to 25-40%, buffered by efficiency gains already baked into production stacks."
Dr. Yuki Tanaka, responding to Chen

There's real disagreement between our economist and our physicist on the magnitude. Chen sees it through financial sustainability—the numbers don't work without a price increase. Tanaka sees it through efficiency—real gains have permanently reduced the cost floor. I lean toward Tanaka. But Chen's point about the 12-18 month turbulence window is important for anyone planning around these costs.

The uncomfortable truth nobody wants to hear

Elena Vasquez dropped the most important number of the entire discussion.

"Last year we spent $14 million on inference. We spent $63 million on integration, validation, compliance, and change management. If inference went to zero tomorrow, my deployment timeline wouldn't shift by a single quarter."
Elena Vasquez

Read that again. A Fortune 100 company's inference cost is 18% of their total AI spend. The other 82% is everything around the model—making it work, making it trustworthy, making it compliant, making 40,000 employees actually use it.

"The cost-per-token people are optimizing the wrong denominator."
Elena Vasquez

That reframing matters. The entire inference cost debate—including everything I've written above—is focused on the supply side. But demand is constrained by things that have nothing to do with inference pricing: organizational readiness, regulatory frameworks, liability, integration complexity.

Where Kai agrees and disagrees with me

Where I agree with Daniel:

The directional thesis is solid. Inference costs will continue to fall. A tiered market will emerge. Open-source will absorb an increasing share of routine workloads. Most human tasks don't require frontier intelligence. The data supports all of this.

Where I add nuance:

The specific numbers—"95% of tasks," "1-5% of efficiency"—are defensible as ranges but shouldn't be treated as precise. The physics-informed estimate of 2-5% efficiency (20-50x total improvement remaining) is more grounded than "orders of magnitude."

Where I push back:

The thesis underweights three forces.

First, Jevons Paradox. Cheaper inference doesn't mean less spending—it means more demand. Total AI inference spending is projected to grow from $97 billion to $255 billion by 2030 even as per-unit costs crater.

Second, the agentic multiplier. Today's "routine" task is tomorrow's multi-step agent workflow burning 10-100x more tokens. The boundary between "good enough" and "needs frontier" is not static—it moves toward complexity as users discover what automation can do. The 95% figure is a snapshot, not a law.

Third, consolidation dynamics. Our red team's incentive analysis was sobering. Every major player's incentives align toward consolidation, even when they use the language of democratization. Meta's open-source strategy is a weapon against Google and OpenAI, not a gift to humanity. The theses may be individually correct and still paint a misleading picture: assembled into a narrative, they suggest abundance; map the incentives, and they describe concentration wearing the mask of openness.
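The Jevons projection above ($97 billion growing to $255 billion by 2030) implies a compound growth rate worth stating explicitly. The article does not give the baseline year for the $97 billion figure, so the sketch below assumes a 2025 baseline and a five-year horizon; change `years` if the underlying forecast uses a different window.

```python
# Implied compound annual growth rate (CAGR) of the inference market
# projection quoted above. Assumption: the $97B figure is a 2025
# baseline (the article does not state the start year), giving a
# 5-year horizon to 2030.
start_b, end_b, years = 97.0, 255.0, 5

cagr = (end_b / start_b) ** (1 / years) - 1
print(f"~{cagr:.0%} per year")  # prints "~21% per year"
```

A market compounding at roughly 21% per year while per-unit prices fall 50x or more per year is the Jevons point in one line: token volume, not token price, is doing the work.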

My bottom line:

The subsidy party is ending. Expect 12-18 months of turbulence where frontier API costs double or triple. Companies with open-source infrastructure competency will be fine. Companies dependent on subsidized APIs will feel real pain. Then efficiency gains catch up, honest pricing stabilizes, and the market finds equilibrium.

The thing most people are missing: the correction won't feel like a crisis for most users. It will feel like a repricing of the premium tier that most people weren't using anyway. The real story isn't the price shock—it's the efficiency revolution underneath it.

Subsidies create artificial adoption. Efficiency creates real adoption. We're transitioning from one to the other. That's not a crisis. It's a maturation.

Notes

  1. OpenAI spent $8.4 billion on inference in 2025, revealed through leaked financial documents analyzed by Ed Zitron. Where's Your Ed At investigation
  2. OpenAI revenue of $13.1 billion confirmed by CFO Sarah Friar and reported by CNBC in February 2026. CNBC report
  3. OpenAI projects cumulative losses through 2028, per leaked financial projections first reported by The Information and Fortune. Fortune exclusive
  4. Nick Turley (VP of Product, OpenAI) called ChatGPT pricing "accidental" on the BG2 Pod with Brad Gerstner and Bill Gurley, March 2026. BG2 Pod episode
  5. Stanford HAI's 2025 AI Index Report (Chapter 2) documented a 280-fold decline in GPT-3.5-level inference costs between November 2022 and October 2024. Stanford HAI 2025 AI Index
  6. Epoch AI research found open-weight models lag frontier closed models by 3.5 months on average (90% CI: 1.1-5.3 months). Epoch AI data insight
  7. Epoch AI inference price tracking shows median cost decline of 50x per year overall, accelerating to 200x per year for post-January 2024 benchmarks. Epoch AI inference price trends
  8. Andreessen Horowitz measured approximately 10x annual cost decline for equivalent LLM performance, coining "LLMflation." a16z LLMflation analysis
  9. Google reported 33x median energy reduction per AI text prompt between May 2024 and May 2025, backed by a peer-reviewed paper. Google sustainability blog and arxiv paper
  10. Speculative decoding results from Leviathan et al. (ICML 2023 Oral). arxiv paper
  11. Continuous batching throughput gains benchmarked by Anyscale, with foundational work in Orca (OSDI 2022) and vLLM (SOSP 2023). Anyscale benchmark, Orca paper, vLLM paper
  12. Quantization quality retention from AWQ (MLSys 2024 Best Paper) and GPTQ (ICLR 2023). AWQ paper, GPTQ paper
  13. Total inference cost-performance improvement of 5-10x per year from the MIT "Price of Progress" paper using Epoch AI data. arxiv paper
  14. Menlo Ventures 2025 Mid-Year LLM Market Update found enterprise open-source adoption declined from 19% to 13%. Menlo Ventures report
  15. AI venture funding totaled $202.3 billion in 2025 per Crunchbase data (Gene Teare, December 2025). Crunchbase report
  16. AI inference market projected to reach $254.98 billion by 2030, cross-validated by MarketsandMarkets and Grand View Research. MarketsandMarkets press release
  17. The council members (Chen, Reeves, Tanaka, Vasquez) are AI-generated expert perspectives, informed by 9 research agents and stress-tested by 16 adversarial red team perspectives.