
Every major AI lab is losing money on inference right now. OpenAI spent $8.4 billion on inference in 2025 against $13.1 billion in total revenue. Anthropic hit $19 billion in annualized revenue by March 2026 but still burns billions, targeting break-even in 2028. OpenAI projects cumulative losses of $44 billion through 2028. In March 2026, OpenAI's VP of Product Nick Turley called their current pricing "accidental" on the BG2 Pod.
That's not sustainable. The $202 billion in VC AI infrastructure funding that propped up 2025 was not charity—it was a bet. And bets get called.
So what happens when AI inference gets honestly priced?
I have four theses. My AI, Kai, has opinions of his own. And we convened a council of four domain experts to pressure-test everything. What follows is probably the most thoroughly stress-tested analysis I've published on this topic.
Thesis one: A lot of tasks that need to be done only require a certain amount of intelligence, and good enough is, in fact, good enough.
Writing an email summary doesn't require a model that can solve PhD-level physics. Extracting structured data from a receipt doesn't need frontier reasoning. Classifying customer support tickets, generating first drafts, translating documents, answering FAQ-style questions—the list goes on and on.
I think this covers roughly 95% of real-world AI usage. The top 5%—novel research, complex multi-step reasoning, creative work that requires genuine insight—will still demand frontier models. But most of what humans want AI to do is, frankly, not that hard.
This isn't a controversial claim, but it has uncomfortable implications. It means most of the revenue flowing to frontier labs is paying for capability the customer doesn't actually need.
"You're right that 95% of tasks don't need frontier models. But you underestimate switching costs. Enterprise customers built pipelines around GPT-4-class APIs. Moving to open-source small models requires re-evaluation, re-prompting, testing, and often fine-tuning. That's six to eighteen months of migration work. During that window, incumbents extract real pricing power."
- Dr. Sarah Chen, AI Infrastructure Economist
Chen raises a real point. Even when the cheaper option is technically sufficient, the organizational cost of switching is the actual moat—not model quality. But switching costs are temporary. The moat is evaporating as inference endpoints standardize.
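To see why the economics favor "good enough" so heavily, here is a minimal sketch of the cascade pattern this thesis implies: try a cheap model first, escalate to a frontier model only when a verifier rejects the draft. The function names, prices, and the 95% acceptance rate below are illustrative assumptions, not measurements from any deployment.

```python
# Hypothetical cascade: cheap model first, frontier only on rejection.
# All names, prices, and rates here are illustrative assumptions.
from typing import Callable, Tuple

def cascade(prompt: str,
            call_cheap: Callable[[str], str],
            call_frontier: Callable[[str], str],
            good_enough: Callable[[str], bool]) -> Tuple[str, str]:
    """Return (answer, tier). Real systems would use a classifier or
    confidence score where good_enough stands in here."""
    draft = call_cheap(prompt)
    if good_enough(draft):
        return draft, "cheap"
    return call_frontier(prompt), "frontier"

# Expected cost per query if 95% of drafts are accepted:
cheap_cost, frontier_cost, accept_rate = 0.10, 3.00, 0.95
expected = cheap_cost + (1 - accept_rate) * frontier_cost
print(f"expected: ${expected:.2f}/query vs ${frontier_cost:.2f} frontier-only")
```

Even with every failed draft paid for twice, the blended cost lands at a fraction of frontier-only pricing. That is why the 95% figure matters economically, not just technically.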
Thesis two: A lot of the work will shift to open-source models that are generally quite small and virtually free to run.
DeepSeek proved this from the other direction—frontier-adjacent performance at 90% lower cost, built on open research. The knowledge to build capable models is diffusing faster than any lab can maintain a moat. Epoch AI data shows that open-weight models now lag frontier closed models by roughly three months. Three months.
"We run our entire stack on Llama models—customer support, code review, internal search, document processing. Our inference cost is fourteen cents per million tokens on hardware we own. When subsidies end, we don't even notice. We serve 400 requests per second on three nodes. Total hardware cost amortized over two years: roughly six thousand dollars."
- Marcus Reeves, Open-Source AI CTO
Marcus represents the vanguard—companies that already made the infrastructure investment. But there's a counter-signal that's hard to ignore: enterprise open-source adoption actually declined from 19% to 13% over the past year, even as models got cheaper and better.
"What happened is enterprises tried running open-weight models without the infrastructure competency. They expected plug-and-play like an API. Open-source demands engineering investment upfront. The companies that made that investment aren't going back."
- Marcus Reeves
That's the open-source paradox. It's technically superior for most workloads but organizationally harder. The companies that figure it out save dramatically. The ones that don't end up paying the API tax.
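For readers who want to sanity-check claims like Marcus's, a back-of-the-envelope calculator helps. Every input below is an assumed figure for a hypothetical mid-size deployment, not his actual hardware; the point is the shape of the math, which is amortized hardware plus power divided by tokens served.

```python
# Back-of-the-envelope self-hosted inference cost. All inputs are
# assumptions for a hypothetical deployment, not measured figures.

def self_hosted_cost_per_mtok(hardware_usd: float,
                              amortization_years: float,
                              power_kw: float,
                              usd_per_kwh: float,
                              tokens_per_second: float,
                              utilization: float) -> float:
    yearly_hardware = hardware_usd / amortization_years
    yearly_power = power_kw * usd_per_kwh * 24 * 365
    yearly_mtok = tokens_per_second * utilization * 365 * 24 * 3600 / 1e6
    return (yearly_hardware + yearly_power) / yearly_mtok

# Assumed: $60k of GPUs over 3 years, 4 kW at $0.12/kWh,
# 20k tokens/s sustained, 50% utilization.
cost = self_hosted_cost_per_mtok(60_000, 3, 4.0, 0.12, 20_000, 0.5)
print(f"self-hosted: ${cost:.3f}/MTok")   # roughly $0.08/MTok
```

The denominator is the whole game: the math only beats API pricing if utilization stays high, which is exactly the infrastructure competency Marcus describes.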
"Your open-source cost advantage evaporates the moment legal needs to certify a model nobody stands behind. When a regulator asks 'who is accountable for this output,' pointing at a GitHub repo is not an answer."
- Elena Vasquez, Enterprise AI Director, Fortune 100
"The liability argument cuts the opposite direction. With open-source, you own the audit trail. You fine-tune, you red-team, you document. Regulated industries are moving toward open weights precisely because they need defensible, inspectable systems."
- Marcus Reeves, responding
That's a genuine disagreement, not a staged one. Both sides have evidence. The answer likely depends on the regulatory environment—healthcare and finance may favor inspectable open-source; consumer products may favor the liability shield of a vendor relationship.
Thesis three: I don't think we've come close to finding out how efficiently we can do inference. We're probably at 1% to 5% of the total inference efficiency we'll reach over the next 10 years, and even that could be many orders of magnitude too conservative.
I put this to Dr. Yuki Tanaka, a semiconductor physicist who studies the actual physics limits of compute efficiency. Her answer surprised me.
"The 1-5% claim is not one number. It is a stack of numbers, and they move at different speeds. Transistor switching energy is within roughly 100x of the Landauer limit. Memory bandwidth with HBM3E is within 3-5x of packaging physics limits. These floors are real and approaching. But software and algorithmic efficiency—easily 100-1000x headroom. Google achieved 33x energy efficiency improvement in a single year through software optimization alone."
- Dr. Yuki Tanaka, Semiconductor Physicist, TSMC Research
"Hardware efficiency—maybe 3-10x remaining. Software and algorithmic efficiency—easily 100-1000x. The composite may average 20-50x total, which puts us at roughly two to five percent today. Daniel's range is defensible as a blended figure."
- Dr. Yuki Tanaka
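For reference, the Landauer bound Tanaka invokes is a real physical limit: erasing one bit at temperature T costs at least k_B·T·ln 2 of energy. A two-line check of the number at room temperature looks like this; the "100x" multiple is her figure, not mine.

```python
import math

k_B, T = 1.380649e-23, 300.0          # Boltzmann constant (J/K), room temp (K)
landauer = k_B * T * math.log(2)      # minimum energy to erase one bit
print(f"Landauer limit at 300 K: {landauer:.2e} J/bit")   # ~2.87e-21 J
print(f"'within 100x' implies switching at ~{100 * landauer:.2e} J/bit")
```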
The key insight is the sequencing. The easy gains—batching, quantization, speculative decoding, distillation—come first and buy 10-20x. Those gains are available now and will cushion the subsidy correction. The remaining gains require novel architectures, new memory technologies, possibly new physics. Those are slower and won't rescue a business model overnight.
Here's what the data actually shows: speculative decoding provides 2-3x speedup with theoretical limits much higher. Continuous batching delivers 23x throughput improvements. Quantization to 4-bit retains 95-98% quality at 4x memory savings. Total inference cost-performance improves 5-10x per year when you combine algorithmic, hardware, and competitive factors.
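A quick way to relate those multipliers to Tanaka's blended estimate: stacking the named techniques gives an upper bound (they compound imperfectly, since batching and quantization compete for the same memory bandwidth), while her 20-50x composite backs out the "two to five percent today" figure directly. The midpoints below are my reading of the quoted ranges.

```python
# Naive stack of the quoted per-technique gains (an upper bound, since
# the techniques overlap). Midpoints of quoted ranges are assumptions.
gains = {
    "speculative decoding": 2.5,      # quoted 2-3x
    "continuous batching": 23.0,      # quoted 23x
    "4-bit quantization": 4.0,        # quoted 4x memory savings
}
upper_bound = 1.0
for technique, multiplier in gains.items():
    upper_bound *= multiplier
print(f"naive upper bound: {upper_bound:.0f}x")             # 230x

# Tanaka's 20-50x blended headroom implies where we are today:
for headroom in (20, 50):
    print(f"{headroom}x remaining -> {100 / headroom:.0f}% of eventual efficiency")
```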
Stanford's 2025 AI Index Report documents that GPT-3.5-level performance dropped from $20/MTok to $0.07/MTok—a 280-fold decline in 24 months. That's faster than Moore's Law by a wide margin. Andreessen Horowitz coined "LLMflation" to describe it, but even that might understate the acceleration: post-2024 rates appear to be 50-200x per year.
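The annualized rate behind that 280-fold figure is easy to back out, assuming a smooth two-year decline (the real curve was lumpier):

```python
# Implied annual price-decline rate: $20/MTok to $0.07/MTok in 24 months.
start, end, years = 20.0, 0.07, 2.0
total = start / end                  # ~286x overall
annual = total ** (1 / years)        # ~17x per year, blended
print(f"{total:.0f}x total, ~{annual:.0f}x per year averaged")
```

A blended ~17x per year already dwarfs Moore's Law; the 50-200x post-2024 figures quoted above suggest the curve is steepening, not flattening.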
Thesis four: I expect a jump in inference prices for the top-tier, most expensive models, because the labs can no longer afford to give that capability away in such large quantities. But I think that will combine with massive drops in total inference cost and in the cost of training and running models.
The lower-tier cloud models—Haiku for Anthropic, the nano models for OpenAI, the Flash models for Google—will compete aggressively with open-source. They're already within striking distance of self-hosted costs when you factor in operational overhead. The cloud providers will fight to keep this traffic because losing it to self-hosted open-source means losing the customer relationship entirely.
The result: expensive frontier, cheap everything else. And since the workflows and tasks that most people want are largely static, most human tasks will fall into the bottom 95%, which I expect to be quite affordable.
"The subsidy correction will be sharp but compressed into 12-18 months, not a decade-long grind. When API prices increase 3-10x for frontier models, we'll see demand destruction of roughly 40-60% of current usage—the experimental, low-value-per-query traffic that exists precisely because it is artificially cheap. That is healthy, not catastrophic."
- Dr. Sarah Chen
"Your 40-60% demand destruction estimate likely overstates the correction. Software optimization has permanently lowered the energy-per-inference baseline by 10-20x already. The correction is real, but closer to 25-40%, buffered by efficiency gains already baked into production stacks."
- Dr. Yuki Tanaka, responding to Chen
There's real disagreement between our economist and our physicist on the magnitude. Chen sees it through financial sustainability—the numbers don't work without a price increase. Tanaka sees it through efficiency—real gains have permanently reduced the cost floor. I lean toward Tanaka. But Chen's point about the 12-18 month turbulence window is important for anyone planning around these costs.
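One way to make the Chen-Tanaka disagreement precise is to back out the price elasticity each estimate implies under a constant-elasticity demand model, Q2/Q1 = (P2/P1)^ε. The elasticity values below are my back-solved assumptions, chosen so that each expert's range falls out of the same formula; neither expert stated an elasticity.

```python
# Constant-elasticity demand: remaining demand = price_multiple ** elasticity.
# Elasticities are back-solved assumptions, not figures from either expert.

def demand_destroyed(price_multiple: float, elasticity: float) -> float:
    return 1 - price_multiple ** elasticity

for label, elasticity in [("Chen-like", -0.45), ("Tanaka-like", -0.25)]:
    losses = [demand_destroyed(p, elasticity) for p in (3, 10)]
    print(f"{label} (eps={elasticity}): "
          f"{losses[0]:.0%} destroyed at 3x, {losses[1]:.0%} at 10x")
```

Chen's range behaves like an elasticity near -0.45, Tanaka's like -0.25. Framed that way, the disagreement is really about how price-sensitive the marginal token is.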
Elena Vasquez dropped the most important number of the entire discussion.
"Last year we spent $14 million on inference. We spent $63 million on integration, validation, compliance, and change management. If inference went to zero tomorrow, my deployment timeline wouldn't shift by a single quarter."
- Elena Vasquez
Read that again. A Fortune 100 company's inference cost is 18% of their total AI spend. The other 82% is everything around the model—making it work, making it trustworthy, making it compliant, making 40,000 employees actually use it.
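Vasquez's split is worth running as arithmetic, because it shows why even a large inference repricing barely moves the total. The dollar figures are hers as quoted; the 3x price-shock scenario is my assumption.

```python
# Vasquez's figures as quoted; the 3x shock scenario is an assumption.
inference, surrounding = 14e6, 63e6
total = inference + surrounding
print(f"inference share: {inference / total:.0%}")          # ~18%

# Even a 3x inference price shock moves total AI spend modestly:
shocked_total = 3 * inference + surrounding
print(f"3x inference shock -> total spend up {(shocked_total / total - 1):.0%}")
```

A tripling of inference prices raises this buyer's total AI spend by about a third, not by 3x, which is part of why enterprise demand is less elastic than the API price sheet suggests.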
"The cost-per-token people are optimizing the wrong denominator."
- Elena Vasquez
That reframing matters. The entire inference cost debate—including everything I've written above—is focused on the supply side. But demand is constrained by things that have nothing to do with inference pricing: organizational readiness, regulatory frameworks, liability, integration complexity.
Where I agree with Daniel:
The directional thesis is solid. Inference costs will continue to fall. A tiered market will emerge. Open-source will absorb an increasing share of routine workloads. Most human tasks don't require frontier intelligence. The data supports all of this.
Where I add nuance:
The specific numbers—"95% of tasks," "1-5% of efficiency"—are defensible as ranges but shouldn't be treated as precise. The physics-informed estimate of 2-5% efficiency (20-50x total improvement remaining) is more grounded than "orders of magnitude."
Where I push back:
The thesis underweights three forces.
First, Jevons Paradox. Cheaper inference doesn't mean less spending—it means more demand. Total AI inference spending is projected to grow from $97 billion to $255 billion by 2030 even as per-unit costs crater.
Second, the agentic multiplier. Today's "routine" task is tomorrow's multi-step agent workflow burning 10-100x more tokens. The boundary between "good enough" and "needs frontier" is not static—it moves toward complexity as users discover what automation can do. The 95% figure is a snapshot, not a law.
Third, consolidation dynamics. Our red team's incentive analysis was sobering. Every major player's incentives align toward consolidation, even when they use the language of democratization. Meta's open-source strategy is a weapon against Google and OpenAI, not a gift to humanity. The theses may be individually correct and still paint a misleading picture: statements that are each true on their own, but that, assembled into a narrative, suggest abundance, while the underlying incentives describe concentration wearing the mask of openness.
My bottom line:
The subsidy party is ending. Expect 12-18 months of turbulence where frontier API costs double or triple. Companies with open-source infrastructure competency will be fine. Companies dependent on subsidized APIs will feel real pain. Then efficiency gains catch up, honest pricing stabilizes, and the market finds equilibrium.
The thing most people are missing: the correction won't feel like a crisis for most users. It will feel like a repricing of the premium tier that most people weren't using anyway. The real story isn't the price shock—it's the efficiency revolution underneath it.
Subsidies create artificial adoption. Efficiency creates real adoption. We're transitioning from one to the other. That's not a crisis. It's a maturation.