Blog

Is AI Getting More Expensive? Here's What's Actually True.

The documented version is more useful: frontier models are getting more expensive per task, some price increases are showing up through token behavior instead of headline rate cards, and the old assumption that every new model will be cheaper than the last is breaking.

May 26, 2026

AIAnalysisPricingAnthropicOpenaiGoogle

What you need to know: A claim is spreading that the era of cheap AI is over because Anthropic effectively raised the price of Claude Opus 4.7 through tokenizer “shrinkflation.” The core observation is real. Anthropic’s published price for Opus 4.7 is still $5 per million input tokens and $25 per million output tokens, the same as Opus 4.6, but Anthropic also says the new tokenizer can turn the same fixed text into up to 35% more tokens. That can raise the real cost of a request even when the price page looks unchanged. But the broader claim needs cleanup. This is not proof that all AI is getting more expensive. It is proof that price-per-token is no longer enough to understand what these systems cost.

A dark nwslyr analysis graphic asking whether AI is getting more expensive, with pricing cards for Anthropic, Google, OpenAI, and broader industry trends. — The current pricing debate is less about one rate card and more about the cost of completing real work.

There is a simple story going around right now: AI companies promised better, faster, cheaper models, then quietly pulled the ladder up. The post version is emotionally satisfying. It has villains, a trick, and a clean conclusion.

The actual story is less clean and more important.

AI is not suddenly expensive. Frontier AI was always expensive. What changed is that the cost is becoming harder to see from the rate card alone.

The Anthropic Claim Is Directionally Right

Start with the part that is true.

Anthropic did not raise the listed API rate for Claude Opus 4.7. The official pricing page lists Opus 4.7 at $5 per million input tokens and $25 per million output tokens. That is the same listed rate as Opus 4.6.

But Anthropic also says Opus 4.7 uses a new tokenizer. A tokenizer is the system that turns text into billable chunks. If the tokenizer changes, the same paragraph, prompt, SQL query, or code file can produce a different number of tokens.

Anthropic’s own documentation says the new Opus 4.7 tokenizer may use roughly 1.0x to 1.35x as many tokens as previous models, depending on the content. The company also says Opus 4.7 thinks more at higher effort levels, especially later in agentic workflows, which can produce more output tokens.

That is the whole issue.

The rate did not move. The denominator did.

If your exact same workload produces 25% more tokens, your bill goes up 25%. If it produces 35% more tokens, your bill goes up 35%. There is no mystery in the math.

But calling this a secret price hike is only half right. It is not hidden in the sense that Anthropic buried it in a dark room and hoped nobody would notice. It is documented. It appears in Anthropic’s migration and pricing materials. The better criticism is that most developers do not budget from tokenizer release notes. They budget from pricing tables.

That distinction matters.

A company can disclose a cost change and still make it too easy for customers to underestimate the impact.

The Number Is Not Automatically 35%

The other correction: “up to 35% more tokens” does not mean every workload costs 35% more.

It depends on the shape of the workload.

OpenRouter published one of the more useful third-party checks because it compared Anthropic’s native token counts against its own consistent tokenizer baseline. In its switcher cohort, OpenRouter found Opus 4.7 produced 32% to 34% more native tokens than Opus 4.6 for production-scale prompts above 10K tokens. Smaller prompts showed higher tokenizer inflation, around 42% to 45%.

That sounds worse than Anthropic’s number until you look at the actual cost impact. OpenRouter found prompt caching absorbed a large share of the inflation for longer prompts. Net cost increases ranged from roughly 12% to 27% across most prompt-size buckets, while very short prompts were actually slightly cheaper in its dataset because Opus 4.7 generated much shorter completions.

That is the part missing from most viral posts.

Tokenizer inflation is real. But billed cost is shaped by input tokens, output tokens, cache hits, completion length, effort settings, and the type of task being run.

The correct conclusion is not “Anthropic raised prices 35%.”

The correct conclusion is: Opus 4.7 can be materially more expensive for the same workload even though the listed per-token rate did not change, and you have to measure your own workload to know by how much.

That is less catchy. It is also true.

The Bigger Pattern Is Real

The reason this hit a nerve is that it is not happening in isolation.

Google’s Gemini 3.5 Flash is now listed at $1.50 per million input tokens and $9.00 per million output tokens. Gemini 3 Flash is listed at $0.50 and $3.00. Gemini 3.1 Flash-Lite is listed at $0.25 and $1.50.

So yes, Gemini 3.5 Flash is 3x Gemini 3 Flash on raw token price, and 6x Gemini 3.1 Flash-Lite.

That is a real change in the economics of the Flash tier. The word “Flash” used to mean cheap enough for high-volume production workloads. Now it means fast, capable, and still cheaper than Pro, but not automatically cheap.

OpenAI shows the same direction. Its public pricing page lists GPT-5.5 at $5 per million input tokens and $30 per million output tokens, compared with GPT-5.4 at $2.50 and $15. The detailed developer pricing tables vary by context and processing mode, but the relationship is the same: GPT-5.5 is roughly double GPT-5.4 across the comparable published rates.

Anthropic’s listed Opus price did not double. Google’s Flash price did not become Opus-level expensive. OpenAI still has cheaper mini and nano models. The market did not move in one uniform step.

But the trend at the frontier is clear enough: the newest, most agentic, most reasoning-heavy models are not continuing the old pattern of “same capability, lower price.” They are becoming more capable and often more expensive per useful unit of work.

That last phrase is the important one: per useful unit of work.

The pricing page is not the product anymore. The product is the completed task.

Other People Are Seeing the Same Thing

The public conversation has split into three camps.

The first camp is calling it shrinkflation. They are not completely wrong. When the same input becomes more billable tokens, the effective unit price changes even if the sticker price does not. That is a reasonable thing for developers to be angry about, especially if they built workflows around Opus 4.6 token behavior.

The second camp is defending the model. Their argument is that Opus 4.7 is simply better at hard tasks, especially software engineering, and that higher token consumption is the cost of better reasoning. This is also reasonable. A model that spends more tokens to solve the problem correctly can be cheaper than a model that fails quickly and forces retries.

The third camp is looking past Anthropic entirely. Simon Willison framed Gemini 3.5 Flash as part of a broader pattern: Google, OpenAI, and Anthropic all appear to be probing how much API customers are willing to pay for the next tier of capability. The Decoder made a similar point from the cost-per-task side: raw token price is becoming less useful because agentic models can burn very different numbers of turns and tokens to finish the same job.

That is the right frame.

The fight is not really about Opus 4.7. It is about whether AI pricing is still legible.

For two years, developers were trained to think in price-per-million-tokens. You could compare models, plug the numbers into a spreadsheet, and make a reasonable architectural decision. That world is ending.

A $1.50 model can be more expensive than a $2.00 model if it takes twice as many turns. A $5 model can be cheaper than a $2.50 model if it solves the task once instead of needing three retries. A model with the same rate card can become more expensive if its tokenizer changes. A model with a higher rate card can become cheaper if it uses fewer tokens.

This is what makes the current pricing discourse so messy. People are arguing over the wrong unit.

Deprecations Make the Economics Worse

The second real issue is model retirement.

AI companies retire models. That is normal. It is also disruptive.

Anthropic’s deprecation page currently lists Opus 4.7, Opus 4.6, Opus 4.5, and Opus 4.1 as active, while the original Opus 4 model is deprecated and scheduled for retirement on June 15, 2026. Google’s deprecation page lists Gemini 2.5 Flash and 2.5 Flash-Lite with October 16, 2026 shutdown dates, with Gemini 3.5 Flash and Gemini 3.1 Flash-Lite as recommended replacements. OpenAI also maintains a long deprecation history and has announced additional 2026 shutdowns for older API models.

None of that proves a conspiracy.

But it does create a ratchet.

If an older cheap model works for your product, and the provider retires it, you are not just choosing whether the new model is better. You are choosing whether the new model preserves the unit economics that made your product possible.

That is the actual developer anxiety underneath the viral posts.

Nobody wants to build a business on an API where the working model disappears, the replacement costs 3x more, and the explanation is “the new one is smarter.” Smarter is not a business model. Margin is.

The Hardware Context Is Real, But Not a Complete Explanation

There is a real infrastructure story behind this.

AI server demand is putting pressure on the memory supply chain, especially DRAM and enterprise SSDs. TrendForce first forecast in January that conventional DRAM contract prices would rise 55% to 60% quarter-over-quarter in Q1 2026, with NAND Flash up 33% to 38%. It later revised that Q1 outlook sharply higher: conventional DRAM up 90% to 95%, NAND Flash up 55% to 60%, and enterprise SSDs up 53% to 58%.

The pressure did not stop there. For Q2 2026, TrendForce projected conventional DRAM contract prices would rise another 58% to 63% quarter-over-quarter, while NAND Flash contract prices would rise 70% to 75%. TrendForce tied the move to AI and data-center demand, with suppliers reallocating capacity toward HBM, server DRAM, and enterprise SSDs.

That matters. Models do not run on vibes. They run on GPUs, memory, networking, storage, power, land, cooling, and capital.

But hardware inflation does not explain every API pricing decision. It is context, not proof. No public pricing page says, “we raised this model because DRAM and NAND got expensive.” Investors, infrastructure, demand, competition, utilization, and product strategy all matter.

The cleanest explanation is simpler: the cheap inference curve is no longer moving fast enough to hide the increasing compute appetite of frontier models.

For simple chat, summarization, extraction, and classification, cheap models still exist. For long-running agents, deep coding sessions, multimodal reasoning, and high-effort planning, the cost curve is moving the other way.

Cheap AI Is Not Over

This is where the viral version goes too far.

Cheap AI is not over.

Google still has Flash-Lite. OpenAI still has mini and nano models. Anthropic still has Haiku. DeepSeek just made a major price cut on its flagship V4-Pro model permanent, according to Reuters. Open-source and local models continue to improve. For a large class of workloads, AI is still absurdly cheap compared with hiring labor or running older software pipelines.

What is over is the lazy assumption that the next model will automatically be cheaper for your workload.

That assumption was never a law. It was a phase.

The first phase of the AI API market was customer acquisition. Every lab wanted developers building on its stack. Cheap models created ecosystems. Builders optimized around those prices. Products launched. Workflows formed.

The second phase is monetization. Frontier labs have massive compute bills, enterprise customers, investor expectations, and models that do more work per task. The bill has to show up somewhere.

Sometimes it shows up in the rate card.

Sometimes it shows up in the tokenizer.

Sometimes it shows up in output length.

Sometimes it shows up in reasoning budgets, rate limits, usage caps, cache behavior, or model retirement.

But it shows up.

What Developers Should Do Now

The practical answer is boring and unavoidable: stop pricing models by the rate card alone.

For every production workflow, measure cost per completed task. Measure retries. Measure average turns. Measure cache hit rates. Measure output length. Measure latency. Measure failure cost. Measure how often a cheaper model needs escalation to a stronger one.

Then build routing around that.

Use Haiku, Flash-Lite, mini, nano, local models, and deterministic code for the work that does not need frontier reasoning. Reserve Opus, GPT-5.5, Gemini Pro-class models, and high-effort modes for the jobs where the extra intelligence changes the result. Cache aggressively. Keep prompts shorter than your ego wants them to be. Do not let agents wander unless wandering is part of the task.

And most importantly: treat model upgrades like vendor migrations, not software updates.

Run evals. Run cost tests. Run old prompts through the new tokenizer before switching. Check the deprecation page. Check whether your “same” workload is actually the same workload after tool calls, reasoning tokens, compaction, and output behavior change.

The future does not belong to the company that uses the smartest model everywhere.

It belongs to the company that knows exactly when not to.