AI Inference Costs: Why Cheaper Tokens Cost More

Cheap Tokens, Expensive Outcomes: Defining the New AI Economy

The AI price collapse describes the rapid fall in per-token inference costs for capable models, even as total AI spending rises. Companies pay less for each token processed, yet their overall bills grow because they feed more tokens into more powerful systems and increasingly agentic workflows, and because some premium tiers have become more expensive. This contradiction matters for anyone planning large-scale deployments: lower AI inference costs make advanced tools more accessible, but they also encourage longer contexts, more tool calls, and more ambitious products that burn through billions or even quadrillions of tokens. In this new landscape, price per million tokens tells only part of the story; how a model is used and which tier is chosen often determine the real bill.

MiMo, DeepSeek and the Race to the Bottom on Inference Costs

On paper, AI looks cheaper than ever. Xiaomi’s MiMo V2.5 Pro, a reasoning-focused large model, is listed at about USD 1 (approx. RM4.60) per million input tokens and USD 3 (approx. RM13.80) per million output tokens for prompts up to 256,000 tokens. DeepSeek’s V4-Pro goes further, with pricing set to remain at one quarter of its original rate after a discount period ends, signaling how fierce the competition has become. These numbers push serious reasoning workloads into a budget range that once seemed unreachable for startups. Founders building research assistants, coding agents, or legal review tools can now test models that would have been uneconomic a year ago, with longer sessions and richer context. For these users, collapsing AI inference costs look like a pure win, but the real impact depends on how much more work they now ask models to do.
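To make the per-token arithmetic concrete, here is a minimal sketch that prices a single request at the MiMo V2.5 Pro rates quoted above. The function name and both token counts are hypothetical, chosen only to contrast a short chat turn with a long agent session; they are not measurements from any real workload.

```python
def request_cost_usd(input_tokens, output_tokens,
                     input_price_per_m, output_price_per_m):
    """Cost of one request, given prices in USD per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Listed rates from the article (USD per million tokens).
MIMO_INPUT_PRICE = 1.00
MIMO_OUTPUT_PRICE = 3.00

# Hypothetical token counts, for illustration only.
short_chat = request_cost_usd(2_000, 500, MIMO_INPUT_PRICE, MIMO_OUTPUT_PRICE)
long_agent = request_cost_usd(200_000, 20_000, MIMO_INPUT_PRICE, MIMO_OUTPUT_PRICE)

print(f"short chat turn:    ${short_chat:.4f}")
print(f"long agent session: ${long_agent:.4f}")
print(f"cost multiple:      {long_agent / short_chat:.0f}x")
```

Under these assumed counts, the agent session costs many times more than the chat turn at the same per-token rate, which is the sense in which a cheaper price can still produce a larger bill.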

When the ‘Cheap’ Tier Gets Costly: Gemini 3.5 Flash and Friends

At the same time, some of the models marketed as affordable are moving in the opposite direction. Google’s Gemini 3.5 Flash, the latest in its budget line, costs USD 1.50 (approx. RM6.90) per million input tokens and USD 9 (approx. RM41.40) per million output tokens, compared with USD 0.50 (approx. RM2.30) and USD 3 (approx. RM13.80) for Gemini 3 Flash. That is a threefold jump on both input and output in the tier meant for high-volume, low-stakes workloads. Independent testing from Artificial Analysis found that running a benchmark suite on 3.5 Flash cost about USD 1,550 (approx. RM7,130), while the higher-tier Gemini 3.1 Pro came in around USD 890 (approx. RM4,100), because the newer Flash both charges more per token and generates more tokens on multi-step agent tasks. Here, the budget label hides higher effective costs once usage patterns are taken into account.
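The ratios behind these figures can be checked with a few lines of arithmetic. The sketch below uses only the prices and benchmark totals quoted above; the variable names are illustrative.

```python
# Listed prices in USD per million tokens, as quoted in the article.
gemini_3_flash = {"input": 0.50, "output": 3.00}
gemini_3_5_flash = {"input": 1.50, "output": 9.00}

# Generation-over-generation price multiple for each token type.
input_multiple = gemini_3_5_flash["input"] / gemini_3_flash["input"]
output_multiple = gemini_3_5_flash["output"] / gemini_3_flash["output"]

# Approximate benchmark-suite totals reported by Artificial Analysis (USD).
flash_3_5_benchmark_cost = 1550
pro_3_1_benchmark_cost = 890
benchmark_multiple = flash_3_5_benchmark_cost / pro_3_1_benchmark_cost

print(f"input price multiple:  {input_multiple:.1f}x")
print(f"output price multiple: {output_multiple:.1f}x")
print(f"3.5 Flash vs 3.1 Pro on the benchmark suite: {benchmark_multiple:.2f}x")
```

The last figure shows the budget model costing roughly 1.7 times as much as the Pro model on the same suite, a comparison that per-token list prices alone would not reveal.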

Google’s Full-Stack Advantage: Cheap Tokens, Not-So-Cheap Models

Google is leaning on its full-stack infrastructure to reframe the debate around AI inference costs. With control over chips, data centers, and software, it can offer tokens that appear cheaper than those of rival frontier models while still protecting margins. Sundar Pichai said that if top Google Cloud customers moved 80% of their AI workloads to a mix of Gemini 3.5 Flash and other frontier models, they could save more than USD 1 billion (approx. RM4.6 billion) a year, a figure that shows the scale at which token pricing matters. At the same time, Gemini 3.5 Flash has crept close to Pro-level pricing for shorter prompts, narrowing the gap that once clearly separated capability tiers. Google’s strategy shows how infrastructure control lets a provider discount one slice of the stack while nudging customers toward more profitable, higher-capability tiers elsewhere.

Startups Get Discounts, Enterprises Get Complexity

Across the market, aggressive discounting and selective price hikes are creating a split between startups and large enterprises. Some smaller AI providers are slashing rates dramatically, with examples of 97.5% reductions that turn once-luxury reasoning workloads into something close to commodity infrastructure. This helps early-stage teams run long-context agents, code assistants, and data-cleaning tools without crushing margins. But established vendors are also raising prices in less obvious ways: OpenAI has doubled per-token rates across model generations, while Anthropic held Claude Opus pricing steady but changed its tokenizer so that the same text now produces more billable tokens. In parallel, GitHub has shifted Copilot toward token-based billing. The headline story of cheap AI hides a more complex reality in which capability tier, tokenizer behavior, and agentic usage patterns drive actual AI spending far more than the sticker price per million tokens.
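A tokenizer change can raise a bill even when the listed price does not move. The sketch below illustrates the mechanism with entirely hypothetical numbers: the price and the 20% increase in token count are assumptions made for the example, not figures reported for any vendor.

```python
def bill_usd(tokens, price_per_million_usd):
    """Bill for a given token count at a per-million-token price."""
    return tokens * price_per_million_usd / 1_000_000

# Hypothetical values, for illustration only.
PRICE_PER_MILLION_USD = 10.00        # unchanged sticker price
TOKENS_OLD_TOKENIZER = 1_000_000     # tokens for a fixed body of text
TOKENS_NEW_TOKENIZER = 1_200_000     # same text, assumed 20% more tokens

old_bill = bill_usd(TOKENS_OLD_TOKENIZER, PRICE_PER_MILLION_USD)
new_bill = bill_usd(TOKENS_NEW_TOKENIZER, PRICE_PER_MILLION_USD)
effective_increase = new_bill / old_bill - 1

print(f"old bill: ${old_bill:.2f}")
print(f"new bill: ${new_bill:.2f}")
print(f"effective price increase: {effective_increase:.0%}")
```

Because the bill scales linearly with token count, any percentage growth in tokens per unit of text passes straight through as the same percentage increase in cost for that text.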