AI Model Pricing Hits Rock Bottom: How Cheap Inference Is Rewriting the Rules


What the new AI model pricing era really means

AI model pricing, meaning how providers charge for token-based usage of large models, is shifting from premium software margins to infrastructure-level economics that reward scale, efficiency and low inference costs for developers and startups. The headline change is steep price compression. Tencent Cloud has cut prices for the DeepSeek-V4 series by up to 97.5% while keeping model performance unchanged, turning a once-expensive reasoning option into a routine budget line item. On the API side, Xiaomi’s MiMo V2.5 Pro now offers a reasoning-focused model at about USD 1 (approx. RM4.60) per million input tokens and USD 3 (approx. RM13.80) per million output tokens, bringing capable reasoning into the same price range as cheaper chat models. For teams that previously rationed tokens, this shift resets what is financially possible.
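To put the headline cut in concrete terms, here is a minimal sketch of the arithmetic. It assumes the maximum 97.5% reduction applies to the whole bill, which the "up to" wording does not guarantee, so the result is an upper bound. The function names are illustrative and not part of any provider's API.

```python
from fractions import Fraction


def remaining_price_fraction(cut_percent: Fraction) -> Fraction:
    """Fraction of the original price still paid after a percentage cut."""
    if not 0 <= cut_percent < 100:
        raise ValueError("cut_percent must be in [0, 100)")
    return 1 - cut_percent / 100


def budget_multiplier(cut_percent: Fraction) -> Fraction:
    """How many times more tokens the same budget buys after the cut."""
    return 1 / remaining_price_fraction(cut_percent)


# Maximum cut quoted for the DeepSeek-V4 series on Tencent Cloud.
max_cut = Fraction("97.5")

print(remaining_price_fraction(max_cut))  # 1/40
print(budget_multiplier(max_cut))         # 40
```

In other words, at the maximum discount a fixed budget stretches to as much as 40 times the previous token volume.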

DeepSeek and MiMo turn reasoning models into a price war

DeepSeek and MiMo are turning the high-end reasoning category into a full-blown API pricing competition. DeepSeek V4 Pro, already discounted heavily, is set to stay at one quarter of its original rate after its current promotion ends, while Tencent Cloud’s pricing update delivers a maximum 97.5% reduction for the broader DeepSeek-V4 series. MiMo V2.5 Pro lands in the same tier with clear, aggressive pricing for long-context, tool-heavy workloads. According to Xiaomi’s MiMo API pricing page, MiMo V2.5 Pro is listed at about USD 1 (approx. RM4.60) per million input tokens and USD 3 (approx. RM13.80) per million output tokens for prompts up to 256,000 tokens. This is the segment where token bills used to crush business models: software agents planning, reading files, writing code and looping tools. Now those workloads can be designed around capability rather than fear of runaway invoices.
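As a rough illustration of what these list prices mean for an agent workload, the sketch below estimates the cost of a tool-looping run at the MiMo V2.5 Pro rates quoted above. The step count and per-step token sizes are hypothetical, and the function is an assumption for illustration rather than anything from Xiaomi's API. Because the quoted rate covers prompts up to 256,000 tokens, larger prompts are rejected instead of guessed at.

```python
from decimal import Decimal

# Rates quoted in the article for MiMo V2.5 Pro, in USD per million tokens.
INPUT_USD_PER_MILLION = Decimal("1")
OUTPUT_USD_PER_MILLION = Decimal("3")
MAX_PROMPT_TOKENS = 256_000  # the quoted rate applies up to this prompt size
MILLION = Decimal(1_000_000)


def request_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimated cost of one request at the quoted per-million-token rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if input_tokens > MAX_PROMPT_TOKENS:
        raise ValueError("no rate is quoted for prompts above 256,000 tokens")
    input_cost = Decimal(input_tokens) * INPUT_USD_PER_MILLION
    output_cost = Decimal(output_tokens) * OUTPUT_USD_PER_MILLION
    return (input_cost + output_cost) / MILLION


# Hypothetical agent run: 40 tool-loop steps, each sending 20,000 tokens
# of context and producing 1,500 tokens of output.
steps = 40
per_step = request_cost_usd(20_000, 1_500)
run_cost = per_step * steps

print(per_step)  # 0.0245
print(run_cost)  # 0.9800
```

Under these assumed numbers, a 40-step agent run costs just under one US dollar, the kind of figure that lets teams budget per task rather than per token.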

Google’s full-stack advantage and the race to cheaper inference

While independent labs fight over API pricing, full-stack providers are pushing inference costs down by attacking the problem at the hardware and infrastructure layers. Google’s Gemini 3.5 Flash is pitched as a “good enough” model that rivals frontier systems while focusing on cost and speed, aimed squarely at companies burning through billions of tokens and facing sticker shock from complex AI agents. Sundar Pichai has highlighted that monthly usage of Google’s AI products has reached 3.2 quadrillion tokens, and analysts at William Blair estimate Google may pay 50% to 75% less for internal AI compute than competitors because it owns TPUs, data centers and cloud distribution. The message is clear: in a world where the model alone is no longer the product, owning the stack becomes a pricing weapon. Cheap AI models are becoming a strategic tool, not a marketing stunt.

Why cheaper inference unlocks new products for startups

For startups, collapsing inference costs are less an abstract market trend than a direct expansion of what products they can build. Reasoning-heavy agents that read long documents, call tools repeatedly and generate large outputs used to feel uneconomic; now they are within reach. Lower AI model pricing means more freedom to iterate: longer user sessions, richer context windows and fewer harsh quotas during early testing. Most young companies do not need to train their own foundation model: they need reliable access, predictable costs and enough quality to make their workflows useful. Price compression across DeepSeek V4 Pro, MiMo V2.5 Pro and other cheap AI models also makes multi-model routing practical, because experimentation is no longer financially painful. However, the cheapest option is not always best: latency, uptime, data policies and tool reliability still decide whether a model saves money or quietly adds new failure costs.
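A minimal sketch of what multi-model routing could look like follows: pick the cheapest model that still meets the request's requirements. Everything in the catalog is hypothetical, including the model names, the latency figures and the small-model prices. Only the 1 and 3 USD per-million rates and the 256,000-token prompt limit echo figures quoted in this article. A production router would also weigh uptime, data policies and tool reliability, as noted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    input_usd_per_million: float
    output_usd_per_million: float
    max_prompt_tokens: int
    supports_reasoning: bool
    p95_latency_s: float  # assumed figure, would come from your own monitoring


def estimated_cost(model: ModelOption, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request on the given model."""
    return (input_tokens * model.input_usd_per_million
            + output_tokens * model.output_usd_per_million) / 1_000_000


def pick_model(options, input_tokens, output_tokens, needs_reasoning, max_latency_s):
    """Return the cheapest option that satisfies every hard requirement."""
    eligible = [
        m for m in options
        if m.max_prompt_tokens >= input_tokens
        and (m.supports_reasoning or not needs_reasoning)
        and m.p95_latency_s <= max_latency_s
    ]
    if not eligible:
        raise LookupError("no model meets the requirements")
    return min(eligible, key=lambda m: estimated_cost(m, input_tokens, output_tokens))


# Hypothetical catalog; names and most numbers are placeholders.
catalog = [
    ModelOption("reasoning-pro", 1.0, 3.0, 256_000, True, 12.0),
    ModelOption("small-chat", 0.1, 0.3, 32_000, False, 2.0),
]

quick = pick_model(catalog, 5_000, 500, needs_reasoning=False, max_latency_s=5.0)
deep = pick_model(catalog, 100_000, 2_000, needs_reasoning=True, max_latency_s=30.0)
print(quick.name)  # small-chat
print(deep.name)   # reasoning-pro
```

Filtering on hard requirements before comparing price keeps the cheapest option from winning when it cannot actually do the job.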

Compressed margins, tougher middleware, and looming consolidation

The downside of rock-bottom inference costs is margin pressure for everyone who is not highly efficient or vertically integrated. Middleware platforms that sit between developers and base models now walk a tightrope: falling base prices can increase volume, but they also shrink the spread that routing layers live on. As one analysis of OpenRouter’s funding talks noted, aggregators must offer real value (smart routing, observability, governance, billing controls) rather than simply reselling access. At the same time, aggressive pricing gives DeepSeek, MiMo and other labs a way into production workflows before larger incumbents can lock in budgets. Over time, that dynamic points toward market consolidation. Only operators with strong cost structures, differentiated tooling or deep integration into customer stacks will survive in a world where inference is priced like electricity and every extra fraction of a cent is contested.
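The squeeze on routing layers can be sketched with simple arithmetic. The example assumes an aggregator that earns a flat percentage markup on the base price, which is a simplification and not a description of any named company's business model. The base price, markup and volumes are hypothetical.

```python
from decimal import Decimal


def spread_revenue_usd(volume_millions: Decimal,
                       base_usd_per_million: Decimal,
                       markup_rate: Decimal) -> Decimal:
    """Revenue an aggregator keeps when it charges a percentage on top of base price."""
    return volume_millions * base_usd_per_million * markup_rate


markup = Decimal("0.05")  # hypothetical 5% markup

# Before: hypothetical base price of USD 4 per million tokens, 1,000M tokens routed.
before = spread_revenue_usd(Decimal("1000"), Decimal("4"), markup)

# After: base price drops to one quarter while routed volume doubles.
after = spread_revenue_usd(Decimal("2000"), Decimal("1"), markup)

print(before)          # 200.00
print(after)           # 100.00
print(after / before)  # 0.5
```

Under these assumptions, doubling volume is not enough: with the base price at one quarter, the same markup yields half the revenue, and volume would need to grow fourfold just to stand still.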
