Why Companies Can’t Prove Their AI Is Working

The AI Measurement Paradox Inside the Enterprise

The AI measurement paradox is the growing gap between an enterprise’s ability to track AI usage in fine detail and its inability to show that this activity creates clear business results, such as better products, higher margins, or stronger customer value. Across large organizations, AI activity and token usage are now counted minute by minute: who is calling which model, how many tokens each prompt burns, and which teams push more AI-generated code. Engineers compare dashboards, product leaders watch token spikes around launches, and finance leaders see detailed bills. Yet when executives ask how these tokens translate into new revenue or lower costs, the data breaks down. Companies can prove that AI is busy everywhere. They cannot prove, with the same confidence, that AI is useful in ways that matter for the business.

From Token Maxxing to ROI Shock

Early enterprise AI programs fell into token maxxing: treating high token consumption as a badge of innovation rather than a cost tied to value. A leaderboard inside Meta ranked more than 85,000 employees by usage, handing out titles like “Token Legend” to the heaviest consumers. Uber and others gamified AI coding, racing to increase agent usage across engineering teams. According to reporting on Uber’s internal programs, roughly 10% of code changes are now generated by autonomous agents, and parts of its Claude Code budget were exhausted early in 2026. Then the bills and questions arrived. Executives started asking why they were spending so much and what they had to show for it. Amazon shut down an internal token leaderboard, while Uber’s leadership began questioning whether soaring consumption connected to any measurable lift in customer features or financial results.

Why Productivity Metrics Don’t Equal Enterprise AI Value

Leaders can point to AI productivity metrics—faster code creation, more experiments, shorter drafts—but those numbers stop at the edge of individual work. Uber’s president Andrew Macdonald summed up the gap when he said the company cannot draw a clear line between higher AI usage and customer-facing results. Surveys show a similar pattern: about 79% of organizations report productivity gains from AI, but only 29% report significant ROI. Only 21% of S&P 500 companies can cite any measurable AI benefit. This echoes the old “lines of code” trap, where more code said nothing about software quality. In the same way, token counts, agent calls, and GPU utilization say little about whether AI improves products, reduces churn, or accelerates revenue. Activity is easy to count; value often sits several steps downstream in the workflow.

Pilot Addiction and the Failure to Learn at Scale

Another reason AI ROI measurement stalls is what Kore.ai’s chief strategy officer Cathal McCarthy calls “addiction to pilots.” Teams spin up impressive proofs of concept, each with strong demo appeal and short-term wins, but few survive contact with real production workloads. McCarthy argues that organizations tend to grab low-hanging fruit during pilots, which can be useful but does not teach them how AI behaves at full scale, with real users, messy data, governance needs, and compliance rules. Ben Schein at Domo puts it bluntly: you can vibe-code a compelling prototype in an afternoon, but you cannot vibe-code governance, security, and distribution. The result is a portfolio of isolated pilots that lack shared architecture, shared metrics, and shared learning. They look lively in slide decks, yet 95% deliver no measurable P&L impact.

From Activity Metrics to Production AI Deployment Value

Escaping this trap means shifting from activity metrics to production-value metrics that connect AI directly to outcomes. Executives at Domo and Kore.ai stress starting with clear business problems: reduce ticket handling time, increase successful upsells, or cut time-to-release for a given product line. AI systems are then measured against those targets, not by tokens consumed. That also means mapping where work slows down today: handoffs between teams, approval queues, integration bottlenecks. AI that speeds up one task but jams these joints will not improve enterprise AI value. Instead, organizations need instrumentation that crosses teams and systems so they can see whether faster task execution leads to more features shipped, higher customer satisfaction, or better margins. Production AI deployment matters only when it is wired into this chain of measurable cause and effect.
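
To make the contrast between activity metrics and outcome metrics concrete, here is a minimal Python sketch of a report that judges an AI deployment against a business target (ticket handling time) rather than by tokens consumed. Everything in it is an assumption for illustration: the `PeriodStats` fields, the `outcome_report` helper, and all of the numbers are hypothetical and do not come from any company mentioned in this article. A before-and-after comparison like this also does not prove causation on its own; it only puts spend and outcome side by side.

```python
from dataclasses import dataclass


@dataclass
class PeriodStats:
    """Aggregates for one team over one reporting period (hypothetical fields)."""
    tokens_used: int            # activity metric: easy to count
    token_cost_usd: float       # AI spend for the period
    tickets_resolved: int       # outcome volume
    avg_handle_minutes: float   # outcome metric the business actually targets


def outcome_report(baseline: PeriodStats, current: PeriodStats,
                   target_handle_minutes: float) -> dict:
    """Compare an AI-assisted period with a pre-AI baseline on the outcome metric."""
    if current.tickets_resolved == 0:
        raise ValueError("no resolved tickets in the current period")
    minutes_saved = baseline.avg_handle_minutes - current.avg_handle_minutes
    return {
        # Reported for context only; it is not used to judge success.
        "tokens_used": current.tokens_used,
        "cost_per_ticket_usd": current.token_cost_usd / current.tickets_resolved,
        "minutes_saved_per_ticket": minutes_saved,
        "total_hours_saved": minutes_saved * current.tickets_resolved / 60,
        "target_met": current.avg_handle_minutes <= target_handle_minutes,
    }


# Illustrative numbers only (assumed, not measured).
baseline = PeriodStats(tokens_used=0, token_cost_usd=0.0,
                       tickets_resolved=1000, avg_handle_minutes=12.0)
current = PeriodStats(tokens_used=5_000_000, token_cost_usd=500.0,
                      tickets_resolved=1000, avg_handle_minutes=9.0)

report = outcome_report(baseline, current, target_handle_minutes=10.0)
for name, value in report.items():
    print(f"{name}: {value}")
```

The design choice is that token volume appears in the report only as context, while success is defined by the handling-time target and the cost per resolved ticket. Extending the same idea across handoffs and approval queues would require data from several systems, which is the cross-team instrumentation described above.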
