AI ROI Measurement: Why Tokens Don’t Prove Value

From Token Maxxing to the AI ROI Measurement Trap

AI ROI measurement is the effort to connect fast-growing, easily tracked indicators of AI usage (token consumption, code generation, agent sessions) to the harder business outcomes that prove whether AI creates real enterprise value: revenue growth, cost savings, and customer satisfaction. Enterprises can now track AI activity with remarkable precision: tokens consumed, prompts issued, lines of code generated, GPU utilization. Leaders see eye‑catching dashboards and conclude that AI adoption is racing ahead. Tech giants have turned this into a game. Jensen Huang suggested a USD 500,000 (approx. RM2,300,000) engineer should consume USD 250,000 (approx. RM1,150,000) in tokens per year, while internal leaderboards at companies like Meta celebrated “Token Legends.” Yet this fixation on activity hides a core problem: counting AI usage is easy, but measuring AI value is not.

Uber’s Productivity Boom, ROI Crisis, and the Limits of Activity

Uber offers a clear view of this gap. AI agents now generate about 10% of the company’s code changes, and CEO Dara Khosrowshahi describes the effect as creating “employees with superpowers.” In response, Uber slowed hiring growth and shifted more spending toward AI, betting that higher throughput per person will pay off. Yet President and COO Andrew Macdonald has admitted that “it’s very hard to draw a line” from statistics like token use or AI‑generated code to “25% more useful consumer features.” Enterprise survey data shows the same pattern at scale: 79% of organizations report individual productivity gains from AI, but only 29% see significant ROI. There is plenty of measurable activity and far less measurable outcome. Faster coding, more experiments, and heavier AI consumption may feel like progress, but they do not prove that customers get better products or that the business is stronger.

When Leaderboards Drive Spend But Not Outcomes

Inside many enterprises, token counts have become a scoreboard. Alphabet leadership highlighted that Google now processes more than 3.2 quadrillion tokens per month, and hundreds of cloud customers each run more than a trillion tokens a year. At Meta, an internal “Claudeonomics” leaderboard ranked over 85,000 employees by token usage and honored top users with titles like “Token Legend” and “Session Immortal.” Uber followed a similar pattern: after gamifying AI agent use, it burned through its 2026 AI coding budget in four months, prompting its COO to ask what the company was “even doing” with all that spend. Amazon reportedly shut down its own internal token leaderboard after concluding that workers were chasing usage instead of solving customer problems. These examples show how activity metrics can encourage wasteful behavior while leaving leaders without evidence of real AI value.

Why AI Activity Metrics Don’t Equal Business Value

Pilot Addiction and the Purgatory Before Production

The same measurement gap appears in how organizations run AI pilots. Kore.ai Chief Strategy Officer Cathal McCarthy warns that firms become “addicted to pilots,” treating a string of polished demos as proof of progress. Short experiments tend to focus on low‑hanging fruit: simple use cases where generative AI can draft emails, summarize documents, or answer routine questions. Those pilots generate high usage metrics and enthusiastic anecdotes but almost no measurable P&L impact. Surveys show that 95% of AI pilots deliver zero measurable effect on profit and loss, and only 21% of S&P 500 companies can cite any quantified AI benefit. McCarthy argues that real organizational learning happens at production scale, not in isolated sandboxes. Until pilots move into core workflows with clear success criteria, stakeholders see AI as an interesting prototype rather than a dependable, value‑producing capability.

From Activity Metrics to Outcome-Based AI Value

Executives like Ben Schein at Domo argue that “you can’t vibe code governance, security, and distribution.” In other words, moving from experiments to production means replacing superficial AI productivity metrics with outcome‑based measures that tie to strategy. The historical lesson from “lines of code” is clear: more activity can mean worse software. AI is replaying that mistake with tokens and generated output. To escape the trap, organizations need frameworks that map AI usage to concrete business goals: faster feature release cycles, higher customer satisfaction, fewer support tickets, or lower unit costs. That demands tracking work at the workflow and handoff layers, where friction between teams and systems often negates local productivity gains. Enterprise AI adoption will only move from hype to durable value when dashboards show not only how much AI people use, but how much it changes what the business delivers and earns.
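One way to picture the mapping from usage to outcomes is a small report that places an activity metric next to outcome metrics for the same workflow, before and after an AI rollout. The sketch below is illustrative only: the `WorkflowPeriod` fields, the 5% improvement threshold, and the sample figures are all assumptions made for this example, not data or a method from any company named above.

```python
from dataclasses import dataclass


@dataclass
class WorkflowPeriod:
    """Metrics for one workflow in one reporting period (hypothetical fields)."""
    name: str
    tokens_used: int          # activity metric
    cycle_time_days: float    # outcome: time to release a feature
    support_tickets: int      # outcome: tickets filed in the period


def relative_change(before: float, after: float) -> float:
    """Fractional change from before to after; negative means a decrease."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (after - before) / before


def outcome_report(before: WorkflowPeriod, after: WorkflowPeriod) -> dict:
    """Put the change in activity side by side with the change in outcomes."""
    return {
        "workflow": after.name,
        "token_change": relative_change(before.tokens_used, after.tokens_used),
        "cycle_time_change": relative_change(
            before.cycle_time_days, after.cycle_time_days
        ),
        "ticket_change": relative_change(
            before.support_tickets, after.support_tickets
        ),
    }


def activity_without_outcome(report: dict, min_improvement: float = 0.05) -> bool:
    """True when token use grew but no outcome improved by min_improvement.

    The 5% default is an arbitrary illustrative threshold.
    """
    usage_grew = report["token_change"] > 0
    improved = (
        report["cycle_time_change"] <= -min_improvement
        or report["ticket_change"] <= -min_improvement
    )
    return usage_grew and not improved


# Made-up figures: token use quadruples while outcomes barely move.
baseline = WorkflowPeriod("checkout", 1_000_000, 20.0, 200)
current = WorkflowPeriod("checkout", 4_000_000, 19.5, 198)
report = outcome_report(baseline, current)
flagged = activity_without_outcome(report)
```

In this made-up case the workflow would be flagged, because usage rose sharply while cycle time and ticket volume stayed nearly flat. A real framework would also need to account for attribution, seasonality, and handoffs between teams, all of which this sketch ignores.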