AI ROI Measurement: Why Tokens Don’t Equal Value

Defining the AI Productivity Gap

The AI productivity gap is the widening disconnect between the growing volume of measurable AI activity and an organization’s limited ability to show, in a traceable way, that this activity improves revenue, margins, product quality, or customer satisfaction. Enterprises now track AI tokens, code suggestions, and chatbot conversations with granular precision, but this visibility has not solved AI ROI measurement. The result is an awkward tension: leaders see “employees with superpowers,” faster code generation, and more experiments, while finance teams struggle to tie any of it to P&L impact. Organizations are discovering that counting AI activity is far easier than measuring AI value. The gap is pushing executives to ask whether they are building sustainable capabilities or merely staging metrics theater, with enterprise AI metrics that say little about business performance.

Uber’s AI Productivity Boom, ROI Measurement Bust

Uber illustrates the AI productivity gap in real time. CEO Dara Khosrowshahi says AI tools are creating “employees with superpowers,” and the company can see that roughly 10% of code changes are now generated by autonomous agents. Uber slowed hiring growth while shifting more spend into AI, betting that higher throughput per person will compensate for slower headcount growth. Yet President and COO Andrew Macdonald underscores the measurement crisis: “That link is not there yet, right? … it’s very hard to draw a line between one of those stats and, ‘Okay, now we’re actually producing 25% more useful consumer features.’” Uber can count tokens and AI-generated commits, but it lacks a clear way to link them to better rider experiences, higher margins, or more reliable services. Its challenge mirrors a broader AI ROI measurement problem across large enterprises.

From Token Maxxing to Metrics Theater

Across the industry, token consumption has become a noisy proxy for progress. Nvidia’s Jensen Huang suggested that a USD 500,000 (approx. RM2,300,000) engineer or AI researcher should consume at least USD 250,000 (approx. RM1,150,000) in tokens a year, and token volume soon became an internal scoreboard. At Google, executives highlighted quadrillions of tokens processed each month, while a Meta engineer built a “Claudeonomics” leaderboard ranking 85,000 employees by usage and celebrating “Token Legends.” Uber gamified consumption and burned through its 2026 AI coding budget in four months. According to reporting cited by Domo’s Ben Schein, Amazon went the other way, shutting down an internal leaderboard and telling staff to solve business problems instead of chasing usage. These examples show how enterprise AI metrics can slide into theater: impressive numbers with weak links to product quality, margins, or customer satisfaction.

Why AI Activity Metrics Don’t Equal Business Results

The Pilot Addiction and the Production-Scale Wall

Cheap experimentation has left many firms, in the words of Kore.ai Chief Strategy Officer Cathal McCarthy, “addicted to pilots.” Teams spin up impressive demos, run short proofs of concept, and celebrate fast wins, but little of this work reaches the messy realities of production-scale deployment. McCarthy notes that organizations tend to pick low-hanging fruit for pilots, which offers quick satisfaction but limited learning. Real organizational learning happens only once AI systems face production workloads, legacy integrations, governance requirements, and support teams. Domo’s Ben Schein draws the same line from another angle: you can “vibe-code” a slick prototype, but you cannot “vibe code governance, security, and distribution.” Until AI projects get past this wall, enterprise AI metrics stay stuck at the activity layer, and the gap between shiny prototypes and durable business outcomes keeps widening.

Measuring AI Value Beyond Tokens and Tasks

The harder question is how to measure AI value rather than AI activity. Earlier generations of software teams learned that “lines of code” were a poor productivity metric, because more code did not guarantee better software. Generative AI repeats the trap with tokens, code suggestions, and GPU utilization. Surveys show that 79% of organizations report productivity gains from AI at the individual level, yet only 29% report significant ROI, and 95% of AI pilots deliver no measurable P&L impact. Many of the most important effects (better product quality, higher customer satisfaction, fewer defects, faster cycle times, improved margins) can only be captured by new measurement systems that span teams and workflows. AI accelerates individual tasks, but value lives in the handoffs between teams, in the customer experience, and on the income statement. Until those links are instrumented, AI ROI measurement will lag far behind AI activity.