Why Companies Can Measure AI Activity But Can’t Prove It Works

The New Paradox of Enterprise AI Metrics

The paradox of enterprise AI metrics is that organizations can track AI activity in extreme detail, from tokens consumed to code changes made, yet still struggle to prove that this activity translates into business outcomes, revenue, or customer value. AI ROI measurement has become precise at the activity layer and vague at the results layer, creating a widening gap between what can be monitored and what can be justified to executives and boards. Enterprises now watch dashboards that show token usage, prompt sessions, and AI-generated assets across teams. These tools give leaders confidence that AI is “busy,” but they rarely answer the harder question: is this work worth it? As the cost of AI pilots and experimentation rises, that disconnect turns from a curiosity into a strategic risk.

From Token Maxxing to ‘Why Are We Spending All This Money?’

Early enterprise AI adoption has been driven by token maxxing, where high usage becomes a status symbol rather than a value signal. Google publicly highlights quadrillions of tokens processed each month, and an engineer at Meta built an internal leaderboard ranking more than 85,000 employees by token consumption, handing out titles like “Token Legend” and “Session Immortal.” Inside these cultures, enterprise AI metrics reward activity volume, not outcomes. The backlash has started. Amazon shut down its internal token leaderboard after leadership urged teams to focus on customer and business problems instead of chasing usage. Uber reportedly burned through its 2026 AI coding budget in four months after gamifying consumption. Ben Schein of Domo sums up the turning point: once the bill arrives, executives ask, “Why are we spending all this money? What are we even doing with it?”

Uber’s Productivity Boom, ROI Crisis, and the Measurement Trap

Uber is a clear example of AI productivity tracking outpacing AI ROI measurement. The company can see that roughly 10% of code changes are generated by autonomous agents, and internal tools show aggressive AI use across legal, marketing, and engineering. Leadership believes these agents create “employees with superpowers,” and hiring plans are already shifting to assume higher throughput per person. Yet Uber’s president Andrew Macdonald concedes that “the link is not there yet” between rising AI metrics and better customer-facing results. The company cannot show that more tokens or faster code changes deliver 25% more useful features to riders or drivers. This mirrors the old “lines of code” trap: counting more output without knowing whether it improves the product. Enterprises are learning that they can monitor AI at a fine-grained level while remaining blind to whether any of that activity matters to customers.

Pilot Purgatory and the Addiction to AI Experiments

Alongside token maxxing, many organizations are stuck in what Kore.ai’s Cathal McCarthy calls an addiction to pilots. Teams run impressive AI demos and limited trials that generate attractive internal metrics (tokens used, tasks automated, turnaround times reduced), but these pilots never scale into stable production systems that affect profit and loss. McCarthy argues that “organizational learning happens at production scale,” yet most pilots target low-hanging fruit and quick wins that avoid the hard integration work. Ben Schein at Domo makes a similar point from the governance side: you can prototype with “vibe coding” in an afternoon, but “you can’t vibe code governance, security, and distribution.” The move from pilot to production stalls because the systems that matter, such as risk management, approvals, and workflows, are not designed around AI, even while dashboards report rising usage and apparent productivity gains.

Why Better Tracking Does Not Equal Better Outcomes

The core problem is that enterprise AI metrics are misaligned with value. Organizations can count tokens, prompts, GPU time, and AI-authored code, but they lack a trusted way to map that activity to business outcomes. Surveys show the scale of the gap: 79% of organizations report productivity gains from AI at the individual level, yet only 29% report significant ROI from generative AI, and 95% of AI pilots deliver no measurable P&L impact. The constraint sits between teams and systems, where faster execution meets slow coordination. AI accelerates tasks while bottlenecks move to approvals, integrations, and product decisions, so value evaporates at the handoff layer. This measurement trap is forcing enterprises to rethink AI ROI measurement: shifting from counting activity to tying AI work directly to shipped features, customer adoption, and financial results before expanding pilots or celebrating token statistics.
