Why AI Agents Fail at Long-Running Tasks—and What It Means for Workplace Automation

A Harsh Reality Check for Autonomous AI in the Workplace

AI agents are being marketed as tireless digital coworkers that can autonomously handle entire workflows once given a goal. Yet new research from Microsoft sharply undercuts that promise. The company’s scientists tested how leading large language models behave when asked to execute long-running, multi-step knowledge work, exactly the kind of office task that vendors say can be safely delegated. Using a benchmark called DELEGATE-52, which simulates complex workflows across 52 professional domains, the team found that even top-tier models quietly corrupt documents as tasks progress. Instead of acting like reliable project teammates, these systems behave more like unsupervised interns who gradually break the assignment. The findings raise uncomfortable questions for enterprises eagerly wiring AI into their document repositories, business processes, and productivity suites on the assumption that set-and-forget automation is within reach.

Inside DELEGATE-52: How Long-Running Tasks Break AI Models

DELEGATE-52 is designed to mimic realistic, multi-step projects rather than simple, one-shot prompts. In accounting, for example, a model receives an initial ledger document, splits it into category-based files, and then recombines everything chronologically into a final consolidated record. Across these extended interactions, Microsoft’s researchers observed substantial failures: frontier models such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 lost an average of 25 percent of document content over 20 delegated interactions, while the average degradation across all tested models reached 50 percent. Performance varied by domain: programming tasks fared comparatively well, while natural-language-heavy work degraded more severely. The team set a “ready for work” bar at 98 percent content preservation after 20 rounds. Only one area, Python programming, met that threshold, highlighting how brittle current systems remain when they must repeatedly edit or transform the same evolving artifacts.
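
To make the shape of such a task concrete, the following is a minimal Python sketch of the split-then-recombine workflow described above. The CSV layout, file names, and category field are illustrative assumptions rather than the benchmark’s actual specification.

```python
# Illustrative sketch of the accounting-style task: split a ledger into
# per-category files, then merge them back chronologically. The record
# format here is an assumption, not the paper's task definition.
import csv
from collections import defaultdict
from datetime import date

LEDGER = [
    {"date": "2024-01-15", "category": "travel", "amount": "412.00"},
    {"date": "2024-01-03", "category": "payroll", "amount": "9800.00"},
    {"date": "2024-02-01", "category": "travel", "amount": "88.50"},
]

def split_by_category(records):
    """Step 1: split the initial ledger into one file per category."""
    buckets = defaultdict(list)
    for row in records:
        buckets[row["category"]].append(row)
    for category, rows in buckets.items():
        with open(f"ledger_{category}.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["date", "category", "amount"])
            writer.writeheader()
            writer.writerows(rows)
    return sorted(buckets)

def recombine_chronologically(categories):
    """Step 2: merge the category files into one record, ordered by date."""
    merged = []
    for category in categories:
        with open(f"ledger_{category}.csv", newline="") as f:
            merged.extend(csv.DictReader(f))
    merged.sort(key=lambda row: date.fromisoformat(row["date"]))
    with open("ledger_consolidated.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["date", "category", "amount"])
        writer.writeheader()
        writer.writerows(merged)
    return merged

if __name__ == "__main__":
    categories = split_by_category(LEDGER)
    final = recombine_chronologically(categories)
    # A lossless agent ends with exactly the original rows, reordered.
    assert len(final) == len(LEDGER)
```

A lossless run ends with exactly the original records, reordered; the study’s finding is that models repeatedly editing artifacts like these tend to drop or mangle entries along the way.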

From Small Errors to Catastrophic Corruption

A key finding is that model failures on long-running tasks are not gradual, predictable drifts; they are sudden collapses. The study shows that models tend to suffer catastrophic corruption events, in which 10 to 30 percentage points of quality vanish in a single interaction rather than eroding steadily over time. In over 80 percent of model-domain combinations, end-of-simulation scores fell to 80 percent or below, a level the authors classify as catastrophic. Interestingly, weaker models mostly delete content, while stronger frontier systems are more likely to corrupt it, producing confident but wrong or mangled material. The researchers note that better models do not avoid errors so much as delay them: they maintain a veneer of competence for more rounds before failing sharply. This behavior complicates monitoring, because early success in a workflow gives a false sense of security about how the agent will perform later.
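
That failure pattern is straightforward to express as a measurement. The sketch below, which uses invented scores rather than the paper’s data, flags any single-round drop of ten or more points and checks for the 80 percent “catastrophic” end state.

```python
# Given per-round preservation scores (0-100), flag sharp single-round
# collapses and the catastrophic end state. Scores are hypothetical.

def find_catastrophic_events(scores, drop_threshold=10):
    """Return (round, drop) pairs where quality fell sharply in one step."""
    events = []
    for i in range(1, len(scores)):
        drop = scores[i - 1] - scores[i]
        if drop >= drop_threshold:
            events.append((i, drop))
    return events

rounds = [100, 99, 99, 98, 97, 96, 74, 73, 72, 51]  # invented run
print(find_catastrophic_events(rounds))             # [(6, 22), (9, 21)]
print("catastrophic end state:", rounds[-1] <= 80)  # True
```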

Why Tools and Agents Aren’t Fixing the Problem Yet

Vendors often claim that giving models tools, such as file access and code execution, turns them into robust agents capable of safely managing complex tasks. Microsoft’s team put that belief to the test by wrapping models in a basic agentic harness and letting them read and write files and execute code while working through DELEGATE-52 scenarios. The result was discouraging: tool-augmented agents actually performed worse than models operating without tools, adding an average of 6 percent extra degradation by the end of the simulations. In other words, more autonomy and capability did not translate into more reliability. Crucially, performance after just two interactions did not predict outcomes after 20, underscoring that short-horizon benchmarks and quick demos are poor proxies for real-world, long-running workflows in which subtle errors compound into major corruption.
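
For readers unfamiliar with the term, an agentic harness is essentially a loop in which the model chooses tool calls until it declares the task done. The sketch below shows the general shape, with a scripted call_model standing in for a real LLM API; the tool names, message format, and loop structure are assumptions for illustration, not Microsoft’s actual setup.

```python
# A minimal sketch of an agentic harness, assuming a simple JSON
# tool-call protocol. call_model is a scripted stand-in for an LLM API
# so the loop is runnable end to end.
import json
import subprocess

def read_file(path):
    with open(path) as f:
        return f.read()

def write_file(path, content):
    with open(path, "w") as f:
        f.write(content)
    return f"wrote {len(content)} chars to {path}"

def run_python(code):
    # Letting the model execute its own code is exactly where silent
    # corruption can slip in; real deployments would sandbox this.
    result = subprocess.run(["python", "-c", code], capture_output=True, text=True)
    return result.stdout + result.stderr

TOOLS = {"read_file": read_file, "write_file": write_file, "run_python": run_python}

def call_model(history):
    """Stand-in for an LLM call. A real harness would send `history` to a
    model and parse a JSON tool call from its reply; here the actions are
    scripted so the example runs."""
    step = sum(1 for msg in history if msg["role"] == "tool")
    script = [
        {"tool": "write_file", "args": {"path": "notes.txt", "content": "draft 1"}},
        {"tool": "read_file", "args": {"path": "notes.txt"}},
        {"tool": "done"},
    ]
    return script[step]

def agent_loop(task, max_rounds=20):
    history = [{"role": "user", "content": task}]
    for _ in range(max_rounds):
        action = call_model(history)  # model picks the next tool call
        if action["tool"] == "done":
            break
        output = TOOLS[action["tool"]](**action["args"])
        history.append({"role": "tool", "content": json.dumps(output)})
    return history

print(len(agent_loop("keep notes.txt up to date")))  # 3 messages exchanged
```

Every pass through a loop like this is another chance to overwrite a file incorrectly, which is one plausible reading of why tool access amplified rather than contained degradation.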

Enterprise Automation Risks: The Intern You Can’t Fire

For enterprises, these limitations have direct business implications. Organizations are already channeling a significant share of their digital budgets into AI automation, betting that agents will reliably replace portions of knowledge work. Yet the study suggests that an AI “coworker” that silently corrupts a quarter of a document over the course of a project would be treated like an incompetent intern in any real office. The gap between AI marketing and actual capability creates concrete enterprise automation risks: corrupted financial ledgers, broken compliance records, or subtly wrong analyses that evade detection until it is too late. While the research notes that model performance has improved markedly over the past 16 months, it also concludes that outside narrow domains such as Python coding, current systems are not ready for unsupervised, delegated workflows. For now, business leaders should treat AI agents as assistive tools that require close human oversight, not as autonomous replacements for knowledge workers.
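
One practical form that oversight can take is a guardrail between agent rounds. The sketch below measures how much of the original document survives each proposed edit and routes the change to a human when preservation falls below a bar; the 98 percent threshold echoes the study’s “ready for work” criterion, while the diff-based metric and function names are illustrative assumptions.

```python
# Illustrative guardrail: block an agent's edit when too little of the
# original document survives, and escalate to a human instead.
import difflib

def preservation_ratio(before: str, after: str) -> float:
    """Fraction of the original text still present after the edit."""
    matcher = difflib.SequenceMatcher(None, before, after)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / max(len(before), 1)

def checked_update(before: str, after: str, threshold: float = 0.98) -> str:
    ratio = preservation_ratio(before, after)
    if ratio < threshold:
        raise ValueError(
            f"agent round kept only {ratio:.0%} of the document; "
            "routing to human review instead of applying the edit"
        )
    return after
```

A similarity check like this catches wholesale deletions, but the subtle, confident corruption that stronger models produce can keep the ratio high, which is why the study still points to human review rather than automated gates alone.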
