
Why AI Agents Keep Failing at Long-Running Tasks—And What Developers Can Do About It


Microsoft Puts AI Agents’ Long-Task Claims to the Test

Enterprise buyers have been promised AI agents that can autonomously handle complex, multistep work, from document processing to full project execution. Microsoft’s own marketing for Microsoft 365 Copilot and Anthropic’s positioning of Claude as a “coworker” both lean heavily on this idea. But researchers Philippe Laban, Tobias Schnabel, and Jennifer Neville at Microsoft set out to see what actually happens when AI agents are trusted with extended workflows. Their preprint, bluntly titled “LLMs Corrupt Your Documents When You Delegate,” introduces DELEGATE-52, a benchmark spanning 52 professional domains. Instead of simple spreadsheet sorting, the tests simulate realistic, long-running knowledge work such as accounting workflows, coding, crystallography, and music notation. The goal: measure whether large language models can maintain accuracy and structure across 20 interaction rounds. The results highlight severe AI model limitations in long-horizon scenarios, challenging the hype around fully autonomous AI agents on long tasks and raising fresh enterprise automation challenges.
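To make the evaluation setup concrete, here is a minimal sketch of a long-horizon test loop in the spirit of DELEGATE-52: one delegated instruction per round, with content preservation scored after every round. The `agent` callable and the similarity-based preservation metric are illustrative assumptions, not the benchmark's actual implementation.

```python
import difflib

def content_preserved(original: str, current: str) -> float:
    """Fraction of the original document still recoverable, via sequence matching."""
    return difflib.SequenceMatcher(None, original, current).ratio()

def run_long_horizon_eval(agent, document: str, instructions: list[str]) -> list[float]:
    """Apply one delegated instruction per round; score preservation after each round.

    `agent` is any callable (document, instruction) -> revised document.
    """
    scores, current = [], document
    for instruction in instructions:  # e.g., 20 rounds, as in the benchmark
        current = agent(current, instruction)
        scores.append(content_preserved(document, current))
    return scores

# A no-op agent preserves 100 percent of content in every round.
if __name__ == "__main__":
    noop = lambda doc, _instruction: doc
    print(run_long_horizon_eval(noop, "ledger v1", ["edit"] * 20))
```

The point of scoring after every round rather than only at the end is exactly the paper's: a two-step check can look perfect while a twenty-step trajectory collapses.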

How Long Workflows Break Today’s Models

The DELEGATE-52 results are sobering for anyone expecting robust AI task persistence. Across frontier models like Gemini 3.1 Pro, Claude 4.6 Opus, and GPT-5.4, Microsoft researchers observed an average loss of 25 percent of document content over 20 delegated interactions. Across all tested models, average degradation climbed to 50 percent. Programming tasks fared comparatively better, but natural-language-heavy workflows saw substantial corruption. To be deemed “ready” for a domain, a model had to preserve at least 98 percent of content after 20 interactions; only Python programming met that bar. Worse, errors weren’t slow drifts; they often arrived as sudden, catastrophic failures, dropping scores by 10 to 30 points in a single round. In over 80 percent of model–domain combinations, benchmark scores fell to 80 percent or lower, indicating severe corruption. These findings expose structural AI model limitations when tasks extend beyond short, tightly scoped sessions.
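That failure shape suggests a simple diagnostic: track the per-round preservation score and flag sudden collapses rather than only the final number. The sketch below plugs in the thresholds reported here (a 98 percent readiness bar, a 10-point single-round drop treated as a crash); the function itself is hypothetical, not taken from the paper.

```python
def analyze_trajectory(scores: list[float], ready_bar: float = 0.98,
                       crash_drop: float = 0.10) -> dict:
    """Summarize a per-round content-preservation trajectory (scores in [0, 1])."""
    crashes = [i for i in range(1, len(scores))
               if scores[i - 1] - scores[i] >= crash_drop]
    return {
        "final_score": scores[-1],
        "ready": scores[-1] >= ready_bar,  # the 98 percent readiness bar
        "crash_rounds": crashes,           # rounds with a sudden collapse
    }

# Example: slow drift for three rounds, then a 25-point collapse in round 4.
print(analyze_trajectory([1.0, 0.97, 0.96, 0.95, 0.70]))
```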

Agents with Tools: More Power, More Corruption

One common belief in the AI community is that adding tools (file access, code execution, and other capabilities) turns LLMs into reliable agents. Microsoft’s team tested this assumption by wrapping models like GPT-5.4, 5.2, 5.1, and 4.1 in a basic agentic harness and running them through DELEGATE-52. Instead of improving outcomes, tools amplified the problem: the agent-enabled models showed an additional 6 percent degradation on average by the end of the simulation. In weaker models, the damage showed up as content deletion; in stronger frontier models, as content corruption, meaning subtle but dangerous changes. Because long workflows compound these issues, the gap between performance after two interactions and after twenty was striking, underscoring the need for long-horizon evaluation. For enterprises that hoped tools would be the missing link for AI agents on long tasks, the message is clear: agentic wrappers alone do not solve AI task persistence or context retention problems.
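For readers unfamiliar with the term, a basic agentic harness is usually just a loop: the model proposes a tool call, the harness executes it against the working files, and the result is fed back. The sketch below shows that pattern under assumed interfaces; the `llm` callable, tool names, and message format are hypothetical, not Microsoft's actual setup.

```python
from pathlib import Path

# Hypothetical tool registry: direct, unguarded file access.
TOOLS = {
    "read_file":  lambda path: Path(path).read_text(),
    "write_file": lambda path, text: Path(path).write_text(text),
}

def run_agent(llm, task: str, max_rounds: int = 20) -> None:
    """Ask the model for its next action, execute it, and feed the result back."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_rounds):
        # e.g., {"tool": "write_file", "args": {...}} or {"tool": "done"}
        action = llm(history)
        if action.get("tool") == "done":
            break
        result = TOOLS[action["tool"]](**action["args"])
        history.append({"role": "tool", "content": str(result)})
```

Notice that nothing in this loop inspects what a `write_file` call just destroyed, which is one concrete way a tool-equipped agent can do more damage, faster, than a chat-only model.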

Enterprise Automation Meets Reality

The findings land at a sensitive moment for organizations pouring digital budgets into AI-powered automation. Deloitte figures cited in the research indicate that, on average, organizations dedicate 36 percent of their digital spend to AI automation. Much of that investment assumes agents can safely take over long-running workflows: document transformation pipelines, continuous ledger maintenance, or iterative report generation. Yet the Microsoft study shows that in 80 percent of simulated conditions, models severely corrupted documents. In real-world terms, an intern who lost a quarter of a critical file over the course of a project would be dismissed, not scaled up. For enterprises, this creates new enterprise automation challenges: how to leverage AI’s speed and breadth without risking catastrophic corruption of work artifacts. The researchers note improvements in families like OpenAI’s GPT over the past 16 months, but today’s reality remains that unsupervised delegation is risky outside a narrow set of domains such as Python coding.

Designing Around AI’s Long-Task Weaknesses

Despite the grim numbers, the study outlines a path forward for developers and platform teams. First, long-horizon testing like DELEGATE-52 should become standard before deploying agents into production workflows; quick, two-step evaluations won’t expose failure modes that only emerge after many rounds. Second, architectures must assume brittle AI task persistence: break workflows into smaller, self-contained segments; enforce strict, schema-based validations; and maintain canonical sources of truth that models cannot overwrite without checks. Third, introduce human-in-the-loop checkpoints at high-risk transitions—such as merges or reconciliations—especially in domains where the benchmark showed catastrophic corruption. Finally, treat tools as amplifiers, not safeguards: a file-writing agent can do more damage, faster, if its outputs aren’t continuously verified. Microsoft’s own conclusion is that users still need to closely monitor delegated workflows. For now, robust enterprise automation will come not from blind trust in AI agents, but from careful system design that cages their weaknesses while exploiting their strengths.
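As one concrete rendering of the second and third recommendations, the sketch below keeps a canonical copy of the document, validates every proposed edit against a required schema and a crude preservation floor, and escalates anything suspicious to a human reviewer. The schema keys, the 75 percent floor, and the `ask_human` hook are assumptions chosen for illustration, not prescriptions from the study.

```python
import difflib
import json

PRESERVATION_FLOOR = 0.75  # reject edits that lose more than 25 percent of content

def validate_edit(canonical: str, proposed: str, required_keys: set[str]) -> bool:
    """Schema check plus a crude content-preservation check against the canonical copy."""
    try:
        record = json.loads(proposed)
    except json.JSONDecodeError:
        return False  # the model broke the document's structure
    if not required_keys <= record.keys():
        return False  # required fields were dropped
    similarity = difflib.SequenceMatcher(None, canonical, proposed).ratio()
    return similarity >= PRESERVATION_FLOOR

def apply_with_checkpoint(canonical: str, proposed: str,
                          required_keys: set[str], ask_human) -> str:
    """Accept a valid edit; route anything suspicious to a human checkpoint."""
    if validate_edit(canonical, proposed, required_keys):
        return proposed
    # High-risk transition: keep the canonical copy unless a human approves.
    return proposed if ask_human(canonical, proposed) else canonical
```

The specific checks matter less than the posture: the model never overwrites the source of truth directly, and every rejection becomes a human checkpoint instead of silent corruption.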
