The Hidden Risk Behind AI’s ‘Autonomous’ Promise
AI-driven enterprise automation is often sold as a way to “set and forget” complex work. Vendors promote AI agents that can tackle multistep research, roam across files and applications, and return polished deliverables with minimal supervision. However, Microsoft researchers have found that this story breaks down once tasks become long-running and involve many handoffs to an AI model. Their DELEGATE-52 benchmark simulates sustained, professional workflows across 52 domains, from accounting ledgers to code and music notation. Instead of stable performance, they observed that even frontier models gradually corrupt or lose document content as interactions accumulate. For business leaders, this exposes a critical reality of AI model limitations: what looks impressive in a short demo can degrade badly across a full workday’s worth of steps. Autonomy is not the same as reliability, and long-running tasks are exactly where today’s AI struggles most.
What the DELEGATE-52 Study Reveals About Long-Running Tasks
In DELEGATE-52, an AI model is asked to iteratively edit and reorganize documents over 20 interactions, mimicking a long workflow. In an accounting scenario, for example, the system must split a seed ledger into category-based files, then merge them back chronologically into a single, coherent document. Across domains, Microsoft researchers found that leading models such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 lost an average of 25 percent of document content over those 20 steps, while degradation averaged 50 percent across all tested models. Only Python programming met the researchers' "ready" bar of at least 98 percent accuracy after 20 interactions. Worse, errors did not accumulate gradually; they often appeared as sudden, catastrophic corruption, dropping performance by 10 to 30 points in a single round. For enterprises, this means long-running tasks can look fine right up until one late-step failure silently ruins the output.
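To make the shape of the task concrete, here is a minimal Python sketch of the accounting scenario: split a ledger by category, merge it back by date, and measure how much of the original content survives. The entry format, function names, and the simple line-level retention metric are illustrative assumptions, not the benchmark's actual code; note that a deterministic script retains 100 percent, which is exactly the baseline the models failed to match.

```python
def content_retention(seed_lines: set[str], final_lines: set[str]) -> float:
    """Fraction of original ledger entries that survive the workflow."""
    if not seed_lines:
        return 1.0
    return len(seed_lines & final_lines) / len(seed_lines)

# Hypothetical seed ledger: one "date|category|description|amount" entry per line.
seed = [
    "2024-01-03|travel|Flight to client site|420.00",
    "2024-01-05|office|Printer toner|89.99",
    "2024-01-09|travel|Hotel, 2 nights|310.00",
]

# Step 1: split the seed ledger into category-based files (a dict stands in here).
by_category: dict[str, list[str]] = {}
for entry in seed:
    by_category.setdefault(entry.split("|")[1], []).append(entry)

# Step 2: merge everything back chronologically into a single document.
merged = sorted(
    (e for entries in by_category.values() for e in entries),
    key=lambda e: e.split("|")[0],
)

print(f"retention: {content_retention(set(seed), set(merged)):.0%}")  # 100%
```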
Why AI Models Struggle With Task Continuity and State
These failures are not just about imperfect training data; they reflect deeper AI model limitations. Today's large language models are essentially powerful pattern predictors, not systems designed to maintain long-term state the way a database or workflow engine does. Each interaction is processed against a snapshot of context, and information must be repeatedly compressed back into prompts. Over many steps, important details are dropped, rephrased inaccurately, or overwritten. Microsoft's study shows that weaker models tend to delete content, while frontier models more often corrupt it, rewriting sections in plausible but wrong ways. Stronger models do not avoid errors; they defer critical failures to later rounds, which means they can appear stable until a late-stage collapse. And wrapping these models in basic AI agents with tools for reading, writing, and executing code does not solve the problem: the researchers found that tool-using agents actually increased average degradation rather than reducing it.
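The compounding effect is easy to simulate. The sketch below makes no claim about how any particular model works: it simply gives every line of a document a small chance of being deleted or rewritten on each pass, with the error rate and document size chosen arbitrarily for illustration. Even a 2 percent per-turn error rate leaves only about two-thirds of the content intact after 20 interactions.

```python
import random

random.seed(0)
P_CORRUPT = 0.02  # assumed per-turn chance that any given line is lost or rewritten

def model_step(document: list[str]) -> list[str]:
    """One interaction: most lines pass through unchanged, a few are
    deleted (weaker-model behavior) or rewritten in a plausible-but-wrong
    way (frontier-model behavior, per the study's characterization)."""
    out = []
    for line in document:
        r = random.random()
        if r < P_CORRUPT / 2:
            continue                           # deletion
        elif r < P_CORRUPT:
            out.append(line + " [rewritten]")  # plausible-but-wrong edit
        else:
            out.append(line)
    return out

document = [f"clause {i}" for i in range(200)]
original = set(document)
for _ in range(20):                            # a 20-interaction workflow
    document = model_step(document)

intact = len(original & set(document)) / len(original)
print(f"intact after 20 turns: {intact:.0%}")  # ~ (1 - 0.02) ** 20 ≈ 67%
```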
Enterprise Automation Meets Real-World Constraints
In controlled demos and short benchmarks, AI agents can look remarkably capable. But DELEGATE-52 highlights a gap between lab-style evaluations and the messy reality of enterprise automation. The researchers report that performance after just two interactions paints a misleadingly optimistic picture of what happens after 20. This matters for long-running tasks such as month-end closing, policy updates, or large document transformations, where dozens of AI steps may be chained together. The study found that in more than 80 percent of model and domain combinations, the outcome fell into "catastrophic corruption" territory. Organizations are investing a substantial share of their digital budgets in AI automation, yet tooling and evaluation practices often assume that giving an AI more tools automatically boosts AI agent reliability. The findings show the opposite: without careful constraints and monitoring, agents can quietly degrade critical business data over the course of a workflow. One practical response is to score the working document at every step, not just the first two, as sketched below.
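The harness below is a minimal sketch of that idea, not a reference implementation; run_agent_step and score_document are hypothetical placeholders you would wire to your own agent stack and quality metric.

```python
def evaluate_long_horizon(seed_doc, run_agent_step, score_document, steps=20):
    """Score the working document after every interaction so early
    (demo-length) and late (full-workflow) performance can be compared."""
    doc, scores = seed_doc, []
    for step in range(1, steps + 1):
        doc = run_agent_step(doc, step)      # one agent interaction
        scores.append(score_document(doc))   # e.g., fraction of content intact
    return scores

# Usage, once your own agent and scorer are wired in:
#   scores = evaluate_long_horizon(seed, my_agent_step, my_scorer)
#   print(f"after 2 steps: {scores[1]:.0%}   after 20: {scores[-1]:.0%}")
```

A run that looks healthy at step two but collapses by step twenty is precisely the failure mode the two-interaction snapshot hides.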
How to Set Realistic Expectations and Design Safer Workflows
For teams deploying AI agents, the lesson is not to abandon automation but to design around current limitations. Treat long-running tasks as high-risk: break them into shorter, well-bounded steps, and re-validate critical documents with independent checks rather than letting an agent rewrite them repeatedly (a minimal example follows below). Reserve full delegation for domains where models are closer to "ready," such as Python coding in the DELEGATE-52 benchmark, and keep humans in the loop for natural-language-heavy or less common domains. Establish monitoring that tests performance over many interactions, not just single prompts, and be cautious about assuming that tool-using agents are inherently safer. The research also shows rapid improvement over time, with some model families gaining dramatically on the benchmark. By acknowledging where AI agent reliability falls short today, enterprises can prioritize guardrails and governance now, while still positioning themselves to benefit as the underlying models continue to mature.
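As one concrete form such an independent check could take, the sketch below wraps each agent interaction in an invariant test and halts on the first violation, instead of letting corruption propagate through later steps. The ledger-style invariants here (entry count and amount total) are illustrative assumptions; a real deployment would choose invariants that the step in question is actually supposed to preserve.

```python
def ledger_invariants(entries: list[str]) -> tuple[int, float]:
    """Cheap, independent fingerprint of a ledger: entry count and amount total.
    Assumes the "date|category|description|amount" format used above."""
    total = round(sum(float(e.split("|")[-1]) for e in entries), 2)
    return len(entries), total

def checked_step(entries: list[str], agent_step) -> list[str]:
    """Run one agent interaction, then verify nothing was lost or altered.
    agent_step is a placeholder for whatever performs the rewrite."""
    before = ledger_invariants(entries)
    updated = agent_step(entries)
    after = ledger_invariants(updated)
    if after != before:
        raise RuntimeError(f"invariant violated: {before} -> {after}")
    return updated
```

The design choice is deliberate: the check is computed outside the model, costs almost nothing per step, and turns silent late-stage corruption into a loud, immediate failure a human can triage.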
