Microsoft’s DELEGATE-52 Benchmark: When Automation Quietly Corrupts Work
Microsoft researchers have delivered a sharp reality check to the promise of autonomous AI agents. In a preprint titled “LLMs Corrupt Your Documents When You Delegate,” they introduced DELEGATE-52, a benchmark simulating long-running workflows across 52 professional domains, from accounting and code authoring to crystallography and music notation. Instead of minor, gradual drift, they observed substantial document degradation as tasks progressed. Frontier models such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 lost on average 25 percent of document content over 20 delegated interactions. Across all tested models, average degradation reached 50 percent. Only Python programming achieved the researchers’ readiness bar of 98 percent accuracy after 20 steps. In more than 80 percent of model/domain combinations, the team saw “catastrophic corruption,” where final scores fell to 80 percent or lower—levels no enterprise would tolerate from a human colleague.
Why Long-Running Tasks Break AI Agents
The study highlights structural limitations of AI agents rather than mere model immaturity. During long-running tasks, errors did not slowly accumulate; they tended to arrive in sudden, severe bursts. A single interaction could slash performance by 10 to 30 points, turning an apparently stable workflow into a corrupted mess. Weaker models mainly deleted content, while stronger frontier models more often produced subtle corruption that is harder to detect. Adding tools and agentic behavior did not help: when models were wrapped in a basic agent harness with file reading, writing, and code execution, performance worsened by a further six percent on average by the end of the simulations. Crucially, short tests mislead: performance after two interactions did not predict behavior after 20. For enterprise AI workflows, that means quick demos or proofs of concept can mask the true fragility of long-horizon automation.
Implications for Enterprise AI Workflows and Automation
These AI model failures directly challenge common ambitions for enterprise AI workflows. Marketing pitches promise agents that can “tackle complex, multistep research” or autonomously operate on local files, apps, and cloud systems. Microsoft’s findings suggest that, outside a few domains like Python coding, such long-running tasks remain risky without close human oversight. The limitations cut across cloud automation, document-centric processes, and data pipelines. For example, workflows that repeatedly transform ledgers, reports, or configuration files can experience silent corruption well before completion. In weaker models, missing sections might be obvious; in stronger ones, incorrect but plausible edits may slip through review. As organizations devote a significant share of their digital budgets to AI automation, these hidden failure modes raise governance questions. Delegation to agents cannot yet be treated like delegating to a reliable employee; it still resembles working with an error-prone intern who occasionally breaks everything at once.
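For document transformations like the ledger case above, a deterministic reconciliation check is one cheap defense against plausible-looking corruption. The following is a minimal sketch only: the CSV layout, the column names, and the tolerance are illustrative assumptions, not details from the study.

```python
import csv
import io

def ledger_totals(csv_text: str) -> dict[str, float]:
    """Sum the debit and credit columns of a ledger.
    The column names are assumptions for this sketch."""
    totals = {"debit": 0.0, "credit": 0.0}
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals["debit"] += float(row["debit"])
        totals["credit"] += float(row["credit"])
    return totals

def reconciles(original: str, transformed: str, tolerance: float = 0.01) -> bool:
    """A transformation step must not change the ledger's totals.
    A drifted total is exactly the kind of plausible edit that
    slips past a quick visual review."""
    before, after = ledger_totals(original), ledger_totals(transformed)
    return all(abs(before[k] - after[k]) <= tolerance for k in before)
```

A check this simple cannot prove an edit is correct, but it can catch the silent, incremental damage the benchmark describes long before step 20.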
Design Strategies: Break Tasks, Add Checkpoints, Monitor Aggressively
For developers, the core lesson is architectural: do not design AI agents around unbroken, unsupervised long-running tasks. Instead, decompose workflows into small, verifiable units with explicit boundaries. Each unit should have clear preconditions, structured outputs, and automated validations so that errors are caught early rather than compounding over 20 steps. Checkpointing is equally critical: persist intermediate states in versioned storage and compare the current document against earlier snapshots to detect sudden degradation, as the sketch below illustrates. Where possible, use deterministic tools, such as schema validators, test suites, or reconciliation scripts, to confirm that an agent’s changes preserve required structure and content. Finally, build review loops into the workflow, especially for domains where the models are not yet “ready.” Microsoft’s own results show steady improvement over time, but today’s reality is clear: enterprises must engineer safety rails around AI agents instead of assuming persistence, robustness, or reliable self-correction.
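To make those rails concrete, here is a minimal sketch of a checkpointed delegation loop. The `run_agent_step` call is a hypothetical stand-in for whatever model or agent SDK is in use, and the snapshot directory, the 15 percent drift threshold, and the validator hook are illustrative choices, not part of Microsoft’s benchmark.

```python
import difflib
import pathlib
from collections.abc import Callable, Sequence

SNAPSHOT_DIR = pathlib.Path("snapshots")  # versioned intermediate states
MAX_DROP = 0.15  # flag any step that disturbs more than 15% of the document

def run_agent_step(document: str, instruction: str) -> str:
    """Hypothetical stand-in for one delegated model call."""
    raise NotImplementedError

def retained(old: str, new: str) -> float:
    """Fraction of content shared between two snapshots, via difflib."""
    return difflib.SequenceMatcher(None, old, new).ratio()

def delegate(document: str,
             instructions: Sequence[str],
             validators: Sequence[Callable[[str], bool]] = ()) -> str:
    """Run delegated steps with checkpoints, burst detection, and
    deterministic validation between every interaction."""
    SNAPSHOT_DIR.mkdir(exist_ok=True)
    for step, instruction in enumerate(instructions):
        # Checkpoint first, so any later corruption is recoverable from disk.
        (SNAPSHOT_DIR / f"step_{step:03d}.txt").write_text(document)
        candidate = run_agent_step(document, instruction)
        # Burst detection: the study saw sudden 10-to-30-point drops, so a
        # large single-step divergence is grounds to halt for human review.
        if retained(document, candidate) < 1.0 - MAX_DROP:
            raise RuntimeError(f"step {step}: abrupt degradation, halting for review")
        # Deterministic checks, e.g. schema validators or the reconciles()
        # ledger check sketched earlier.
        if not all(check(candidate) for check in validators):
            raise RuntimeError(f"step {step}: validation failed, halting for review")
        document = candidate
    return document
```

Rejecting on a raw diff ratio is deliberately crude; in practice the threshold would be tuned per domain, and a rejected step would route the last checkpointed document to a human reviewer rather than simply aborting the run.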
