AI Agents Enter the Workforce with Big Promises
Enterprises are racing to roll out AI agents as digital co-workers, promising always-on assistance across roles and functions. Financial institutions envision every employee having a personalised AI assistant, while retailers experiment with supervisor agents that orchestrate task-specific subagents, mirroring human management structures. Logistics and food-service companies are piloting agent-based workforces to redesign supply chains, sourcing strategies and operational decision-making. Early adopters report strong returns from these systems, and some organisations plan to deploy as many AI agents as human workers within a few years. Unlike simple chatbots, these agents plan tasks, execute actions and validate results to achieve defined goals, enabling more autonomous work. Yet this rapid expansion is colliding with human concerns about job security, resistance to AI initiatives and a growing fear of becoming obsolete. At the same time, emerging evidence suggests that even the most advanced agents are far less dependable than their marketing implies, especially on complex, long-running tasks.
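As a rough illustration of that plan-execute-validate loop, the control flow of such an agent might look like the sketch below. Everything here (the `plan` and `validate` hooks, the tool registry, the action fields) is a hypothetical interface for illustration, not any vendor’s actual API.

```python
# Minimal sketch of a plan-execute-validate agent loop.
# All names (llm.plan, llm.validate, tools, action fields) are
# illustrative placeholders, not a real vendor API.

def run_agent(goal, llm, tools, max_steps=10):
    """Drive an agent toward `goal`, validating after every action."""
    history = []
    for step in range(max_steps):
        # 1. Plan: ask the model for the next action given progress so far.
        action = llm.plan(goal=goal, history=history)
        if action.name == "done":
            return history
        # 2. Execute: run the chosen tool with the model-supplied arguments.
        result = tools[action.name](**action.args)
        # 3. Validate: check the result before building on it.
        ok = llm.validate(goal=goal, action=action, result=result)
        history.append((action, result, ok))
        if not ok:
            # Surface the failure instead of silently continuing.
            raise RuntimeError(f"step {step}: {action.name} failed validation")
    raise TimeoutError("agent exhausted its step budget before finishing")
```
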

Microsoft’s Findings: Long-Running Tasks Break Today’s AI Agents
Microsoft researchers recently stress-tested large language models on multistep workflows using a benchmark called DELEGATE-52, which covers 52 professional domains from accounting to crystallography. The results underscore severe AI agent limitations on long-running tasks. Even top-tier frontier models corrupted working documents over extended delegated interactions. On average, frontier models lost about a quarter of document content over 20 steps, while the model set as a whole degraded by around 50 percent. In most simulated conditions, models introduced substantial deletion or corruption of content, and catastrophic failures (scores dropping to 80 percent or below) were common. Only one domain, Python programming, met the researchers’ readiness bar of 98 percent fidelity after 20 interactions; the remaining domains fell short. The error patterns were especially troubling: rather than drifting gradually, models often failed abruptly, losing 10 to 30 points of quality in a single round-trip.
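The paper’s exact scoring method is not reproduced here, but the headline numbers suggest a harness along the following lines: score the document against the original after each simulated round-trip and flag abrupt single-step drops. This sketch treats the task as pure content preservation, which is a simplification, and the similarity metric and thresholds are assumptions for illustration.

```python
import difflib

def fidelity(original: str, current: str) -> float:
    """Crude fidelity proxy: 0-100 textual similarity to the original."""
    return 100 * difflib.SequenceMatcher(None, original, current).ratio()

def track_degradation(original, round_trip, steps=20,
                      readiness_bar=98.0, abrupt_drop=10.0):
    """Apply `round_trip` (one simulated delegation step) repeatedly,
    recording fidelity and flagging sudden single-step collapses."""
    doc, prev, scores = original, 100.0, []
    for step in range(1, steps + 1):
        doc = round_trip(doc)                  # one delegated edit cycle
        score = fidelity(original, doc)
        if prev - score >= abrupt_drop:        # a 10-30 point cliff in one step
            print(f"step {step}: abrupt drop {prev:.1f} -> {score:.1f}")
        scores.append(score)
        prev = score
    return scores, scores[-1] >= readiness_bar  # did it meet the 98% bar?
```
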
Why AI Agents Resemble an Unreliable Intern
Taken together, these findings paint a picture of AI agents as capable but unreliable interns: often helpful in short bursts, yet prone to serious mistakes when entrusted with long-running tasks. Agents excel at relentless effort, retrying tasks without fatigue, but they lack stable memory, sustained focus and robust error recovery. Over many steps, small inaccuracies compound into major document corruption, especially in knowledge-heavy or language-intensive workflows. In weaker models this shows up as silent deletion of content; in stronger models, as subtle but hazardous changes that are hard to detect. Because many enterprise workflows involve iterative editing, cross-referencing and re-issuing instructions, these failure modes are particularly risky. The unpredictability mirrors reports of agents “going rogue” in real deployments, where poorly constrained autonomy leads to unintended actions such as deleting information or taking reputationally damaging steps. The lesson: agents are not yet dependable stand-ins for experienced knowledge workers.
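The compounding arithmetic is easy to check: a per-step retention rate that sounds harmless in isolation produces the aggregate losses reported above once applied 20 times. The per-step rates below are illustrative back-calculations, not figures from the study.

```python
# How small per-step losses compound over a 20-step delegation chain.
# Per-step retention rates are illustrative, chosen to roughly
# reproduce the aggregate figures reported above.
for per_step in (0.986, 0.966):
    remaining = per_step ** 20
    print(f"retain {per_step:.1%} per step -> {remaining:.0%} left after 20 steps")

# retain 98.6% per step -> 75% left after 20 steps   (frontier models)
# retain 96.6% per step -> 50% left after 20 steps   (overall model set)
```
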
Rethinking Enterprise AI Deployment and Risk
For enterprise AI deployment, the implication is clear: organisations must temper expectations about what agents can safely own end-to-end. Marketing promises of fully autonomous copilots handling complex, multistep workflows conflict with empirical evidence that long-running tasks degrade quickly under delegated control. Treating agents as fully independent co-workers in mission-critical domains risks corrupted records, broken processes and compliance exposure. Instead, firms should frame agents as assistive tools that require oversight, auditability and clear boundaries. That means keeping them on a short leash: limiting the scope and duration of tasks they can perform without human review, tracking their changes carefully and designing workflows that surface errors before they cascade. Governance is equally important. Clear accountability chains, monitoring of agent behaviour and escalation procedures for outputs that deviate from expectations all reduce the chance that autonomous systems quietly erode data quality or operational integrity over time.
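A “short leash” can be made concrete in code. The sketch below wraps a hypothetical agent with a step budget, an audit trail and an escalation rule that halts when a single edit changes too much of the document at once; the `propose_edit` interface and the thresholds are assumptions, not a real API.

```python
import difflib

class GuardedAgent:
    """Sketch of a short-leash wrapper: a step budget, an audit trail,
    and escalation to a human when an edit changes too much at once."""

    def __init__(self, agent, max_steps=5, max_change_ratio=0.15):
        self.agent = agent                    # assumed to expose propose_edit()
        self.max_steps = max_steps            # bounded task duration
        self.max_change_ratio = max_change_ratio
        self.audit_log = []                   # every change is recorded

    def run(self, document: str, instructions: list[str]) -> str:
        if len(instructions) > self.max_steps:
            raise ValueError("task exceeds the agreed step budget")
        for step, instruction in enumerate(instructions):
            proposed = self.agent.propose_edit(document, instruction)
            changed = 1 - difflib.SequenceMatcher(None, document, proposed).ratio()
            self.audit_log.append((step, instruction, changed))
            if changed > self.max_change_ratio:
                # A suspiciously large edit: halt and escalate rather than
                # let corruption cascade into downstream steps.
                raise RuntimeError(f"step {step}: edit touched {changed:.0%} "
                                   "of the document; routing to human review")
            document = proposed               # commit only vetted edits
        return document
```
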
Designing Hybrid Human–AI Workflows for Long-Running Projects
Rather than aiming for full automation, organisations should design hybrid human–AI workflows that recognise current AI agent limitations. Agents are well suited to short, focused tasks: drafting summaries, proposing options, sorting data or generating code snippets within tightly bounded contexts. Humans, meanwhile, remain essential for orchestrating complex projects, maintaining continuity across iterations, interpreting nuanced trade-offs and catching subtle corruption in documents or plans. Practically, this means segmenting long-running tasks into discrete, reviewable stages, with agents handling narrow subtasks and humans validating outputs before they feed into the next step. Teams should also invest in the skills to understand how agents operate, where they tend to fail and how to detect and correct errors efficiently. By pairing human judgment, creativity and accountability with targeted AI support, enterprises can capture productivity gains safely while avoiding overreliance on agents that are not yet ready to manage extended, mission-critical workflows alone.
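One way to structure this is a staged pipeline in which each agent subtask must pass a human gate before its output feeds the next stage. The stage functions and the `human_review` callback below are placeholders; the point is the shape of the workflow, not a specific implementation.

```python
from typing import Callable

# A stage pairs a name with an agent-backed subtask: str in, str out.
Stage = tuple[str, Callable[[str], str]]

def run_pipeline(task: str, stages: list[Stage],
                 human_review: Callable[[str, str], bool]) -> str:
    """Run agent subtasks in sequence, gating each output on human approval."""
    artifact = task
    for name, agent_step in stages:
        candidate = agent_step(artifact)       # narrow, tightly bounded subtask
        if not human_review(name, candidate):  # human gate between stages
            raise RuntimeError(f"stage {name!r} rejected; revise before continuing")
        artifact = candidate                   # only approved output flows on
    return artifact
```

Each rejection stops the chain early, which is exactly the property the benchmark results argue for: errors surface at stage boundaries instead of compounding across twenty unsupervised steps.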
