MilikMilik

Why AI Agents Still Fail at Real Work — And How Companies Are Fighting Back

AI Agent Reliability Meets the Reality of Enterprise Work

AI agents promised hands-free productivity for everything from coding to contract editing, but reliability problems are surfacing as companies try to automate real workflows. Microsoft Research recently warned that when large language models are delegated long chains of edits on the same file, they can quietly corrupt the work product. Its DELEGATE-52 benchmark showed that frontier models lost about a quarter of document content over 20 revision cycles, while the average loss across all models tested ran as high as half. That kind of corruption in contracts, policy drafts, or technical reports makes AI a risky delegate rather than a dependable coworker. At the same time, software teams are seeing side effects of aggressive automation: Linux maintainers report a flood of low-quality AI-generated bug reports, while cloud providers face scrutiny over how much autonomy their coding agents should really have. Together, these cases expose a widening gap between enterprise demand for AI automation and the reliability that AI agents actually deliver.

When Agents Edit Alone: Document Corruption as a New Failure Mode

The Microsoft DELEGATE-52 study targets an overlooked failure mode: what happens when AI agents keep revising the same document over time. Rather than testing a single prompt and response, the benchmark simulates 20 rounds of delegated editing across 52 professional domains, from coding to crystallography and music notation. The models are supposed to preserve structure, intent, and factual details while integrating new instructions. Instead, researchers found that content gradually erodes. Frontier systems lost around 25 percent of material, and the average model lost up to 50 percent by the end of the chain. Edits often looked polished but smuggled in subtle shifts in numbers, conditions, or qualifications that a busy reviewer might miss. For document-heavy organizations, this means AI agents remain closer to fast drafting tools than trustworthy autonomous editors. The finding reinforces that human review, audit trails, and approval steps cannot be removed from workflows without introducing serious risk.
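To see how small per-round losses add up to figures like these, here is a minimal Python sketch. It assumes, purely for illustration, that the same fraction of content survives every round; the study as described here does not say losses are uniform, and the function names are hypothetical.

```python
def per_round_retention(total_retained: float, rounds: int) -> float:
    """Constant per-round retention that compounds to `total_retained` after `rounds` rounds."""
    if not 0.0 < total_retained <= 1.0 or rounds < 1:
        raise ValueError("total_retained must be in (0, 1] and rounds must be >= 1")
    return total_retained ** (1.0 / rounds)


def retained_after(per_round: float, rounds: int) -> float:
    """Fraction of content left after `rounds` rounds at a constant per-round retention."""
    return per_round ** rounds


ROUNDS = 20  # length of the editing chain described above

# Assumption: uniform loss per round (an illustrative model, not a study result).
frontier = per_round_retention(0.75, ROUNDS)  # about 25 percent lost overall
average = per_round_retention(0.50, ROUNDS)   # up to 50 percent lost overall

for label, rate in (("frontier", frontier), ("average (upper figure)", average)):
    print(
        f"{label}: about {(1 - rate) * 100:.1f}% lost per round, "
        f"{retained_after(rate, ROUNDS) * 100:.0f}% left after {ROUNDS} rounds"
    )
```

Under that assumption, losing roughly 1.4 percent of the content per round is enough to reach a 25 percent loss after 20 rounds, and roughly 3.4 percent per round reaches 50 percent. Losses that small are easy to miss in any single review pass.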

AWS Turns to Formal Logic to Fix Requirements Before Code

In software development, some of the most expensive bugs start in ambiguous or conflicting requirements, not in the code itself. AWS says its Kiro agentic development platform often encounters requirements with contradictions, gaps, or vague language that lead AI coding tools to make hidden decisions. To tackle this, AWS is rolling out a Requirements Analysis feature that combines large language models with an automated reasoning engine known as an SMT (satisfiability modulo theories) solver. First, the LLM rewrites natural-language requirements into precise, testable criteria. Then those criteria are translated into formal logic. The SMT solver attempts to mathematically prove whether all rules can hold simultaneously, flagging contradictions, undefined behaviors, and missing cases. Findings are presented back to developers as concise clarification questions that can be resolved in seconds. AWS reports that it has found requirement bugs in roughly 60 percent of specs examined so far, illustrating how formal verification can become a crucial guardrail before AI agents generate any code.
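The idea of checking whether all rules can hold at once can be illustrated with a toy example. The sketch below is not AWS's implementation and does not use an SMT solver: it brute-forces every combination of two boolean inputs, which only works for tiny cases. The rules, variable names, and the `analyze` function are all hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

State = Dict[str, bool]
Rule = Tuple[str, Callable[[State], bool]]

# Hypothetical requirement set, already rewritten as testable criteria.
INPUTS = ["is_admin", "archived"]
OUTPUT = "can_delete"

RULES: List[Rule] = [
    # "Admins can delete any item": is_admin implies can_delete
    ("Admins can delete any item",
     lambda s: (not s["is_admin"]) or s["can_delete"]),
    # "Archived items cannot be deleted": archived implies not can_delete
    ("Archived items cannot be deleted",
     lambda s: (not s["archived"]) or (not s["can_delete"])),
]


def analyze(rules: List[Rule]) -> Tuple[List[State], List[State]]:
    """Return (contradictions, undefined): input cases where no outcome
    satisfies every rule, and input cases where the rules allow both outcomes."""
    contradictions: List[State] = []
    undefined: List[State] = []
    for values in product([False, True], repeat=len(INPUTS)):
        inputs = dict(zip(INPUTS, values))
        allowed = [
            out for out in (False, True)
            if all(check({**inputs, OUTPUT: out}) for _, check in rules)
        ]
        if not allowed:
            contradictions.append(inputs)  # rules conflict for this case
        elif len(allowed) == 2:
            undefined.append(inputs)       # rules say nothing for this case
    return contradictions, undefined


contradictions, undefined = analyze(RULES)
for case in contradictions:
    print("Clarify: the rules conflict when", case)
for case in undefined:
    print("Clarify: the rules do not decide what happens when", case)
```

Here the two rules conflict for an admin acting on an archived item, and they leave the outcome open for a non-admin acting on a non-archived item. A real solver reasons symbolically over integers, strings, and other theories instead of enumerating cases, but the output has the same shape: a short list of concrete cases to put back to the developer as questions.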

Linux Maintainers Confront AI-Generated Noise in Bug Reports

While cloud providers try to harden AI coding agents, open-source maintainers are dealing with a different reliability problem: AI-generated bug report spam. Linux creator Linus Torvalds has called the project’s security mailing list “almost entirely unmanageable” due to a surge of reports suspected to be produced by automated tools. During the Linux 7.0 and 7.1 release candidate cycles, maintainers saw a spike in reported issues, many of them minor, duplicated, or low-quality. Torvalds believes multiple people are running the same AI scanners and privately submitting nearly identical reports through sensitive security channels, dramatically increasing triage workload without a corresponding increase in real vulnerabilities. The kernel community has accepted AI-generated code in some circumstances, but this wave of noisy reports shows that automating discovery without adding verification filters simply shifts the burden downstream. Human experts must still separate meaningful findings from machine-generated chaff.
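One simple verification filter of the kind the paragraph above calls for is grouping near-identical reports before a human looks at them. The following is a minimal sketch using Python's standard `difflib`; the sample reports, the function names in them, and the 0.9 similarity threshold are invented for illustration and do not reflect how kernel maintainers actually triage.

```python
from difflib import SequenceMatcher
from typing import List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def group_near_duplicates(reports: List[str], threshold: float = 0.9) -> List[List[str]]:
    """Greedily group reports whose normalized text is at least `threshold`
    similar to the first report in an existing group."""
    groups: List[List[str]] = []
    for report in reports:
        norm = normalize(report)
        for group in groups:
            similarity = SequenceMatcher(None, normalize(group[0]), norm).ratio()
            if similarity >= threshold:
                group.append(report)
                break
        else:
            groups.append([report])  # no similar group found; start a new one
    return groups


# Hypothetical reports; foo_probe() and bar_release() are made-up names.
reports = [
    "Possible NULL pointer dereference in foo_probe() when dev->priv is unset",
    "possible null pointer dereference in foo_probe()  when dev->priv is unset.",
    "Use-after-free in bar_release() after concurrent close",
]

groups = group_near_duplicates(reports)
print(f"{len(reports)} reports collapsed into {len(groups)} groups for triage")
```

Text similarity only catches copies with the same wording. Reports that describe the same issue in different words would still need a human, or a stronger check, to merge.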

Old-School Logic as the Next Layer of AI Safety

Across document editing, coding, and security triage, a pattern is emerging: enterprise automation is outpacing AI agent quality, forcing vendors to add new verification layers. Rather than relying on more neural networks to judge each other, companies like AWS are reviving decades-old formal methods such as SMT solvers and automated reasoning engines. This neurosymbolic approach lets large language models handle natural language while symbolic logic systems mathematically prove or refute the consistency of what agents plan to do. In practice, that means catching contradictory requirements before code is written, or eventually ensuring agent plans cannot delete crucial clauses in a contract without explicit authorization. For buyers, the message is shifting from “trust the model” to “trust the guardrails around the model.” AI agent reliability will increasingly depend on these logic-based safety nets, combined with human review, to keep document corruption, buggy code, and noisy reports from undermining automation gains.
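As a rough illustration of a guardrail around the model, the sketch below blocks a revision that drops a protected contract clause unless the removal was explicitly authorized. It is a plain string check, far weaker than the formal proofs described above, and the clause text and the `unauthorized_removals` function are hypothetical.

```python
from typing import Iterable, List


def unauthorized_removals(
    original: str,
    revised: str,
    protected: Iterable[str],
    authorized: Iterable[str] = (),
) -> List[str]:
    """Return protected clauses present in `original`, absent from `revised`,
    and not listed in `authorized`."""
    allowed = set(authorized)
    return [
        clause for clause in protected
        if clause in original and clause not in revised and clause not in allowed
    ]


# Hypothetical contract text and protected clauses.
ORIGINAL = (
    "The supplier shall deliver within 30 days. "
    "Liability is capped at the fees paid. "
    "Either party may terminate with 60 days notice."
)
REVISED = (
    "The supplier shall deliver within 30 days. "
    "Either party may terminate with 60 days notice."
)
PROTECTED = [
    "Liability is capped at the fees paid.",
    "Either party may terminate with 60 days notice.",
]

violations = unauthorized_removals(ORIGINAL, REVISED, PROTECTED)
if violations:
    print("Blocked: revision removes protected clauses:", violations)
else:
    print("Revision accepted")
```

An exact-match check like this misses a clause that is reworded rather than deleted, which is the kind of subtle shift described earlier. That gap is one reason the article's vendors pair logic-based checks with human review.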
