What AI Coding Agents Are—and Why They Matter Now
AI coding agents are software tools that use large language models to read, modify, and generate code, automate routine development tasks, and interact with build and test systems. They promise higher development productivity, but they also introduce new risks around quality, maintainability, and long-term technical debt. Compared with chat-style code assistants, agents operate closer to how engineers work: they run commands, edit files, and respond to feedback. Vendors and teams now frame them as a new tier of developer tools, a step beyond copy-paste suggestions. Supporters argue they shrink cycle times and reduce toil; critics warn they can flood codebases with subtle defects. The central question is no longer whether agents can write code, but whether their impact on software projects and organizations will prove economically sustainable.
ClickHouse’s Measured Productivity Gains in a Demanding C++ World
ClickHouse provides one of the clearest real-world success stories for AI coding agents on a complex C++ codebase. The team describes three levels of AI-assisted coding, from simple copy-and-paste chat interactions to agents embedded in IDEs and fully autonomous systems in isolated environments. Their biggest wins so far sit at the middle layer: agents wired into the CLI and editor that can edit files, run tests, and propose patches while humans review and guide. According to The New Stack’s report on ClickHouse, “ClickHouse CI runs 20 to 80 million tests across about 600 commits and 300 pull requests a day,” and agents helped cut findings to 3 to 5 per 10 million test runs. Agents now handle boilerplate edits, config changes, merge conflicts, and flaky test fixes, freeing engineers to focus on architecture and tricky bugs.

From Slot Machine Coding to Structural Risk: Hotz’s Critique
Not everyone sees AI coding agents as a net win. George Hotz, known for jailbreaking the iPhone and building comma.ai, argues that agents are “a highly sophisticated statistical model designed to mimic the distribution of programming” rather than true programmers. After months of using agents on projects like tinygrad and hardware reversing, he concluded that he could have done each task better and faster manually. He describes a pattern in which the agent front-loads visible progress, then turns refinement into a slot machine: pull the lever, hope it finishes the job, and be disappointed again and again. His sharper warning is organizational. High-performing engineers tend to catch sloppiness, but lower performers may not, and with agents they are now producing much more code. In his view, this dynamic could flood large codebases with “buckets and buckets of slop” while eroding overall quality.
Short-Term Productivity vs Long-Term Technical Debt
The dispute over AI agent risks is not really about whether agents can speed up coding today; it is about who pays the long-term maintenance bill. ClickHouse’s experience shows impressive localized ROI: agents resolve merge conflicts, review code, and fix flaky tests at scale, especially once models like Claude Opus 4.5 reached a level that worked on their large C++ codebase. But even there, engineers note that agents can generate plausible yet wrong hypotheses when debugging, so outcomes still depend on human judgment. Hotz’s critique turns this into an economic argument: if organizations reward volume over clarity, agents could amplify technical debt, subtle bugs, and fragile abstractions that surface years later. Developer tools ROI, in this view, cannot be measured solely by pull request counts; it must include the cost of harder-to-read code, slower refactors, and quality regressions that escape review.
Choosing When AI Coding Agents Make Sense
The emerging consensus is that AI coding agents are neither magic nor useless; their ROI depends heavily on context. In large, test-heavy systems like ClickHouse, agents shine where tasks are repetitive, well-specified, and easy to verify: boilerplate, configuration edits, CI plumbing, and flaky tests. Autonomous agents can even open pull requests in constrained areas with strong safeguards. In less disciplined environments, Hotz’s concerns loom larger: weak tests, unclear ownership, and lax review can turn agents into accelerators of chaos. Teams should start by mapping tasks along two axes, risk if wrong and ease of automatic checking, and keep agents focused on low-risk, high-verifiability work. They should also track developer tools ROI over months, not days, watching how incident rates, refactor difficulty, and user-facing quality evolve. Where the data support it, agents can become reliable collaborators; elsewhere, they remain an experiment rather than the default.
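The two-axis mapping can be made concrete with a small triage helper. What follows is a minimal Python sketch, not a method taken from ClickHouse or Hotz: the three-level scale, the policy names ("agent", "agent-with-review", "human"), the thresholds, and the sample ratings are all illustrative assumptions that a team would replace with its own.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """Coarse three-step rating; the scale itself is an assumption."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Task:
    name: str
    risk_if_wrong: Level   # how costly a silent mistake would be
    verifiability: Level   # how easily tests or CI can check the result


def agent_policy(task: Task) -> str:
    """Return a hypothetical handling policy for a task."""
    # Low risk and easy to check automatically: delegate to the agent.
    if task.risk_if_wrong is Level.LOW and task.verifiability is Level.HIGH:
        return "agent"
    # High risk, or hard to check automatically: keep a human in charge.
    if task.risk_if_wrong is Level.HIGH or task.verifiability is Level.LOW:
        return "human"
    # Everything in between: the agent drafts, a human reviews closely.
    return "agent-with-review"


# Example ratings are made up for illustration.
tasks = [
    Task("fix flaky test", Level.LOW, Level.HIGH),
    Task("edit CI configuration", Level.MEDIUM, Level.HIGH),
    Task("redesign storage format", Level.HIGH, Level.LOW),
]

for task in tasks:
    print(f"{task.name}: {agent_policy(task)}")
```

The useful part is less the function than the habit it encodes: each task gets an explicit rating on both axes before an agent touches it, and the ratings can be revisited as the incident and refactor data accumulate.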
