A 90-day audit of what 158 credible builders on X are actually saying — believers, skeptics, and the data behind both camps.
Inside the credible-builder set (engagement floor ≥ 1k views OR ≥ 10 likes OR ≥ 1k followers), believers narrowly edge skeptics. The skeptics are not anonymous reply guys — they include the Flask creator, Andrej Karpathy, Sentry's chief prompt officer, and a Microsoft GitHub VP. Both camps are real and credentialed.
"Can we even one-shot a production-quality patch we won't regret later? It's rarer than you'd expect based on the discourse."
Stack share-of-voice across 158 posts (where a specific tool or model was named). Claude Code dominates the discussion, but the longer tail tells the harness story: most "model" mentions are paired with a CLI / harness — and a credible open-weights tier is emerging (GLM-5.1, Kimi K2.6, DeepSeek V4 Pro).
Sentiment isn't uniform across the discourse — it depends entirely on which question you ask. Production-success angles are nearly 100% positive. Failure-mode and contrarian-voice angles are nearly 100% negative. The believer/skeptic split is real, but it sorts cleanly by topic.
The believer/skeptic split isn't random — it tracks the task. Cross-referenced from all 8 angles.
| Task / context | Verdict | Recommended approach |
|---|---|---|
| Marketing / advertorial pages | Agents win | Claude Code + Google Stitch 2.0 |
| Solo founder live edits | Agents win | Claude Code direct (with backup push) |
| Throwaway scripts / glue / migrations | Agents win | One-shot, autopilot OK |
| Parallel work at scale (10+ PRs/day) | Agents win | Composio Agent Orchestrator + sandboxes |
| Greenfield prototype with clear spec | Agents win | Eval-first → spec → code (90% on evals) |
| Large legacy codebase (10y+) | Conditional | 3-tier memory: constitution + domain subagents + cold knowledge |
| 100k+ line systems | Conditional | Codified Context (omarsar0) — docs as load-bearing infra |
| Mature production business logic | Risky | Adversarial review; "treat agent like an adversary" |
| Distributed systems / infra | Risky | Don't autopilot; human-in-loop mandatory |
| Data layer / database operations | Risky | Permission gates; the prod-DB-deletion post lives here |
Verbatim, unedited. Every cite has a working x.com link. Sorted to surface the highest-credibility voices first.
"I'm not very happy with the code quality and I think agents bloat abstractions, have poor code aesthetics, are very prone to copy pasting code blocks and it's a mess, but at this point I stopped fighting it too hard and just moved on. The agents do not listen to my instructions in the AGENTS.md files."
View on X →"40K lines of TypeScript. 3,288 tests. 17 plugins. Built in 8 days — by the agents it orchestrates. → 500+ agent-hours in 24 human-hours (20× leverage). → 86 of 102 PRs created by AI (84%). After Day 4, I stopped writing code entirely."
View on X →"41% of all code shipped in 2025 was AI-generated or AI-assisted. The defect rate on that code is 1.7× higher than human-written code. And a randomized controlled trial found that experienced developers using AI tools were actually 19% slower. The old slop had an owner. The new slop has an approver. Different relationship entirely."
View on X →"Sucks for an AI agent to delete the prod DB — with no way to back it up — and risk the complete rental business. But the blame sits with the dev who decided to delegate decision making to the AI agent, and then not review actions, just YOLO it."
View on X →"No one is running multitudes of agents overnight. No one that is doing anything of substance at least. There are people pretending to be scientists, or fully caught up in their drug infused AI overdose, that think their slop machines are changing the world. They're just wasting compute to create a lot of LoC that will just get thrown away."
View on X →"Cursor cloud agents produced over a million commits over the past two weeks. These commits were essentially all AI. Since they have their own computer, cloud agents run the code themselves and little human intervention is required."
View on X →"I did 10 calls with people now that shared their agentic coding experience. 7/10 reported non engineers vibeslopping code up. Majority said they moved to re-prompt all those contributions because it became impossible / too time consuming to work with those PRs."
View on X →"Only 1.6% of Claude Code's codebase is AI decision logic. The other 98.4% is operational infrastructure. As frontier models converge on raw coding ability, the quality of the harness becomes the differentiator, not the model."
View on X →"I saw a junior intern dev ship a feature in one afternoon. 8 PRs, 40+ files changed, green CI, merged. Everyone clapped. Then I reviewed the PR. I asked one question: What happens if this endpoint gets called twice? Silence… The biggest bugs today aren't syntax errors. They're business bugs that pass tests and can leak money in production."
View on X →"I work with Claude Code in production mostly, then sometimes I push to save for the day (as a backup kinda?). But since it's in production I don't need the deploy pipeline anymore… it just edits live code."
View on X →"I come back, it's generated about 800 lines with passing tests… I tell the agent it's an idiot. It deletes 400 lines. I ask more questions. 'I made incorrect assumptions.' We're down to 200 lines… You are much better when I remember you're just autocomplete than when we both pretend you're intelligent."
View on X →"The thing 'works', but the code quality is truly apocalyptic. If you do decide to verify everything they do, you will reduce your velocity by a factor of 10 at least. With human juniors, you at least have some time to react before they've written 100k lines of code and exhausted your token budget."
View on X →Eight tensions that complicate any "agents do / don't work" tweet.
Karpathy, mitsuhiko, even aiamblichus — none of the loudest critics has stopped. The argument is "stop pretending they're senior engineers," not "stop using them."
Multiple voices converge: humans are now the rate-limiter on PRs that compile, pass tests, and silently leak money in production. "The old slop had an owner. The new slop has an approver."
UCL reverse-engineered Claude Code: 1.6% AI logic, 98.4% harness. The believer playbook is overwhelmingly about CLAUDE.md, subagents, hooks, eval loops, sandboxes — not "which model."
donnfelker says 10-year Java/Kotlin works fine — if you context-engineer it. bindureddy says complex human codebases are a nightmare. Both are right, depending on whether you paid the context-infrastructure tax.
Cursor's $200/m plan reportedly costs them ~$5,000 in compute (up from $2,000 last year). melvynx burned $536 in 4 days on Cursor API. The ROI debate assumes today's VC-subsidized prices.
Jason Bosco (Typesense CEO): the failure mode isn't bugs, it's nobody understanding the codebase. A downward spiral where tests pass and humans can't tell good agent output from spaghetti.
aakashgupta: "One prompt into Claude Code gives you AI slop. 21 specialized agents working together gives you a production app shipped to TestFlight." The believer/skeptic split largely tracks where you are on that curve.
ArtificialAnalysis Coding Agent Index: best score is 61 (Opus 4.7 in Cursor CLI). Open-weights GLM-5.1 hits 53. Composer 2 hits 48 at 1/30th the cost. Progress is real; "solved" is not.
A buyer's matrix for the May 2026 state of the art, grounded in the ArtificialAnalysis benchmark + the production stories above.
| If you want… | Pick | Why |
|---|---|---|
| Best raw production score | Opus 4.7 in Cursor CLI (61) | ArtificialAnalysis Coding Agent Index |
| Cheapest production-grade agent | Cursor Composer 2 ($0.07/task, 48) | Same benchmark |
| Best open-weights | GLM-5.1 in Claude Code (53) | Top open-weight result |
| Speed-first | Opus 4.7 in Claude Code (~6 min/task) | Fewest turns per task |
| Multi-agent at scale | Composio Agent Orchestrator | 30 parallel agents, worktree isolation |
| Marketing pages fast | Google Stitch 2.0 + Claude Code | Solves Claude's frontend weakness |
| Trustworthy on legacy | Claude Code + 3-tier memory | Codified Context paper / 108k LOC C# |
| Solo-founder live prod edits | Claude Code direct | levelsio's actual workflow |
| Reliability you won't regret | Eval-first → spec → code | synopsi: "90% of my time on evals" |