X Discourse Deep Dive · 12 Feb 2026 → 12 May 2026 · n = 158 credible posts

Can AI agents write production-ready code?

A 90-day audit of what 158 credible builders on X are actually saying — believers, skeptics, and the data behind both camps.

🎯 158 credible posts 👁 23,425 median views ⭐ 59.5% high-credibility (4–5/5) 🔀 8 angles · parallel X-search

01 The split among credible builders

Inside the credible-builder set (engagement floor: ≥ 1k views OR ≥ 10 likes OR ≥ 1k followers), believers (46.8%) narrowly edge the combined skeptic and contrarian camps (45.6%). The critics are not anonymous reply guys: they include the Flask creator, Andrej Karpathy, Sentry's chief prompt officer, and a Microsoft GitHub VP. Both camps are real and credentialed.

Camp	Share
Believers (positive)	46.8%
Skeptics	36.7%
Contrarians	8.9%
Neutral	7.6%

45.6%

Combined skeptic + contrarian. Nearly half the credible discourse pushes back.

1.6%

Of Claude Code's codebase is AI decision logic. The other 98.4% is harness. (UCL paper, akshay_pachaar)

30×

Cost-per-task variance across 9 model+harness combinations on the ArtificialAnalysis Coding Agent Index.

1.7×

Defect rate of AI-generated code vs human-written, per aakashgupta's 556k-view post citing 2025 data.
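The 45.6% pushback figure is a plain sum of two camps, and the four shares should add up to 100%. A minimal check using only the percentages quoted above; the dictionary keys and the function name are illustrative assumptions, not part of the source dataset:

```python
# Shares of the 158 credible posts, as reported in the text (percent).
SHARES = {
    "believer": 46.8,
    "skeptic": 36.7,
    "contrarian": 8.9,
    "neutral": 7.6,
}


def combined_share(shares, camps):
    """Sum the percentage shares of the named camps, rounded to one decimal."""
    return round(sum(shares[camp] for camp in camps), 1)


total = combined_share(SHARES, SHARES.keys())
pushback = combined_share(SHARES, ["skeptic", "contrarian"])
# Gap between believers and the combined pushback camps, in percentage points.
lead = round(SHARES["believer"] - pushback, 1)

print(f"total={total}%  pushback={pushback}%  believer lead={lead} points")
```

On these figures the believer lead over the combined pushback camps is about one percentage point, which is why the split reads as narrow.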

"Can we even one-shot a production-quality patch we won't regret later? It's rarer than you'd expect based on the discourse."

— David Cramer (@zeeg, fractional executive · chief prompt officer @ Sentry) · 713,940 views · post →

02 What people are actually using

Stack share-of-voice across 158 posts (where a specific tool or model was named). Claude Code dominates the discussion, but the longer tail tells the harness story: most "model" mentions are paired with a CLI / harness — and a credible open-weights tier is emerging (GLM-5.1, Kimi K2.6, DeepSeek V4 Pro).

[Chart: share of voice by named tool or model. Labels, in the order shown: Claude Code, Cursor (incl. CLI), Codex, OpenHands, Opus 4.7, GLM-5.1, OpenCode, Kimi K2.6, GPT-5.5, Aider, Gemini CLI, DeepSeek V4.]

03 Each angle has its own camp

Sentiment isn't uniform across the discourse — it depends entirely on which question you ask. Production-success angles are nearly 100% positive. Failure-mode and contrarian-voice angles are nearly 100% negative. The believer/skeptic split is real, but it sorts cleanly by topic.

[Chart: sentiment breakdown per angle. Angles: Production success stories, Harness & tooling, Head-to-head benchmarks, Cost & value, Vertical & context fit, Code quality & review, Failure modes, Skeptic / contrarian voices. Legend: Believer, Skeptic, Contrarian, Neutral.]

04 Where agents win, where they collapse

The believer/skeptic split isn't random — it tracks the task. Cross-referenced from all 8 angles.

Task / context	Verdict	Recommended approach
Marketing / advertorial pages	Agents win	Claude Code + Google Stitch 2.0
Solo founder live edits	Agents win	Claude Code direct (with backup push)
Throwaway scripts / glue / migrations	Agents win	One-shot, autopilot OK
Parallel work at scale (10+ PRs/day)	Agents win	Composio Agent Orchestrator + sandboxes
Greenfield prototype with clear spec	Agents win	Eval-first → spec → code (90% of time spent on evals)
Large legacy codebase (10y+)	Conditional	3-tier memory: constitution + domain subagents + cold knowledge
100k+ line systems	Conditional	Codified Context (omarsar0) — docs as load-bearing infra
Mature production business logic	Risky	Adversarial review; "treat agent like an adversary"
Distributed systems / infra	Risky	Don't autopilot; human-in-loop mandatory
Data layer / database operations	Risky	Permission gates; the prod-DB-deletion post lives here
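The "permission gates" recommendation for the data layer can be sketched as a check that sits between the agent and the database. This is a minimal illustration under stated assumptions: `gate_sql`, `PermissionDenied`, and the list of destructive keywords are hypothetical names chosen for this example, not any tool's real API, and a real gate would need proper SQL parsing rather than a prefix match.

```python
import re

# Statements treated as destructive in this sketch (an assumption, not a complete list).
DESTRUCTIVE = re.compile(r"^\s*(DROP|TRUNCATE|DELETE|ALTER)\b", re.IGNORECASE)


class PermissionDenied(Exception):
    """Raised when an agent-issued statement needs human approval."""


def gate_sql(statement, approved_by=None):
    """Return the statement if it may run; raise unless a human approved it.

    Statements that do not start with a destructive keyword pass through.
    Destructive ones require a non-empty `approved_by` naming the reviewer.
    """
    if DESTRUCTIVE.match(statement) and not approved_by:
        raise PermissionDenied(f"needs human approval: {statement.strip()!r}")
    return statement


# A read passes straight through.
print(gate_sql("SELECT * FROM bookings"))

# A destructive statement is blocked until a human signs off.
try:
    gate_sql("DROP TABLE bookings")
except PermissionDenied as err:
    print("blocked:", err)

print(gate_sql("DROP TABLE bookings", approved_by="on-call engineer"))
```

The design choice is that the default answer is "no": the agent cannot grant itself approval, so a destructive action always leaves a named human in the loop.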

05 The receipts: twelve voices, both camps

Verbatim, unedited. Every cite has a working x.com link. Sorted to surface the highest-credibility voices first.

@karpathy · contrarian

AI researcher · ex-OpenAI / Tesla · Stanford PhD

"I'm not very happy with the code quality and I think agents bloat abstractions, have poor code aesthetics, are very prone to copy pasting code blocks and it's a mess, but at this point I stopped fighting it too hard and just moved on. The agents do not listen to my instructions in the AGENTS.md files."

👁 819K views ★ 5/5

@agent_wrapper · believer

Orchestrator of agent orchestrators · @Composio

"40K lines of TypeScript. 3,288 tests. 17 plugins. Built in 8 days — by the agents it orchestrates. → 500+ agent-hours in 24 human-hours (20× leverage). → 86 of 102 PRs created by AI (84%). After Day 4, I stopped writing code entirely."

👁 590K views ★ 5/5

@aakashgupta · skeptic

Writer / builder on product + AI

"41% of all code shipped in 2025 was AI-generated or AI-assisted. The defect rate on that code is 1.7× higher than human-written code. And a randomized controlled trial found that experienced developers using AI tools were actually 19% slower. The old slop had an owner. The new slop has an approver. Different relationship entirely."

👁 556K views ★ 5/5

@GergelyOrosz · contrarian

@Pragmatic_Eng · ex-Uber / Skype

"Sucks for an AI agent to delete the prod DB — with no way to back it up — and risk the complete rental business. But the blame sits with the dev who decided to delegate decision making to the AI agent, and then not review actions, just YOLO it."

👁 420K views ★ 5/5

@zeeg · contrarian

Chief prompt officer @ Sentry · fractional exec / founder

"No one is running multitudes of agents overnight. No one that is doing anything of substance at least. There are people pretending to be scientists, or fully caught up in their drug infused AI overdose, that think their slop machines are changing the world. They're just wasting compute to create a lot of LoC that will just get thrown away."

👁 714K views ★ 5/5

@mntruell · believer

CEO · Cursor

"Cursor cloud agents produced over a million commits over the past two weeks. These commits were essentially all AI. Since they have their own computer, cloud agents run the code themselves and little human intervention is required."

👁 57K views ★ 5/5

@mitsuhiko · skeptic

Creator of Flask · husband and father of 3

"I did 10 calls with people now that shared their agentic coding experience. 7/10 reported non engineers vibeslopping code up. Majority said they moved to re-prompt all those contributions because it became impossible / too time consuming to work with those PRs."

👁 28K views ★ 5/5

@akshay_pachaar · believer

Co-founder @dailydoseofds_ · ex-LightningAI · 3 patents

"Only 1.6% of Claude Code's codebase is AI decision logic. The other 98.4% is operational infrastructure. As frontier models converge on raw coding ability, the quality of the harness becomes the differentiator, not the model."

👁 177K views ★ 5/5

@0xlelouch_ · skeptic

Writes on scalability · building @0xffdevs · @indiehash

"I saw a junior intern dev ship a feature in one afternoon. 8 PRs, 40+ files changed, green CI, merged. Everyone clapped. Then I reviewed the PR. I asked one question: What happens if this endpoint gets called twice? Silence… The biggest bugs today aren't syntax errors. They're business bugs that pass tests and can leak money in production."

👁 115K views ★ 5/5

@levelsio · believer

Solo founder · multiple $10k+/m revenue products

"I work with Claude Code in production mostly, then sometimes I push to save for the day (as a backup kinda?). But since it's in production I don't need the deploy pipeline anymore… it just edits live code."

👁 83K views ★ 5/5

@VicVijayakumar · contrarian

Principal engineer · "the fax guy"

"I come back, it's generated about 800 lines with passing tests… I tell the agent it's an idiot. It deletes 400 lines. I ask more questions. 'I made incorrect assumptions.' We're down to 200 lines… You are much better when I remember you're just autocomplete than when we both pretend you're intelligent."

👁 28.5K views ★ 4/5

@aiamblichus · skeptic

AI researcher · post-academic toolmaker

"The thing 'works', but the code quality is truly apocalyptic. If you do decide to verify everything they do, you will reduce your velocity by a factor of 10 at least. With human juniors, you at least have some time to react before they've written 100k lines of code and exhausted your token budget."

👁 28K views ★ 4/5

06 What the headlines hide

Eight tensions that complicate any "agents do / don't work" tweet.

The skeptics are still shipping with agents.

Karpathy, mitsuhiko, even aiamblichus — none of the loudest critics has stopped. The argument is "stop pretending they're senior engineers," not "stop using them."

The bottleneck moved from generation to review.

Multiple voices converge: humans are now the rate-limiter on PRs that compile, pass tests, and silently leak money in production. "The old slop had an owner. The new slop has an approver."

Harness >> model.

UCL reverse-engineered Claude Code: 1.6% AI logic, 98.4% harness. The believer playbook is overwhelmingly about CLAUDE.md, subagents, hooks, eval loops, sandboxes — not "which model."

"Works on legacy" is a dial, not a fact.

donnfelker says 10-year Java/Kotlin works fine — if you context-engineer it. bindureddy says complex human codebases are a nightmare. Both are right, depending on whether you paid the context-infrastructure tax.

The economics are still subsidized.

Cursor's $200/month plan reportedly costs the company ~$5,000 in compute (up from $2,000 last year). melvynx burned $536 in 4 days on the Cursor API. The ROI debate assumes today's VC-subsidized prices.
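The size of the subsidy follows from the quoted figures with a little arithmetic. A rough sketch: the function names are illustrative, the 30-day projection is an extrapolation, and the prior-year ratio assumes the plan price was also $200 back then, which the source does not state.

```python
# Figures as quoted in the text; all are reported or approximate, not audited.
PLAN_PRICE_USD = 200        # monthly plan price
COMPUTE_NOW_USD = 5_000     # reported compute cost behind one plan today
COMPUTE_PRIOR_USD = 2_000   # reported compute cost a year earlier


def subsidy_ratio(compute_cost, plan_price):
    """Dollars of compute consumed per dollar of subscription revenue."""
    return compute_cost / plan_price


def daily_burn(total_spend, days):
    """Average spend per day over a metered-usage window."""
    return total_spend / days


ratio_now = subsidy_ratio(COMPUTE_NOW_USD, PLAN_PRICE_USD)
# Assumption: the plan also cost $200 a year earlier; the source does not say.
ratio_prior = subsidy_ratio(COMPUTE_PRIOR_USD, PLAN_PRICE_USD)
# melvynx's reported spend: $536 over 4 days on metered API pricing.
burn = daily_burn(536, 4)

print(f"compute per plan dollar: {ratio_now:.0f}x now, {ratio_prior:.0f}x a year earlier")
print(f"metered API burn: ${burn:.0f}/day, about ${burn * 30:,.0f} over 30 days at that pace")
```

Read this way, the flat plan covers roughly 1/25th of its reported compute, and four days of metered use at melvynx's pace already costs more than two months of the subscription.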

Comprehension debt is the new tech debt.

Jason Bosco (Typesense CEO): the failure mode isn't bugs, it's nobody understanding the codebase. A downward spiral where tests pass and humans can't tell good agent output from spaghetti.

"Production-ready" has a skill curve.

aakashgupta: "One prompt into Claude Code gives you AI slop. 21 specialized agents working together gives you a production app shipped to TestFlight." The believer/skeptic split largely tracks where you are on that curve.

The benchmark gap is closing — but slowly.

ArtificialAnalysis Coding Agent Index: best score is 61 (Opus 4.7 in Cursor CLI). Open-weights GLM-5.1 hits 53. Composer 2 hits 48 at 1/30th the cost. Progress is real; "solved" is not.

07 Where it actually works · May 2026

A buyer's matrix for the May 2026 state of the art, grounded in the ArtificialAnalysis benchmark + the production stories above.

If you want…	Pick	Why
Best raw production score	Opus 4.7 in Cursor CLI (61)	ArtificialAnalysis Coding Agent Index
Cheapest production-grade agent	Cursor Composer 2 ($0.07/task, 48)	Same benchmark
Best open-weights	GLM-5.1 in Claude Code (53)	Top open-weight result
Speed-first	Opus 4.7 in Claude Code (~6 min/task)	Fewest turns per task
Multi-agent at scale	Composio Agent Orchestrator	30 parallel agents, worktree isolation
Marketing pages fast	Google Stitch 2.0 + Claude Code	Solves Claude's frontend weakness
Trustworthy on legacy	Claude Code + 3-tier memory	Codified Context paper / 108k LOC C#
Solo-founder live prod edits	Claude Code direct	levelsio's actual workflow
Reliability you won't regret	Eval-first → spec → code	synopsi: "90% of my time on evals"
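The "eval-first → spec → code" row can be read as: write the acceptance cases before any implementation exists, then accept agent output only if it clears them. A minimal sketch of that gate follows. `EvalCase`, `pass_rate`, `accept`, and the charge-deduplication task are hypothetical names and examples, not taken from any of the quoted workflows.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    name: str
    args: tuple
    expected: Any


def pass_rate(candidate: Callable, cases: Sequence[EvalCase]) -> float:
    """Fraction of eval cases the candidate gets exactly right."""
    if not cases:
        raise ValueError("write at least one eval case first")
    passed = 0
    for case in cases:
        try:
            if candidate(*case.args) == case.expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed case
    return passed / len(cases)


def accept(candidate: Callable, cases: Sequence[EvalCase], threshold: float = 1.0) -> bool:
    """Gate: only accept a candidate whose pass rate meets the threshold."""
    return pass_rate(candidate, cases) >= threshold


# Evals come first. The duplicate case encodes "what if this is called twice?"
CASES = [
    EvalCase("distinct charges kept", (["a", "b"],), ["a", "b"]),
    EvalCase("duplicate charge dropped", (["a", "a"],), ["a"]),
    EvalCase("empty input", ([],), []),
]


def naive_candidate(charge_ids):
    """Plausible first draft: ignores duplicates entirely."""
    return charge_ids


def fixed_candidate(charge_ids):
    """Order-preserving dedupe using dict key uniqueness."""
    return list(dict.fromkeys(charge_ids))


print("naive accepted:", accept(naive_candidate, CASES))
print("fixed accepted:", accept(fixed_candidate, CASES))
```

The point of the ordering is that the duplicate-call question is answered by a failing case before review, rather than by a reviewer asking it after merge.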

08 Methodology

How this was built

Recipe: x-discourse-research v1.2.0 — 8 parallel Grok X-search queries, each with a distinct angle.
Date window: 2026-02-12 → 2026-05-12 (90 days).
Engagement floor (hard filter): view_count ≥ 1,000 OR like_count ≥ 10 OR follower_count ≥ 1,000 (OR logic).
Credibility rubric: 1–5 score per post (shipping evidence + engagement signal + role).
Quote integrity: all quotes verbatim, all URLs working x.com links, full transcript available on request.
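The engagement floor is a three-way OR, so a post survives if any single signal clears its threshold. A minimal sketch of that filter: the field names follow the floor definition above, while the `Post` class and the sample posts are hypothetical and are not drawn from the real dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    handle: str
    view_count: int
    like_count: int
    follower_count: int


# Thresholds from the engagement floor definition.
MIN_VIEWS = 1_000
MIN_LIKES = 10
MIN_FOLLOWERS = 1_000


def passes_floor(post: Post) -> bool:
    """A post passes if ANY one of the three signals meets its threshold."""
    return (
        post.view_count >= MIN_VIEWS
        or post.like_count >= MIN_LIKES
        or post.follower_count >= MIN_FOLLOWERS
    )


# Hypothetical sample posts, invented for illustration only.
posts = [
    Post("@big_account", view_count=50, like_count=2, follower_count=12_000),
    Post("@viral_small", view_count=8_000, like_count=3, follower_count=200),
    Post("@liked_small", view_count=400, like_count=25, follower_count=150),
    Post("@quiet", view_count=120, like_count=1, follower_count=90),
]

kept = [p for p in posts if passes_floor(p)]
rate = len(kept) / len(posts)

print([p.handle for p in kept])
print(f"pass rate: {rate:.1%}")
```

Because the conditions are OR-ed rather than AND-ed, the floor is permissive, which is consistent with the report's own numbers: 158 of 159 raw posts passed.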

Metric	Value
Raw posts	159
Passed floor	158
Pass rate	99.4%
Median views	23,425
Credibility ≥ 4	94 / 158 (59.5%)
Small-account viral	3
Stack #1 mentioned	Claude Code (64)
Angles run in parallel	8