Why Your Team's AI Coding Tool Breaks at Scale (And It's Not the Model)
Every team benchmarks AI coding tool quality. Almost nobody asks where the control surface lives or who owns shared state when the session ends. That second question is the one that breaks teams at scale.

Three control surfaces, three valid choices. The decision is architectural, not a quality ranking.
Boris Cherny built Claude Code, currently one of the most-discussed AI coding tools. He also said his setup is "surprisingly vanilla." No elaborate prompt engineering. No secret model config. No custom system prompt that explains the tool's power. "There is no one correct way," he wrote. The thing that actually matters is the CLAUDE.md, the permission configs, the verification loops, and the hooks the user builds around the tool. [4] [5]
That is a strange confession if you are in the middle of a Cursor-versus-Claude-Code debate. The creator of one of the tools is telling you the differentiator is not what the tool does by default. It is what you build around it. Where control lives. Who holds it.
Teams spend weeks on the wrong side of that distinction. They compare rework rates and benchmark scores, and read through Product Hunt threads looking for the verdict on model quality. Those comparisons are real. The data in them is worth reading. But they are the second question. The first question never gets named: where does the human's steering point need to live, and who owns the durable record of the work when the session ends?
That first question is where teams break at scale. I have written before about why AI coding agents lose context across sessions: the session gets long, state drifts, the agent starts missing constraints it already saw. This post is the upstream version of that problem: not how the session fails, but which tool architecture sets you up for it.
Everyone Benchmarks Output Quality. That Is the Wrong Race.
The community discourse on AI coding tool selection is almost entirely a quality conversation. Cursor or Claude Code? Which has fewer reworks? Which writes cleaner code on the first pass?
The Product Hunt thread "Cursor or Claude Code?" has the kind of data that makes this look like a settled question. Optibot's numbers, cited by commenters, put Claude Code at roughly 30% fewer reworks. That is a meaningful gap. [1] And yet the thread is full of developers who acknowledge the data and still resist switching from Cursor. The resistance is not irrational. Read the comments closely and you find the actual reason: they do not want to lose the in-editor control surface. They are not choosing the worse tool. They are choosing the tool that fits where their steering point needs to live.
The second thread, titled "AI in your IDE vs AI in your terminal," frames the split more honestly: it is a control-surface preference, not a quality preference. IDE-native visual diff versus terminal-driven autonomous runs. [2] The community already feels the distinction. They just do not have a name for it.

Birgitta Böckeler's analysis of spec-driven development tools makes the same point from a different angle. Her teardown of Kiro, spec-kit, and Tessl shows that the tools differ most meaningfully on where specs live and who owns the state. Not on model output quality. Spec drift, the failure mode where agents ignore instructions in large contexts, happens downstream of a state-ownership problem. The quality question is already downstream of the state question. [6]
If you benchmark model output without first deciding where control lives and who owns shared state, you are measuring the wrong variable. The 30% rework gap is real. It also collapses when shared state drifts.
What "Control Surface" Means for AI Coding Tools
Control surface is the place where a human's steering actions live. In a coding tool context, it has a specific meaning: where can the human see what the agent is doing, redirect it, constrain it, and verify it? And when the session ends, what persists?
There are three types in current practice.
IDE-native. The steering point lives inside the editor. The human sees diffs, inline suggestions, and agent output in the same pane where they write code. Shared state is the open file and the in-editor context window. When the session ends, nothing persists beyond what was committed to the repo. Cursor is the reference implementation of this type. [1] [2] The control surface is tight, visible, and co-located with the work. The cost is that it is also session-local.
Terminal / agent-first. The steering point moves to the terminal or a shell-level instruction surface. The agent takes longer autonomous runs: it reads and writes files, executes commands, manages context across many steps. The human does not watch every diff. Instead, they build guardrails around the tool: CLAUDE.md, permission configs, hooks, verification loops. Cherny's own setup is the evidence here. He is not working from a more powerful default configuration. He is working from a more intentional control surface that he built himself. [4] [5] HN practitioners confirm the pattern: they keep "hands on the wheel," treat agents like junior developers, and keep the durable contract in organized in-repo markdown. [3] The control surface is more spacious. The cost is that it requires the human to build and maintain it.
External board over MCP. The steering point is outside both the editor and the agent. Shared state lives in a queryable external store: a task board, a workflow engine, a project ledger that the agent connects to over a protocol. The session can end at any point. The state was never inside the session to begin with. Böckeler names this dimension in her teardown: the meaningful distinction is between tools that externalize the spec as the source of truth and tools where the agent owns state in-context. [6] The control surface is durable and team-accessible. The cost is that it requires a separate integration layer.
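To make the third type concrete, here is a minimal sketch of state that lives outside the session. It is an illustration, not how any particular board or MCP server works: the `TaskBoard` class, the task names, and the JSON file standing in for the external store are all hypothetical. The only point it demonstrates is that a second session can read what the first one recorded.

```python
import json
import tempfile
from pathlib import Path


class TaskBoard:
    """Hypothetical file-backed board; a JSON file stands in for the external store."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def _load(self) -> dict:
        # State is re-read from the store on every call, never cached in the session.
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {}

    def set_status(self, task_id: str, status: str, actor: str) -> None:
        # Record who changed the task so ownership is part of the stored state.
        tasks = self._load()
        tasks[task_id] = {"status": status, "updated_by": actor}
        self.path.write_text(json.dumps(tasks, indent=2))

    def by_status(self, status: str) -> list[str]:
        # Query the store by status; results are sorted for stable output.
        return sorted(t for t, rec in self._load().items() if rec["status"] == status)


def demo() -> list[str]:
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp) / "board.json"
        # Session 1: an agent records progress, then the session ends.
        session_one = TaskBoard(store)
        session_one.set_status("auth-refactor", "blocked", actor="agent-a")
        session_one.set_status("add-tests", "open", actor="agent-a")
        del session_one
        # Session 2: a different actor, possibly a different tool, reads the same state.
        session_two = TaskBoard(store)
        return session_two.by_status("blocked")
```

Calling `demo()` returns `["auth-refactor"]`: the blocked task survives the end of the first session because it was never held there. A real deployment would put a protocol layer between the agent and the store; the file here only keeps the sketch self-contained.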

The taxonomy is not a ranking. Each type has a place. The problem is not which type teams choose. The problem is that they usually do not choose. The control surface decision gets made by default, by whichever tool a team adopts first and however many tools accumulate over time without a shared state strategy.
Shared State Is the Part That Actually Breaks
Running one AI coding tool is manageable. Running three (which is where many engineering teams end up once AI coding becomes part of the standard workflow) creates a shared-state problem that model quality benchmarks do not surface.
The HN "spec-driven development" thread captures the manual version of this clearly. Developers keep the durable contract in organized in-repo markdown: CLAUDE.md, AGENTS.md, ARCHITECTURE.md files that serve as persistent memory for the agent. [3] That works for a solo developer running one tool in one repo. It compounds at team scale when multiple people, multiple agents, and multiple tools all need to read and update the same work state.
Böckeler's analysis adds the technical detail: spec drift occurs when agents ignore instructions in large contexts. The percentage of sessions where the agent misses a constraint is not a fixed property of the model. It is a function of how well shared state is maintained. When state drifts, quality benchmarks drift with it. [6]
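One way to picture drift is as a measurable gap between the durable record and what a session is actually carrying. The sketch below assumes constraints are plain strings and that a session's current view of them can be listed; both are simplifications, and `missing_constraints` is a hypothetical helper, not a feature of any tool named in this post.

```python
def missing_constraints(durable: list[str], session_view: list[str]) -> list[str]:
    """Return durable constraints the session no longer carries, in original order."""
    # Normalize case and surrounding whitespace so trivial edits do not count as drift.
    seen = {c.strip().lower() for c in session_view}
    return [c for c in durable if c.strip().lower() not in seen]


durable = ["No new dependencies", "Keep public API stable", "All changes need tests"]
session_view = ["keep public API stable", "All changes need tests "]
drifted = missing_constraints(durable, session_view)
```

Here `drifted` is `["No new dependencies"]`. The check is only possible because the durable list exists somewhere outside the session; if the context window is the sole copy, there is nothing to compare against.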
Boris Cherny described the fleet-scale version of this in Fortune. At a certain scale, the human role shifts from prompting to "running loops over fleets of agents." [7] You are no longer steering one session. You are operating a system. A context-window-only shared state does not survive that transition. The control surface stops being a UX preference and becomes an operational contract.
The counterargument worth addressing directly: model output quality matters. A lot. This post is not arguing otherwise. A tool that writes worse code is harder to use regardless of where the control surface lives. The claim is narrower: model quality is a second-order variable. The same model performing at quality X in a well-maintained shared-state environment will underperform in a drifted one. The community's resistance to Claude Code despite documented quality advantages is the evidence. The buyers are not ignoring the data. They are sensing a control-surface mismatch and do not yet have a frame for it. [1]
For the multi-agent version of this problem, where the handoff contracts between agents are the breaking point, the role-separation post covers the architecture in detail. The shared-state problem at the single-tool level described here is the prerequisite to that one.
Two Questions That Decide the Choice
The model benchmarks are worth reading. Read them after you answer two questions that they cannot answer for you.
Question 1: Where does the human's steering point need to live?
In the editor, in the terminal, or outside both?
This is a team workflow question before it is a tool preference. A team that reviews code through in-editor diffs and inline comments will not adopt a terminal-first tool regardless of rework rates. The control surface is incompatible with their review process. A solo developer working in long autonomous-run bursts does not need an IDE pane watching every suggestion; they need good guardrails around an agent that can run unsupervised.
Neither group is wrong. They need different control surfaces.
Question 2: Who owns shared state when the session ends?
If the answer is "nobody" or "the context window," state drift is accumulating invisibly. Plans, constraints, task status, and approval records that existed in one session do not automatically transfer to the next one, or to a different tool, or to a different person on the team.
The choice of control surface determines whether state is scoped, queryable, and owned, or implicit, session-local, and lost.
The decision rule, stated as my inference from the evidence above:
- Teams that need visual control, in-editor diffs, and tight human oversight per change: IDE-native tools fit the control surface they already use.
- Developers working in autonomous-run bursts who will invest in building their own guardrails: terminal/agent-first tools are intentionally open for exactly that. The vanilla setup is the starting point, not the ceiling. [4]
- Teams running multiple people or multiple agents who need a shared, durable record of what is open, blocked, approved, and changed: the third type of control surface is the missing layer. An external shared-state store that agents connect to over MCP (Agiflow is one concrete instance of this pattern) is not a coding tool. It is the persistent state layer any of the above can read and update without the session being the contract.
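The decision rule above can be written down as a small function. This is a sketch of the inference in this post, not a validated selection procedure; the parameter names and the `ControlSurface` enum are hypothetical labels for the three cases in the list.

```python
from enum import Enum


class ControlSurface(Enum):
    IDE_NATIVE = "IDE-native"
    TERMINAL_AGENT_FIRST = "terminal/agent-first"
    EXTERNAL_BOARD = "external board over MCP"


def recommend(
    per_change_review: bool,
    builds_own_guardrails: bool,
    multi_actor: bool,
) -> list[ControlSurface]:
    """Map a team's answers to the control surfaces that fit them."""
    picks = []
    if per_change_review:
        # Visual control and in-editor diffs on every change.
        picks.append(ControlSurface.IDE_NATIVE)
    if builds_own_guardrails:
        # Autonomous-run bursts with user-built guardrails.
        picks.append(ControlSurface.TERMINAL_AGENT_FIRST)
    if multi_actor:
        # The external store is a state layer alongside a coding tool, not a replacement.
        picks.append(ControlSurface.EXTERNAL_BOARD)
    return picks
```

For a team that reviews every change in the editor and also runs several agents, `recommend(True, False, True)` returns the IDE-native surface plus the external board. An empty list means neither question has been answered yet, which is the default case this post is warning about.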

If you are handing this decision to someone else on your team, the two questions are the right place to start. Send this to the person who owns your team's AI tooling decision.
The Model Benchmarks Are the Second Question
Pick the AI coding tool that fits the control surface your team will actually use. Then pick the model.
Teams that skip the first question end up with the visible problem: good model quality, growing state drift, and unclear handoffs between sessions, tools, and people. The benchmark that looked decisive turns out to be measuring a second-order variable.
The control surface and shared-state questions get answered by default if you do not name them. The default is usually "whichever tool ships the best demo in February." That is a fine way to start. It is not a policy.
If you are evaluating spec-driven development tools specifically and want to apply the same lens to that ecosystem, the companion post maps where AI project memory lives across every major SDD tool. The framing is the same. The tools are different.
TL;DR
- Teams pick AI coding tools by model quality. The decision that breaks them at scale is different: where the control surface lives and who owns shared state when the session ends.
- Three control surface types exist in practice: IDE-native (in-editor steering, session-local state), terminal/agent-first (guardrails built by the user, in-repo shared state), and external board over MCP (durable state outside both agent and repo).
- The same model performing at quality X in a well-maintained shared-state environment will underperform when state drifts. Model benchmarks are a second-order variable.
- Two questions before any benchmark: where does the team's steering point need to live, and who owns shared state when the session ends?
- The model quality gap is real. Answering the control-surface question first is what makes it matter.
References
[1] Product Hunt: "Cursor or Claude Code?" community discussion. https://www.producthunt.com/p/cursor/cursor-or-claude-code. Captured 2026-06-24. Community claim: Optibot data cited in-thread showing Claude Code ~30% fewer reworks; commenters resist adoption over loss of visibility/control.
[2] Product Hunt: "AI in your IDE (Cursor) vs AI in your terminal (Claude Code): what's the better flow?" https://www.producthunt.com/p/vibecoding/ai-in-your-ide-e-g-cursor-vs-ai-in-your-terminal-claude-code-what-s-the-better-flow. Captured 2026-06-24. Community claim: the Cursor/Claude Code split framed as a control-surface flow preference, not a quality difference.
[3] Hacker News: "Ask HN: Are you still using spec driven development?" https://news.ycombinator.com/item?id=46864948. Captured 2026-06-24. Practitioner claims: developers keep "hands on the wheel," treat agents like junior devs, and keep the durable contract in organized in-repo markdown.
[4] Boris Cherny (@bcherny): X post on his Claude Code setup. https://x.com/bcherny/status/2007179832300581177. Captured 2026-06-24 (post ~June 2026). Creator claim: "Surprisingly vanilla," "works great out of the box," "no one correct way"; control lives in CLAUDE.md, permissions, hooks, and verification loops the user builds around the tool.
[5] Karo Zieminski: "How Boris Cherny Uses Claude Code," Product with Attitude (Substack). https://karozieminski.substack.com/p/boris-cherny-claude-code-workflow. Captured 2026-06-24. Analysis corroborating [4]: power and responsibility "live together in the user's hands"; guardrails (CLAUDE.md, permission configs, verification loops, hooks) rather than model defaults.
[6] Birgitta Böckeler: "Understanding SDD: Kiro, spec-kit, Tessl," martinfowler.com. https://martinfowler.com/articles/exploring-gen-ai/sdd-3-tools.html. Published 2025-10-15. Verified authoritative source: tools differ on where specs live and who owns state; spec drift occurs when agents ignore instructions in large contexts.
[7] Fortune: "Anthropic's Boris Cherny, creator of Claude Code, says there are days he manages tens of thousands of AI agents at once." https://fortune.com/2026/06/08/anthropics-boris-cherny-creator-of-claude-code-says-there-are-days-he-manages-tens-of-thousands-of-ai-agents-at-once/. Published 2026-06-08. Verified publisher: at scale the human role shifts from prompting to running loops over fleets of agents.