Cost to Implement vs Cost to Verify
The Wrong Scoreboard
The discourse on coding agents has spent the past year obsessing over the wrong question. The main focus has been what models can do: lines written, autonomous minutes, benchmark scores, model cards, percent of lines shipped by AI. These are all generalized measures of implementation throughput. Useful for a bird's-eye view of model progress, but they say almost nothing about where the actual bottlenecks now live. The operative question for practitioners in 2026 is not what tools can do; it's what you should ask them to do.
Answering the "should" question requires a different lens than the capability benchmarks provide. Every task you might hand to a coding agent has two costs that matter: the cost to implement (Ci) — the time and expertise needed to produce the code — and the cost to verify (Cv) — the time and expertise needed to confirm the code is correct. The relationship between these two variables determines whether delegation is a net win or a liability.
Aside: About "Delegation"
When I first outlined this in November 2025, I was comparing hand-coded vs AI-delegated implementations. My workflow has changed significantly since then: I rarely hand-write code.
The relevant choice for me is now between pair programming with the agent (high-touch, Socratic, every structural decision is guided) and delegating (agent leads research, planning, and implementation; you just review the output of each phase). The pair programming model is mentally just as involved as writing code, but mechanically faster. The delegation model is now very different, allowing you to run and ship five separate feature PRs in parallel (not some clickbait Xitter "I ran 100 agents in parallel today" make-work slop, but five actual product increments, in parallel, in a brownfield codebase).
Whatever the threshold of delegation is, in my experience the framework below applies.
The Two-Variable Framework
When both costs are low, it doesn't matter what approach you take — the task is trivial either way. When Ci is high but Cv is low, delegate freely; the implementation is a job for the agent, and you can cheaply confirm the result. The inverse is equally clear: when Ci is low but Cv is high, build a detailed mental model by taking part in every step of the process.
The dangerous quadrant is top-right. When both costs are high, there's a huge incentive to spin the slot machine many times and see if the agent just happens to nail the task. Compared to hand coding, where you burn days or weeks before you can ascertain quality, the agent might succeed at the same or higher quality after just 60 minutes of work. For complex or off-distribution work, it may be a small chance... but that makes it even more tempting!
By skipping the mental effort, you go in blind on an equally demanding task: verification. This is the trap. The models have dramatically compressed Ci across the board. Cv has not moved at the same rate — and in many cases, without careful developer intervention, it has gotten worse.
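To make the quadrants concrete, here is a minimal sketch of the framework as a decision helper. The names (`Mode`, `delegation_mode`) are hypothetical, invented for illustration; what counts as "high" for either cost is a per-task judgment call, not something a function can decide for you.

```python
from enum import Enum

class Mode(Enum):
    # The four quadrants of the Ci/Cv framework
    TRIVIAL = "either approach works"
    DELEGATE = "delegate freely, verify cheaply"
    PAIR = "stay in the loop on every step"
    DANGER = "invest in verification before delegating"

def delegation_mode(ci_high: bool, cv_high: bool) -> Mode:
    """Map cost-to-implement and cost-to-verify to a working mode."""
    if ci_high and cv_high:
        return Mode.DANGER      # top-right: the trap quadrant
    if ci_high:
        return Mode.DELEGATE    # expensive to build, cheap to check
    if cv_high:
        return Mode.PAIR        # cheap to build, expensive to check
    return Mode.TRIVIAL
```

The point of the `DANGER` branch is exactly the trap above: compressed Ci makes it look like a `DELEGATE` case, but the unaddressed Cv is still waiting for you.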
Vibecoding and the Unaddressed Variable
Vibecoding is the logical extreme of treating Ci as the only variable. Previously, architecture decisions were bottlenecked by implementation cost. Releasing that constraint completely, without addressing verification cost, is a major failure mode. Any frequent flyer on Claude Code has experienced this as an end user of an entirely AI-coded application: the constant UI bugs, unintended changes to history cells, broken permission models... I've written about the flickering issues before, and I've been annoyed that sandboxing persistently pollutes the workspace with empty files (an issue that has been recurring in different forms for three months now). Users of the Claude Web environment, of Cursor, and of many other almost fully AI-coded products experience the exact same degradation of quality as the software grows this rapidly. It's not just that more features lead to proportionately more bugs. When you don't build a mental model of the codebase, you've skipped your first pass at verifying the logic, and you've gone without a map of which parts need verification. The consequence isn't just bugs — it's verification blindness: you don't know what you don't know.
This is a common failure mode that many teams have fallen into, particularly startups that feel the keenest urgency to ship faster.
Verification Debt
As a result, the most common form of tech debt in these highly agentic codebases comes from growing your feature surface area too fast and too loose. Every agentic feature shipped without a corresponding verification investment degrades your ability to autonomously ship future features. This is a compounding liability, not a fixed cost: first it accumulates, and then, because these changes can have cross-cutting technical concerns or act as bad examples for future work, it compounds. Unit and integration testing become slightly more important to compensate. But E2E behavioral verification becomes far more important, because that's the layer the agent generally cannot self-evaluate on its own. Skipping this investment creates verification debt.
Detour: What Spec-Driven Development Gets Right (and Wrong)
The popular framing of spec-driven development is wrong on two counts. It's not about making prompt copy-paste easier, and it's not about closing the loop on "Ralph Wiggum" workflows — generate, test, regenerate. These framings chase short-term speedups that don't touch the real bottleneck.
The original insight of a specification is much more important: you cannot verify an implementation if you don't know its intent. A specification is the textual or symbolic description by which different readers arrive at the same mental model. In distributed teams, software has long relied on PR review, ADRs, box and sequence diagrams — this is fundamentally the sharing of intent. You must know the developer's intent before you can review their outputs. This is not new; it's just more urgent now.
The genuine unlock of specs is that they literalize the behavior you need to verify. Once combined with simulation environments — headless browsers, terminal puppeteering, API smoke tests — your specs become the instructions for agentic verification after agentic implementation. The loop closes not at the generation layer, but at the verification layer.
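As a sketch of what "specs as verification instructions" means at the smallest possible scale, here the spec is literalized as named behavioral checks. `render_greeting` is a hypothetical stand-in for the system under test; in practice each check would drive a headless browser, a terminal puppet, or an API smoke test rather than call a function directly.

```python
# Hypothetical stand-in for the system under test.
def render_greeting(name: str) -> str:
    return f"Hello, {name}!"

# The spec, literalized: each entry names an intended behavior and
# pairs it with an executable check of that behavior.
SPEC = [
    ("greeting addresses the user by name",
     lambda: "Ada" in render_greeting("Ada")),
    ("greeting reads as a complete sentence",
     lambda: render_greeting("Ada").endswith("!")),
]

def verify(spec) -> list[str]:
    """Return the descriptions of every behavior the system violates."""
    return [desc for desc, check in spec if not check()]
```

An empty result means every stated intent is satisfied; a non-empty one tells the agent (or you) exactly which behavior to investigate — the spec has become the map of what needs verification.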
Beware Reflexivity
These two variables don't stay independent — they affect each other over time. Shipping too quickly raises Cv as verification debt accumulates. Higher Cv in turn raises future Ci — the agent's implementation speedup erodes as the codebase becomes harder to reason about, harder to test against, and bad patterns get committed and cargo culted. This is the mechanism by which agentic coding gains can crash down to earth, trash a codebase, and turn a team against AI coding tools entirely.
But reflexivity runs both ways. The virtuous path runs in the opposite direction: to genuinely ship faster, teams must aggressively use the low cost of implementing new tooling as a lever to decrease the cost of verification.
Solutions: Making Verification Tractable
Three concrete approaches, in increasing order of specificity:
1. Simulation environments at full breadth.
- Standard integration/automation/E2E testing practices are table stakes, but the agent's reach is now wider.
- Headless browsers, Chrome DevTools protocol, pseudo terminals, API smoke tests, microVMs, containers — the agent can drive all of these.
- The question is no longer "can we automate this?" but "have we specified what to automate?"
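A minimal sketch of the simplest rung on this ladder: a smoke test that exercises a tool end-to-end through a real subprocess. Here the Python interpreter stands in for your actual CLI, purely so the example is self-contained; the pattern is the same for any command the agent can drive.

```python
import subprocess
import sys

def smoke_test(cmd: list[str], expect: str) -> bool:
    """Run a command in a real process and check its observable
    behavior (exit code and output), not its source code."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    return result.returncode == 0 and expect in result.stdout

# Stand-in for your CLI: the interpreter printing a known string.
ok = smoke_test([sys.executable, "-c", "print('hello')"], "hello")
```

The same shape scales up to headless browsers and microVMs; what changes is only how much environment you have to stand up before the check runs.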
2. Making the runtime legible.
- E2E tests don't capture everything: service startup timing, internal program state, functional SLAs like "no UI interaction blocks for more than 2s".
- An ephemeral, per-worktree observability stack — logs, metrics, traces — makes runtime behavior tractable to the agent.
- This is the difference between the agent knowing the tests pass and the agent understanding how the system is behaving.
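As a sketch of what runtime legibility buys you, assume (purely for illustration — the log schema and field names are invented) a per-worktree JSON-lines log with `event` and `duration_ms` fields. The "2s interaction budget" SLA then becomes something an agent can check mechanically rather than eyeball:

```python
import json

# Illustrative log lines; the schema is an assumption, not a real
# tool's output format.
LOG = """\
{"event": "ui.click", "duration_ms": 120}
{"event": "ui.submit", "duration_ms": 480}
{"event": "db.query", "duration_ms": 3100}
"""

def sla_violations(log_text: str, prefix: str = "ui.",
                   budget_ms: int = 2000) -> list[dict]:
    """Return every event under `prefix` that blew the latency budget."""
    events = (json.loads(line) for line in log_text.splitlines() if line)
    return [e for e in events
            if e["event"].startswith(prefix) and e["duration_ms"] > budget_ms]
```

Note that the slow `db.query` doesn't violate the UI budget here; widening the prefix (or adding a second budget) would flag it. That distinction — which slowness matters where — is exactly what a test-pass/fail signal alone can't express.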
3. Bespoke verification tooling is now nearly free.
- Personal example: while building a PTY proxy around several TUI tools, codifying system invariants and nightly fuzzing jobs cost only ~two extra hours.
- I implemented fuzzing with reproducible seeds and system state captures at failure.
- I can put my tool through more comprehensive testing than I could have ever justified without the agentic contributions.
- The economics of verification tooling have shifted — the agent makes it cheap to build guardrails and course correction, so the only remaining question is where those guardrails should go.
The New Discipline
Coding agents haven't changed what good software engineering is, but they have changed where the leverage point is. The developers extracting the most durable value from these tools are the ones who have reinvested implementation gains into verification infrastructure. The question to ask before every agentic task is not "can the agent build this?" but "how will I verify what was built?"
If you want two immediate, concrete steps to improve your verification tools:
- After you run through a research-plan-implement cycle, have your agent carry out the implementation using TDD. Here's a general-purpose agent skill to do so.
- If you want an agent to go through a manual end-to-end test on your project, consider giving it the tools to interact directly. Here's a skill for the agent to puppet a browser with Playwright. Here's a skill for the agent to puppet a TUI with tmux.
I am hoping that this will be just the first of three core posts about the new practices in software engineering. To signpost properly where I think this is going, here are the three ideas that are changing how I think about software development:
- Verification debt. (this post)
- Agent legibility.
- Compounding correctness.
