Measure Twice, Cut Once
AI made the cutting cheap. The measuring is where the craft lives now.

This is the follow-up to "Claude Is the Mechanic. You're the Owner of the Car — and the Throttle Has Moved." That post was about the shape of the feeling — how AI changes what development feels like from the inside. This one is about the shape of the work: the rituals, the reviews, the flow from spec to deploy. What I think the new development process should actually look like. Some of it is conviction. Some of it is speculation I'd want to test in practice. I'll flag the difference.
Measure Twice, Cut Once
There's a rule every carpenter learns before they're allowed near a saw: measure twice, cut once. You check your work before you make it irreversible. Measure wrong and cut wrong, and the wood becomes scrap.
I learned that rule early. As an auto body mechanic, later as a machinist, then as a model maker, then a mechanical engineer. In every physical trade I worked, the pre-work mattered more than the work. The cutting was the easy part. The measuring was the craft.
I think that rule matters more in AI development than it did in any of them.
AI made the cutting cheap. The code is fast. The rebuild is fast. The regeneration, when you realize you got it wrong, is fast. What's expensive is still what it always was: getting the intent right before you touch the saw.
Most of the AI-development conversation right now is about how to cut faster. I think the interesting question is how to measure better. This post is about what I think that looks like — the rituals, the reviews, the full flow from spec to deploy. Where the old carpenter's rule applies, and where it's changed.
The Old Rituals Don't Fit
Stop me if you've heard this before. You join a new team. They run two-week sprints, standups every morning, sprint planning on Mondays, retros on Fridays. Story points are estimated. Tickets move across a board. Feature branches are merged after one or two engineers approve them. Releases go out on a schedule. Staging environments exist for a reason.
That whole ritual stack was designed for humans coordinating with humans, at the speed a human can write and review code. None of it was designed for what we're doing now. And if you squint, every single piece of the old SDLC is under pressure from AI right now, in its own specific way.
Project management. Tickets were a coordination primitive for human teams. Factory AI Droids take tickets straight off Linear or Jira as their execution prompt. Notion 3.3's agents auto-summarize what happened overnight into async digests that replace the standup. The frame shifts from "coordinate the humans" to "coordinate the humans and the agents they're orchestrating." Story points die. They were a human-velocity metric that means nothing when the typing isn't the bottleneck.
Source control. Git Flow with long-lived branches was survivable when humans produced code at human speed. At AI output volumes, the big-bang merge becomes impossible — not hard, impossible. Trunk-based development with hour-scale branches and universal feature flags is the only model that holds up. This is, incidentally, the same call I made at Washington Post years before anyone was thinking about AI. It was the right call then because it removed a kind of friction. It's the right call now because the friction would otherwise crush us.
CI/CD. Your pipelines should self-heal. A failing build isn't a person's 3am problem anymore — it's a task for an agent that reads the failure, diagnoses the cause, and opens a PR with the fix. Tests that flake get triaged and rewritten by the agent. Deploys that anomaly-out get rolled back automatically. Every deploy assumes failure and plans for it.
Infrastructure. Infrastructure as Code (IaC) — Terraform, Pulumi, CDK — written from specs, not written by hand. Unused resources flagged by agents. Continuous cost audits. The infrastructure layer inherits the same "disposable, regeneratable" treatment as the code sitting on top of it.
Testing. Tests regenerate when specs change. Property-based and fuzz tests drafted by AI. The coverage metric — which always felt a little game-able — gets replaced by "failure modes enumerated." QA as a click-through-the-form role disappears; QA as evaluation harness design becomes more important than ever.
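What property-based testing looks like in practice, as a minimal hand-rolled sketch in Python. The `slugify` function and its three properties are invented for illustration, and a real setup would use a library like hypothesis rather than this loop; the point is the shape: you enumerate invariants, and the machine enumerates inputs.

```python
import random
import string

def slugify(text: str) -> str:
    """Hypothetical function under test: lowercase, separators to hyphens,
    drop anything that isn't alphanumeric, collapse doubled hyphens."""
    out = []
    for ch in text.lower():
        if ch.isalnum():
            out.append(ch)
        elif ch in (" ", "-", "_"):
            out.append("-")
    return "-".join(part for part in "".join(out).split("-") if part)

def check_properties(text: str) -> None:
    """Invariants that should hold for ANY input, not one example."""
    slug = slugify(text)
    # Property 1: output contains only safe characters.
    assert all(c.isalnum() or c == "-" for c in slug), slug
    # Property 2: idempotence, slugifying a slug changes nothing.
    assert slugify(slug) == slug, slug
    # Property 3: no leading, trailing, or doubled hyphens.
    assert "--" not in slug and not slug.startswith("-") and not slug.endswith("-")

random.seed(0)
alphabet = string.ascii_letters + string.digits + " -_!@#"
for _ in range(1000):
    candidate = "".join(random.choice(alphabet) for _ in range(random.randint(0, 20)))
    check_properties(candidate)
```

A thousand generated inputs replace the three examples a human would have hand-picked, which is exactly the "failure modes enumerated" framing.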
Incident response. The first responder is an agent. incident.io, Rootly, and AWS's own DevOps Agent are already shipping this in 2026. Humans get paged only for novel, high-severity, or business-sensitive incidents. The postmortem is drafted by the agent. The 3am page goes away for most of the paging volume.
The IDE. Steve Yegge predicted the IDE is dying by 2026. I don't think he's entirely right. I'm writing this in VSCode with Claude Code running in a side panel. Humans need a UI — whatever we work through will be one, and the IDE is the current shape of that UI for software. What's changing isn't whether the IDE survives. It's what the IDE is for. Less typing. More reviewing, pairing with the agent, inspecting diffs, tracking what the agent just did. JetBrains retired its traditional pair-programming features in March — not because IDEs are dying, but because the old role of those features was. The IDE's job is shifting from where code is written to where humans collaborate with the systems that write it.
If you pull back from the specific rituals, the through-line becomes clear. The old rituals were built to throttle humans working at human speed. Feature branches, review gates, deploy windows, change-advisory boards — every one of them was a safety valve because humans produce bugs and coordination problems at a rate that left time for other humans to catch them. AI doesn't work at that speed. Keep the old throttles and you waste the speed. Remove the throttles and you crash.
The redesign is: the throttles move, they don't vanish. From human-gates to agent-gates. From review-every-line to harness-enforced-quality. From slow-deploy to always-deploy-small-and-reversible. The discipline is the same. The location is different. And crucially, the location is one where humans can actually keep up.
Review the Tests, Not the Code
I want to spend a minute on code review specifically, because this is where the industry is most stuck.
Code review, as a ritual, was a proxy for quality. Someone wrote code. Someone else eyeballed the diff. If it looked right, it merged. If it didn't, the reviewer left comments, the author fixed them, the cycle repeated. In the pre-AI world this worked — slowly, but it worked. A senior catching a subtle bug in a junior's code was a real source of value. Knowledge spread. Consistency got enforced. Bad patterns got caught.
In the AI world, I think most code review is carrying weight for broken systems. And I think we should stop pretending.
Here's the split that I believe actually works:
Layer one: automated, no human. Linting. Type checks. Security scans. Dead code detection. AI review for obvious bugs. Test coverage thresholds. This catches something like 80-90% of what human line-by-line review used to catch. Greptile, CodeRabbit, Semgrep, knip — every piece of this is already shipping.
Layer two: test review, not code review. When a human looks at a PR, the most valuable question isn't "does this code look right?" It's "did we test the right things? Are these the edge cases that matter? Is the test suite going to catch a future regression?" Test review is higher-leverage than line-by-line review by a wide margin, especially when the code is regenerable but the tests are the durable contract with the system.
Layer three: strategic review, scoped. Architecture changes. Spec conformance. Debt creation. Cross-cutting concerns. Security-sensitive code. This is where senior eyes still matter — but they're focused on the small minority of changes that actually warrant them, not every line of every feature.
Most PRs don't need a human at all. Some need test review. A few need strategic review. Review the tests, not the code.
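The three-layer split can be written down as a routing policy. A sketch in Python; the path prefixes and the 200-line threshold are illustrative placeholders, not recommended values:

```python
def route_review(changed_paths: list[str], lines_changed: int) -> str:
    """Pick the cheapest review layer a PR can safely use.
    Prefixes and thresholds here are example policy, not a standard."""
    strategic_prefixes = ("src/auth/", "src/payments/", "migrations/", "infra/")
    if any(p.startswith(strategic_prefixes) for p in changed_paths):
        return "layer-3: strategic review"      # senior eyes on the risky minority
    if any("/tests/" in p or p.startswith("tests/") for p in changed_paths):
        return "layer-2: test review"           # a human audits the tests, not the diff
    if lines_changed <= 200:
        return "layer-1: automated gates only"  # lint, types, scans, AI review
    return "layer-2: test review"               # large untested diffs still get a human
```

The useful part isn't the specific rules; it's that the routing is explicit and versioned, so "most PRs don't need a human" is a policy the team agreed on rather than a habit that drifted.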
There's a prerequisite I want to be honest about. You can only do this if the rest of the system is strong: comprehensive automated gates, meaningful tests, feature flags, auto-rollback. If any of those is missing, skipping code review is catastrophic. Code review was carrying weight for broken systems. Fix the system, and the ritual can actually go.
Pair Deploy
If you accept what I've said about reviews, the next question is: what's the full flow? What does a healthy feature-to-production path actually look like?
Here's a version I want to try. This is where measure twice, cut once does the most work — because the cutting is now cheap and fast, and the measuring is the whole game.
Solution design stays. At Washington Post we did solution design docs for every epic — someone owned figuring out what needed to be built, talked to the PM or adjacent teams as needed, built a POC if the unknowns were big enough, then wrote a doc the team reviewed before any ticket got created. I think this ritual is more important in the AI era, not less — it's the first measurement. The doc is now load-bearing for two readers: the team who reviews it, and Claude, who uses it as implementation context. The acceptance criteria you write in the doc become the tests Claude's implementation is checked against. The doc stops being ceremonial; it becomes executable.
Modern solution design loop: brainstorm with Claude, dispatch a research subagent if there's prior art to survey, Claude drafts the doc, you refine the judgment calls, the team reviews, the doc becomes the spec.
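One way to make the doc literally executable, sketched in Python: keep the acceptance criteria as structured data, so the text the team reviews and the input the test harness consumes are the same artifact. The criteria, responses, and check functions below are all invented stand-ins:

```python
# Acceptance criteria as they'd appear in the solution design doc:
# readable by the team, consumed directly by the harness.
ACCEPTANCE_CRITERIA = [
    {"id": "AC-1", "given": "a logged-out user", "when": "they request the export",
     "then": "they get HTTP 401"},
    {"id": "AC-2", "given": "an empty dataset", "when": "the export runs",
     "then": "the file has a header row and nothing else"},
]

def run_criteria(criteria, checks):
    """checks maps a criterion id to a zero-arg callable returning True/False.
    Returns the ids of failed criteria so the harness can report them."""
    return [c["id"] for c in criteria if not checks[c["id"]]()]

class _Resp:
    """Stand-in for a real HTTP response."""
    def __init__(self, status):
        self.status = status

def fake_request(logged_in: bool) -> _Resp:
    return _Resp(200 if logged_in else 401)

def fake_export(rows):
    return ["name,value"] + [f"{name},{value}" for name, value in rows]

# Stub checks standing in for real end-to-end tests.
checks = {
    "AC-1": lambda: fake_request(logged_in=False).status == 401,
    "AC-2": lambda: len(fake_export([])) == 1,
}
```

When the spec changes, the criteria list changes, and the failing ids tell you exactly which parts of Claude's implementation no longer conform.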
Then the build-and-ship flow:
- Developer A builds the feature with AI against the solution design doc.
- Developer B tests the feature — actually uses it, audits the tests, hunts edge cases.
- Developer B fixes what they find. No ping-pong, no waiting on A. B directs Claude to resolve whatever came up and verifies.
- Short sync back with A: "Found this, fixed it this way — check me?" A confirms, or flags "wait, I did that on purpose because..." and they realign.
- Both agree. Deploy. Both names on the PR.
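The symmetric sign-off at the end of that flow is simple enough to enforce mechanically in the merge gate. A sketch in Python, with invented field names:

```python
def can_deploy(pr: dict) -> bool:
    """Two keys turned together: the builder and the tester must both
    approve, and they must be different people. Fields are hypothetical."""
    approvals = set(pr.get("approvals", []))
    return (
        pr["builder"] != pr["tester"]
        and pr["builder"] in approvals
        and pr["tester"] in approvals
    )
```

The last clause is the one that encodes the culture: one person approving twice doesn't count, no matter how confident they are.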
I'm calling this pair deploy, because it's the natural evolution of pair programming into a world where the typing isn't the bottleneck. Pair programming was hard to scale because two people worked on one keyboard — one driving, one navigating. You paid two salaries for one output stream. Pair deploying doesn't work like that. You're not pulling two people onto the same code. You're co-signing the outcome. That's cheap and it scales.
I want to be honest about why the symmetric agreement matters, because it took me a minute to figure out. When I first thought about this flow, I had the tester ship it alone. But at CloneForce, I was the one worried. I tested a teammate's work, they tested mine, and I found myself nervous that they might undo something I'd reasoned through carefully. The worry wasn't about hierarchy. It was human. The person who figured something out doesn't want their reasoning quietly overwritten, even by someone acting in good faith.
The symmetric sign-off solves it without adding latency. Both check, both ship. Two keys turned together — which is, not coincidentally, how high-stakes fields have always done this. Surgical teams confirm at checkpoints. Pilots co-sign takeoff. Nuclear ops literally requires two keys. The pattern isn't new. What's new is that the speed of AI-assisted work finally makes this affordable to do on every feature instead of only the critical ones.
A few caveats. High-risk changes — auth, payments, data migrations — still trigger Layer 3 strategic review. Pair deploy is the default for routine feature work, not the universal rule. And the cultural practices matter: both names on the PR, blameless postmortems, team-level accountability, and the language shift from "I fixed your bug" to "I closed an edge case." Same action. Completely different emotional content.
The Tech Stack Isn't a Wall Anymore
Here's something I don't see enough people writing about: AI didn't just make development faster. It collapsed the cost of working in unfamiliar tech.
The thing that used to gatekeep what a developer could build was what they knew. Python specialist? You built Python things. iOS developer? You stayed in Swift. JavaScript engineer? You did JavaScript, and anything else was a multi-month learning curve before you shipped anything useful.
That wall is mostly gone now. I've seen it, and I've been on the other side of it.
At Pearl, I was still on the old setup — Copilot and Sonnet. I built across the stack: AWS infrastructure, a Next.js frontend with websockets, a Python websocket backend without really knowing Python, and an audio processing pipeline I later migrated from one provider to another. I've built a full iOS health app without knowing Swift. At CloneForce, a colleague and I were JavaScript engineers at our core — with some Ruby and C# under our belts — but we shipped production Python backend APIs. None of this was our native tech. All of it shipped.
The shift is this: the barrier to working in a new stack used to be weeks or months of learning. Now it's a good specification and a tight feedback loop. Claude knows the language. Claude knows the conventions. Claude knows the standard patterns. What you bring is the problem-solving, the judgment about whether the solution is right, and the discipline to verify it works.
The real skill now is being the kind of developer who can take a problem and find a solution — regardless of what tech the solution lives in. I learned that long before AI. In R&D at a couple of my jobs, we had to solve problems where nobody knew what the answer looked like yet. That was always the job. Figure it out, ship it, verify it. The tech was secondary. The problem-solving was the craft.
AI doesn't remove the craft. It just removes the tech-familiarity tax on top of it.
There's an economic consequence too. A whole category of SaaS existed because building the thing was too expensive for most teams to justify in-house. Feature flags (LaunchDarkly). Search (Algolia). Event pipelines (Segment). Internal admin tools (Retool). Observability dashboards. Status pages, support tooling, CRMs shaped to how your team actually works. All of these were too expensive to build — which is why you paid the vendor. AI flipped that math. Weekend builds replace monthly bills. More importantly, the thing you build fits your team instead of bending your team around a vendor's opinions.
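The weekend-build version of a LaunchDarkly-style flag really is small. A minimal sketch in Python: deterministic percentage rollout by hashing the flag and user together, so each user's answer is stable as the ramp grows. This is the core trick most flag systems use, stripped of the dashboard and SDKs you'd be paying the vendor for:

```python
import hashlib

def flag_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministic percentage rollout. Hashing the flag+user pair into a
    0-99 bucket means the same user always lands in the same bucket, so
    raising rollout_percent only ever turns users ON, never off-then-on."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent
```

Everything else (the admin UI, the audit log, per-environment overrides) is exactly the kind of thing that used to justify the monthly bill and is now an afternoon with Claude.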
The caveats are real. Enterprise compliance, SOC2 audits, massive-scale reliability, multi-platform SDKs — these still favor vendors. If you're Netflix, you're not rolling your own feature flag system. But if you're most companies, the math that pushed you onto the big SaaS tools five years ago doesn't necessarily apply anymore.
Run the Math
I need to say something that cuts against where a lot of AI-dev conversation is going, which is "build agents for everything."
Agents are great. I love them. I've built them. But they cost real money to build, validate, and run. Token bills for a production agent system are not trivial. Claude Opus — today's best model — is expensive. And there is a flavor of 2026 startup that is burning tremendous amounts of capital building autonomous agent systems at current prices, essentially running funded R&D on behalf of a future where those prices have dropped tenfold. Some of them will have the runway to survive until the economics flip. A lot of them won't.
For most teams, today, the highest-ROI move is not "build a fleet of agents." It's a skilled developer with a premium Claude Code subscription. Two hundred dollars a month of tool spend, amplifying a developer who knows what they're doing, beats a custom agent fleet on virtually every metric that matters short-term. The developer ships product. The fleet is still in development.
Pay for the developer, not the fleet — unless the fleet protects the customer.
That caveat is the one that took me a minute to get right. Because at CloneForce I also worked the numbers on a scenario where the fleet is worth it. Consider a runtime auto-fix agent: an API error triggers Claude, Claude diagnoses it live, the user gets their success response (or a graceful fallback), and the agent opens a PR so the next deploy carries the fix. There's latency and tokens and error-rate risk to model. I ran it.
The surprising answer is that the math works. The right denominator isn't "tokens vs. developer time." It's customer retention and bug durability. A user who never sees the bug doesn't churn — and the LTV saved dwarfs token spend at almost any realistic error rate. Claude's fix is a one-time cost that amortizes across every future user who would have hit the same bug. The developer comes in the next morning to a pre-written PR to review, not a PagerDuty page at 3am — which is also a win for the developer's day, and a win for keeping them around. The fix happens while the customer keeps driving.
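The retention math is easy to sanity-check yourself. A back-of-envelope model in Python; every input below is a made-up example for illustration, not CloneForce's actual numbers:

```python
def autofix_roi(requests_per_month: float, error_rate: float,
                churn_if_error: float, ltv: float,
                tokens_per_fix_usd: float) -> float:
    """Net monthly value of masking errors live: LTV preserved by users
    who never saw the bug, minus the token cost of each live diagnosis."""
    errors = requests_per_month * error_rate
    ltv_saved = errors * churn_if_error * ltv
    token_cost = errors * tokens_per_fix_usd
    return ltv_saved - token_cost

# Example inputs (all assumptions): 1M requests, 0.1% error rate,
# 2% of users who hit an error churn, $500 LTV, $0.50 of tokens per fix.
# errors = 1000; LTV saved = $10,000; token cost = $500; net ≈ $9,500/month.
net = autofix_roi(1_000_000, 0.001, 0.02, 500, 0.50)
```

Even at ten times the token price the model stays positive here, which is the point: the denominator that matters is churn, not tokens.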
It's the software version of a run-flat tire. The tire self-seals enough to get you home or to the shop. No tow truck. No shoulder of the highway in the rain. The real repair happens later, on the engineer's time — not the customer's.
This is the same pattern as the Washington Post deploy pipeline: invest once in infrastructure, save every time the infrastructure fires. The economics aren't about tokens. They're about what the infrastructure protects.
The other agent I already built — a bug-report agent at CloneForce — runs every morning and pays for itself the first week. It pulls the last 24 hours of logs, walks the code paths, correlates errors across backend, frontend, and database, identifies root causes, and drafts a suggested fix. Before it existed: customer hits a bug, files a ticket, ticket queues up, developer picks it up, investigates, fixes, ships. After it existed: most bugs got fixed before a ticket was ever filed. The economics there weren't close. They were obvious.
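The core of that morning agent is less exotic than it sounds: the expensive part is the reasoning over code paths, but the plumbing is ordinary log triage. A minimal sketch of the grouping step in Python, with an invented log format:

```python
import re
from collections import Counter

def triage(log_lines: list[str]):
    """Group error lines by a normalized signature so one root cause
    surfaces as one report instead of a thousand near-duplicates.
    The log format here is hypothetical."""
    sigs = Counter()
    for line in log_lines:
        if "ERROR" not in line:
            continue
        message = line.split("ERROR", 1)[1].strip()
        # Normalize volatile details (ids, counts) out of the message.
        sig = re.sub(r"\b\d+\b", "<n>", message)
        sigs[sig] += 1
    # Most frequent signatures first: these are the candidate root causes
    # the agent then walks the code paths for.
    return sigs.most_common()
```

In the real version, each signature then fans out to the expensive step: correlate across backend, frontend, and database, find the root cause, draft the fix.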
That's the pattern I recommend. Build agents selectively, where the ROI is unambiguous. Overnight audits. Incident triage. Customer-facing fault tolerance. High-volume, repeated work where the one-time build cost amortizes across thousands of uses. Everything else, default to a skilled developer plus Claude Code.
And expect this to change. Token costs have fallen roughly 4x over the last 18 months. The economics are moving. The real flip, I think, isn't "when prices drop" generally — it's when today's Opus-level capability costs today's Sonnet-level price. That's probably mid-2027. When it happens, agent architectures become default infrastructure and the calculus here changes. Write for today. Expect to rewrite in a year.
What We Get to Build Now
I want to close where I should have started, which is: what do we get to do now that we couldn't before?
Because the grief is real. The craft is shifting. The rituals are changing. But the thing that genuinely excites me about this era — and the thing I wish more people were writing about — is the builder's opportunity. The space of things that one developer can make has exploded.
At CloneForce, I watched a teammate hack together an internal tool in a day. He handed it to another colleague, who ran with it for a few weeks, and what came out the other side was something like a small Blender app — a multi-panel internal application with its own workspace, purpose-built for our work. Full UI. Logs. Build processes. Testing. Advanced controls. Solo work. That would have been a team-years project five years ago. It was done by one person in a few weeks.
I watched our build cycle for new features collapse from two weeks to two days over the course of a few months. I remember turning to a teammate who was newer to the work and saying something like, "guess what we get to do now." Because the question was no longer how to make the typing go faster. It was what we could build now that the typing wasn't the bottleneck. That's a better question. That's the question people got into software to answer in the first place.
I keep coming back to one idea we never implemented but that I think is genuinely buildable today — a runtime auto-fix. An API error hits a live endpoint. Instead of returning a 500 and letting the user churn, the system hands the trace to Claude, Claude identifies the bug, the system serves a graceful success or fallback response to the user, and Claude opens a PR with the fix for the next deploy. The user never sees the bug. The bug dies once. The engineer reviews and ships. I didn't get to implement it at CloneForce. But it's the kind of thing that is possible now, and wasn't possible eighteen months ago, and is the kind of thing that makes me excited about what's next.
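The request-path half of that system takes surprisingly little code around the agent. A sketch in Python; everything the agent does offline is out of frame, and all the names here are invented:

```python
import queue
import traceback

# Traces waiting for the offline agent to diagnose and turn into PRs.
fix_queue: queue.Queue = queue.Queue()

def run_flat(handler, fallback):
    """Wrap an endpoint: on failure, enqueue the trace for the agent and
    return the fallback immediately, so the user never sees the 500."""
    def wrapped(request):
        try:
            return handler(request)
        except Exception:
            fix_queue.put({"request": request, "trace": traceback.format_exc()})
            return fallback(request)
    return wrapped

# Hypothetical endpoint with a bug, and a degraded-but-successful fallback.
def export_handler(request):
    raise KeyError("missing column")  # the bug a user would have hit

def export_fallback(request):
    return {"status": "ok", "rows": [], "note": "partial result"}

safe_export = run_flat(export_handler, export_fallback)
```

The version described above goes further (Claude diagnoses live and tries to serve a real success, not just a fallback), but the run-flat wrapper is the load-bearing piece: the user keeps driving, the trace waits for the shop.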
Another one I keep thinking about: giving Claude full context of your development state. Today, that context lives scattered across Jira, Linear, Vercel, Sentry, GitHub, Slack, and the raw log streams of every service — and mostly in developers' heads. When someone decides to flip a feature flag, they're doing it with partial information and crossing their fingers. With Claude connected to all of it — including continuous log access, not just error reports — decisions about rollouts get smarter by default: "Epic A is done and behind a flag; ramp to 100% is safe today, but Epic B depends on it and needs A stable for 24 hours first." Or: "Logs from the payment service show 99th-percentile latency drifting up over the last hour. Hold the flip." The MCP servers for Linear, GitHub, Vercel, Sentry, and most log platforms already exist. The piece that doesn't exist yet is the unified, continuously-synced view of "here is the state of our development world" that Claude can query on every decision. Someone is going to build that. It's one of the more valuable unbuilt pieces of the 2026 stack.
The joy isn't polishing functions. Not anymore. The joy is shipping things that weren't shippable. Solving problems that weren't solvable by one person. The developers I see who are lit up right now aren't the ones who love every line of code. They're the ones who love what code can now do, and who are excited about building it.
The developers who are miserable right now — who show up in that Stack Overflow survey — are, I think, standing in the doorway still holding the old tools. I'm not unsympathetic to that. The old tools were good. I used them too.
But there's a bigger room on the other side of the doorway. And the work of getting there isn't about pretending the grief isn't real. It's about:
- Redesigning the rituals so they fit what AI actually does well.
- Building the harness so you can trust work you didn't type.
- Running the math so the agents you build actually pay off.
- Naming the pair-deploy kind of collaboration so teams don't worry someone's quietly undoing their work.
- And keeping the philosophy that got us here: remove the friction, and the developer gets to do the work that actually matters.
I don't have all the answers yet. I have a process I believe in enough to propose, and a deep belief that developer happiness still drives good work — and that the shape of that happiness is shifting from "the craft of writing code" to "the craft of deciding what to build and knowing it works." That's not a smaller craft. It might be a bigger one. It's certainly the one I'm practicing now.
The carpenter's rule still holds. Measure twice, cut once. The measuring moved to the spec, the harness, the test review, the sign-off. The cutting got cheap. The craft lives in the same place it always did — in the work you do before the saw touches the wood.
What we get to build from here is interesting. That's the whole point.
— Bill