Six Bugs My Parity Harness Caught in a Day


I built an AI-assisted parity harness to port an iOS app to React Native. The bugs it found weren't in any docs.

Bill John Tran

I've been writing a lot about harness engineering — the idea that when AI is doing most of the coding, the structure around the AI is the engineering. The model isn't the answer. The harness is.

A few weeks ago I decided to stop just writing about it and build one.

The target was my own iOS app, Nervous System Monitor. I wanted to see if I could port one screen — the Sleep section — from Swift to React Native using AI to do the actual coding, with a harness that would tell me whether the port was correct. Not "did the tests pass," but did it behave the same as the original.

I gave myself a day. The harness shipped. The port shipped on one screen. The reports it generated are still some of the most useful documentation I've produced from a single session of work, because the harness didn't just verify the port. It caught six bugs I would not have caught any other way, and it led to one realization I almost missed.

I want to walk through those bugs, because I think they make the case for harness engineering better than any thesis I could write.

What the Harness Does

Quick version. The harness validates four things in parallel after every run:

  • Visual — screenshot the iOS app, screenshot the React Native app, pixel-diff them
  • Behavioral — capture the iOS accessibility tree, capture the React Native one, diff structurally
  • Data — both apps export their state as JSON; compare key-by-key
  • Golden — for pure-logic functions, generate (input, expected) fixture pairs from the Swift original, run the TypeScript port against the same fixtures

When all four oracles report green, parity is real. When one fails, the agent reads the diff and iterates.
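As an illustration of the data oracle's key-by-key comparison, here is a minimal sketch. The `Json` type, the `DataMismatch` shape, and the `diffState` name are assumptions made for this example; the real export format isn't shown in this post.

```typescript
// Hypothetical shape for the state each app exports as JSON.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

interface DataMismatch {
  path: string;
  truth: Json | undefined; // undefined means the key is absent on that side
  candidate: Json | undefined;
}

function isPlainObject(v: Json | undefined): v is { [key: string]: Json } {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Walk both states key by key and collect every path where they differ.
function diffState(
  truth: Json | undefined,
  candidate: Json | undefined,
  path: string = "$"
): DataMismatch[] {
  if (isPlainObject(truth) && isPlainObject(candidate)) {
    // Union of keys from both sides, so a key missing on either side is reported.
    const keys = Object.keys(truth);
    for (const k of Object.keys(candidate)) {
      if (keys.indexOf(k) === -1) keys.push(k);
    }
    keys.sort();
    let out: DataMismatch[] = [];
    for (const key of keys) {
      out = out.concat(diffState(truth[key], candidate[key], `${path}.${key}`));
    }
    return out;
  }
  // Primitives and arrays are compared by their serialized form.
  if (JSON.stringify(truth) === JSON.stringify(candidate)) return [];
  return [{ path, truth, candidate }];
}

// One mismatch is reported, at path "$.grade".
const mismatches = diffState({ score: 94, grade: "A" }, { score: 94, grade: "B" });
```

Reporting the path of each mismatch, rather than a single boolean, is what gives the agent something concrete to read before the next iteration.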

The Swift app is the parity oracle. It's never modified. The harness is what mediates between "AI generated some React Native code" and "the React Native code behaves like the Swift code."
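The golden oracle is the easiest of the four to picture in code. The sketch below assumes fixtures arrive as plain (input, expected) pairs exported from the Swift side; `Fixture`, `runGolden`, and the `gradeFor` port with its thresholds are all invented for illustration.

```typescript
// Hypothetical fixture shape: one (input, expected) pair exported from the Swift original.
interface Fixture<I, O> {
  name: string;
  input: I;
  expected: O;
}

interface GoldenFailure<O> {
  name: string;
  expected: O;
  actual: O;
}

// Run the TypeScript port against every fixture and collect the mismatches.
function runGolden<I, O>(fixtures: Fixture<I, O>[], port: (input: I) => O): GoldenFailure<O>[] {
  if (fixtures.length === 0) {
    // An empty fixture set proves nothing, so treat it as an error rather than a pass.
    throw new Error("no golden fixtures loaded");
  }
  const failures: GoldenFailure<O>[] = [];
  for (const fixture of fixtures) {
    const actual = port(fixture.input);
    if (JSON.stringify(actual) !== JSON.stringify(fixture.expected)) {
      failures.push({ name: fixture.name, expected: fixture.expected, actual });
    }
  }
  return failures;
}

// Illustrative port under test. The thresholds are invented for this example.
function gradeFor(score: number): string {
  if (score >= 90) return "A";
  if (score >= 80) return "B";
  if (score >= 70) return "C";
  return "D";
}

const sleepFixtures: Fixture<number, string>[] = [
  { name: "high score", input: 94, expected: "A" },
  { name: "middling score", input: 72, expected: "C" },
];
const goldenFailures = runGolden(sleepFixtures, gradeFor); // empty when the port agrees
```

Because the expected values come from the Swift original and are never edited by hand, a failure here points at the port, not at the fixture.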

That's the whole pattern. Now the bugs.

Bug 1: accessible={true} Silently Collapses the Tree

I set accessible={true} on a container View in the React Native code because I thought it would help the screen reader. It did the opposite — and it broke the harness.

What I didn't know: setting accessible={true} on a parent View in React Native tells iOS to collapse all the children into a single accessibility element. From the user's perspective, the screen looked fine. From the screen reader's perspective, the children were gone. And from Maestro's perspective — the test driver the harness uses to read the screen — the children were also gone.

The text "Sleep" was plainly on screen when I eyeballed it. assertVisible: "Sleep" still failed in Maestro.

This is the kind of bug that ships and stays shipped. It looks right. It tests right against any visual test. It fails silently in production accessibility. The harness caught it because the harness checks the accessibility tree as a first-class oracle, not as an afterthought.

Lesson: never set accessible={true} on container Views in React Native. Use testID instead, which React Native maps to accessibilityIdentifier on iOS and which doesn't collapse the subtree.
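To make the collapse concrete, here is a toy model of it. This is not how iOS or React Native actually computes the accessibility tree; `ViewNode` and `exposedElements` are hypothetical names, and the only rule modeled is the one described above: an accessible parent becomes a single element and its children disappear. The identifier is written as React Native's testID prop.

```typescript
// Simplified stand-in for a rendered React Native view tree.
interface ViewNode {
  text?: string; // text content, for Text-like leaves
  accessible?: boolean; // mirrors the accessible prop
  testID?: string; // mirrors the testID prop; it does not affect grouping in this model
  children?: ViewNode[];
}

// Collect all text under a node, in document order.
function collectText(node: ViewNode): string[] {
  const own = node.text !== undefined ? [node.text] : [];
  const nested = (node.children ?? []).map(collectText);
  return own.concat(...nested);
}

// Toy rule: an accessible node is exposed as ONE element and hides its descendants.
function exposedElements(node: ViewNode): string[] {
  if (node.accessible) return [collectText(node).join(" ")];
  const own = node.text !== undefined ? [node.text] : [];
  const nested = (node.children ?? []).map(exposedElements);
  return own.concat(...nested);
}

const rows: ViewNode[] = [{ text: "Sleep" }, { text: "94" }];

// Container marked accessible: a driver sees a single merged element.
const collapsedTree = exposedElements({ accessible: true, children: rows }); // ["Sleep 94"]

// Container tagged with an identifier only: each child stays individually visible.
const openTree = exposedElements({ testID: "sleep-section", children: rows }); // ["Sleep", "94"]
```

In the collapsed case there is no element whose text is exactly "Sleep", which is why a text assertion fails even though the word is on screen.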

Bug 2: Concatenated Text Nodes Break Exact-Match Assertions

The Swift original showed 94 A as one visual unit — a score and a grade letter. I rendered it in React Native as:

<Text>{f.scoreOverride}{'  '}{f.gradeOverride}</Text>

Visually identical. From the accessibility tree's perspective, this is one Text node containing "94 A". So assertVisible: "94" failed, because Maestro looks for an exact text match and the only node with 94 in it was actually "94 A".

I split them into two <Text> nodes. The harness went green.

This is a class of bug I'd never have noticed without the behavioral oracle. The visual diff would have been fine. The data layer would have been fine. The accessibility tree was the only place the bug lived.

Lesson: each assertable value needs its own Text node. Don't concatenate values into one string with whitespace, no matter how natural it looks.
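The failure mode is easy to reproduce with a small model of whole-string matching. This sketch assumes the selector must match a node's entire text, as described above; `isVisible` is a hypothetical helper written for this example, not Maestro's implementation.

```typescript
// Toy model of a whole-string text selector: the pattern must match the entire node text.
function matchesWholeText(nodeText: string, pattern: string): boolean {
  return new RegExp(`^(?:${pattern})$`).test(nodeText);
}

// True when at least one node's full text matches the pattern.
function isVisible(nodeTexts: string[], pattern: string): boolean {
  return nodeTexts.some((text) => matchesWholeText(text, pattern));
}

const concatenatedNodes = ["94 A"]; // one Text node holding score and grade
const splitNodes = ["94", "A"]; // one Text node per assertable value

const concatenatedHit = isVisible(concatenatedNodes, "94"); // false
const splitHit = isVisible(splitNodes, "94"); // true
const wildcardHit = isVisible(concatenatedNodes, "94.*"); // true, but the assertion no longer pins the value
```

The wildcard variant passes, but it would also pass for "940 A", so splitting the nodes keeps the assertion exact.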

Bug 3: Maestro 2.5.0 Changed Where Screenshots Land

The harness uses Maestro's takeScreenshot step to capture screens for the visual oracle. In Maestro 2.5.0, the behavior changed: takeScreenshot: <name> now writes <name>.png to the current working directory, not into ~/.maestro/tests/<run-id>/screenshots/ where the comparator was looking.

The comparator silently couldn't find the screenshot. The visual oracle reported "no candidate image" instead of "image looks wrong."

This is a Maestro behavior change that isn't loud in the release notes. The harness caught it because the harness fails closed: if any oracle's input is missing, parity fails. A flow that quietly stopped producing screenshots would have looked like a passing run otherwise.

Lesson: name the screenshot step to match the truth filename so the comparator lookup is trivial. And design oracles to fail closed when their inputs go missing.
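A fail-closed visual oracle might look roughly like the sketch below. The loader and diff functions are injected so the example stays self-contained; `visualOracle`, `VisualResult`, and the default threshold are assumptions for illustration, not the harness's actual code.

```typescript
interface VisualResult {
  status: "pass" | "fail";
  reason: string;
}

// Loaders return null when the image file does not exist (hypothetical contract).
type ImageLoader = (screenName: string) => Uint8Array | null;

function visualOracle(
  screenName: string,
  loadTruth: ImageLoader,
  loadCandidate: ImageLoader,
  mismatchRatio: (truth: Uint8Array, candidate: Uint8Array) => number,
  threshold: number = 0.01 // invented default for this sketch
): VisualResult {
  const truth = loadTruth(screenName);
  if (truth === null) {
    return { status: "fail", reason: `no truth image for ${screenName}` };
  }
  const candidate = loadCandidate(screenName);
  if (candidate === null) {
    // Missing input is a failure, never a skip.
    return { status: "fail", reason: `no candidate image for ${screenName}` };
  }
  const ratio = mismatchRatio(truth, candidate);
  return ratio <= threshold
    ? { status: "pass", reason: `mismatch ${ratio} within ${threshold}` }
    : { status: "fail", reason: `mismatch ${ratio} exceeds ${threshold}` };
}
```

Because the same `screenName` drives both lookups, naming the screenshot step after the truth file removes any path-mapping logic that could drift.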

Bug 4: maestro hierarchy --output Doesn't Exist Anymore

Same release. The behavioral oracle was using maestro hierarchy --output <path> to dump the accessibility tree as JSON. That flag was removed in Maestro 2.5.0. The command silently ignored it and wrote the tree to stdout instead.

The comparator had a || true swallowing the error, which meant the tree diff just quietly skipped. Every parity report was reading "behavioral diff: SKIPPED" and I wasn't reading it carefully enough.

The harness caught it because the parity check failed overall — visual was failing for unrelated reasons — and I went looking. If visual had been passing, I'd have shipped a "fully green" report with the behavioral oracle silently inert. That's an outcome the harness shouldn't allow.

Lesson: never || true the output of an oracle. If an oracle can't run, the parity check should know about it, not paper over it.
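One way to encode that rule is in the place where oracle results are combined. In this sketch, anything other than an explicit PASS blocks parity, including SKIPPED and an oracle that never reported. The `parityVerdict` name and the report shape are assumptions for illustration.

```typescript
type OracleStatus = "PASS" | "FAIL" | "SKIPPED";

const REQUIRED_ORACLES = ["visual", "behavioral", "data", "golden"] as const;
type OracleName = typeof REQUIRED_ORACLES[number];

interface Verdict {
  pass: boolean;
  blockers: string[];
}

// Parity passes only when every required oracle explicitly reported PASS.
function parityVerdict(report: Partial<Record<OracleName, OracleStatus>>): Verdict {
  const blockers: string[] = [];
  for (const oracle of REQUIRED_ORACLES) {
    const reported = report[oracle];
    if (reported !== "PASS") {
      // SKIPPED and missing entries are treated exactly like FAIL.
      blockers.push(`${oracle}: ${reported ?? "MISSING"}`);
    }
  }
  return { pass: blockers.length === 0, blockers };
}

const inertBehavioral = parityVerdict({
  visual: "PASS",
  behavioral: "SKIPPED",
  data: "PASS",
  golden: "PASS",
});
// inertBehavioral.pass is false; blockers is ["behavioral: SKIPPED"]
```

With this shape, a swallowed error can still produce a SKIPPED entry, but it can no longer produce a green report.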

Bug 5: Repeated Automation Runs Crash the iOS Simulator

After a few back-to-back parity runs, the iOS Simulator's SpringBoard process segfaulted with XCTAutomationSession initWithAccessibilityFramework. Maestro started timing out. The harness reported "Maestro flow failed" even when the React Native code was fine.

This one isn't a bug in my code. It's a quirk of how iOS Simulator handles repeated automation sessions, and it's been reported elsewhere — but I didn't know about it until I hit it. The fix is to shut down and re-boot the sim between sessions:

xcrun simctl shutdown booted && xcrun simctl boot <device-id>

I added that to the harness's pre-run script. False failures dropped to zero.

Lesson: the harness is only as reliable as the environment under it. When you're driving real hardware (or real simulators), expect environment failures and design the harness to recover from them rather than report them as your code's failures.
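The fix above resets the simulator before every run. A related option, sketched here, is to classify failures and retry only the environmental ones after a reset. The `FlowOutcome` variants and the `runWithRecovery` helper are hypothetical; the two callbacks stand in for shelling out to Maestro and to xcrun simctl.

```typescript
type FlowOutcome =
  | { kind: "ok" }
  | { kind: "assertion_failed"; detail: string } // the port is wrong
  | { kind: "environment"; detail: string }; // timeout, simulator crash, driver error

// Retry only environment failures, resetting the simulator before each retry.
function runWithRecovery(
  runFlow: () => FlowOutcome,
  resetSimulator: () => void,
  maxResets: number = 1
): FlowOutcome {
  let outcome = runFlow();
  let resets = 0;
  while (outcome.kind === "environment" && resets < maxResets) {
    resetSimulator();
    resets += 1;
    outcome = runFlow();
  }
  // An environment failure that survives the retries is still returned as
  // "environment", so it is never reported as a failure of the port itself.
  return outcome;
}
```

Keeping assertion failures out of the retry path matters: retrying those would hide real regressions behind flakiness handling.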

Bug 6: A Truth-Directory Naming Convention I Forgot About

The comparator filters truth directories by name.startsWith('nsm-') — it expects each captured iOS state to live under truth/<feature>/nsm-<sha>/. I created a new truth set as truth/sleep-section/3949049/ and the comparator silently couldn't find it.

This is a "documentation drifted from the code" bug. The convention was real, written in the design doc, enforced by a string check, and I forgot about it when I ran the capture script. The harness reported no truth, treated the parity check as "nothing to compare against," and would have reported green if I hadn't been watching.

Lesson: if you're going to filter directories by name, enforce the convention at capture time, not at compare time. The capture script should refuse to create a truth directory that the comparator can't find. The harness now does.
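A capture-time guard can be very small. The sketch below shares one prefix constant between the capture side and the compare side; `truthDirFor`, `createTruthDir`, and the injected `mkdir` callback are hypothetical names, and only the nsm- prefix and the truth/<feature>/nsm-<sha>/ layout come from the convention described above.

```typescript
const TRUTH_PREFIX = "nsm-";

// Same rule the comparator applies: equivalent to dirName.startsWith("nsm-").
function isDiscoverable(dirName: string): boolean {
  return dirName.indexOf(TRUTH_PREFIX) === 0;
}

// Build the conventional path so callers never assemble it by hand.
function truthDirFor(feature: string, sha: string): string {
  if (feature.length === 0 || sha.length === 0) {
    throw new Error("feature and sha are required");
  }
  return `truth/${feature}/${TRUTH_PREFIX}${sha}`;
}

// Refuse to create a truth directory the comparator would later ignore.
function createTruthDir(feature: string, dirName: string, mkdir: (path: string) => void): string {
  if (!isDiscoverable(dirName)) {
    throw new Error(`refusing to create "${dirName}": name must start with "${TRUTH_PREFIX}"`);
  }
  const path = `truth/${feature}/${dirName}`;
  mkdir(path);
  return path;
}

const conventionalPath = truthDirFor("sleep-section", "3949049");
// "truth/sleep-section/nsm-3949049"
```

Using one predicate on both sides means the convention lives in code once, instead of in a design doc and a string check that can drift apart.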

Why These Are the Bugs That Matter

Look at the six bugs again:

  1. Accessibility subtree collapse
  2. Text node concatenation breaking assertions
  3. Maestro 2.5.0 screenshot path change
  4. Maestro 2.5.0 hierarchy flag removal
  5. Simulator SpringBoard crash on repeated automation
  6. Truth directory naming convention drift

None of these would have surfaced from running the usual test suites. The React Native unit tests passed. The Swift unit tests passed. The visual diff was misleading at first because of dimension mismatches. The TypeScript compiler was happy. ESLint was happy.

These are bugs that live in the seams — between the AI-generated code and the runtime environment, between two versions of a test tool, between a directory naming convention and the script that reads it. The harness caught them because the harness watches the seams. That's its whole job.

This is the class of bug that traditional tests can't catch — because traditional tests assume the environment is stable. In AI-assisted development, the environment is part of the system. The harness has to know it.

The Realization I Almost Missed

There's a seventh thing the harness taught me. It wasn't a bug. It took me longer to see than the bugs did, and I think it matters more.

After I'd fixed the six bugs, the visual oracle was still reporting 99.32% mismatch. That number is technically correct — the iOS truth screenshot is 1290×2796, the React Native candidate is 1170×2532, and odiff is comparing pixel-by-pixel at matching coordinates. Two different-sized images of different things will always come out close to 100% red. The diff image itself was a wall of useless color.
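A cheap guard in front of the pixel diff would have labeled this as a size problem instead of a 99.32% mismatch. The sketch below is an assumption about how such a pre-check could look; `checkDimensions` and the `PreDiff` shape are invented for this example and are not part of odiff.

```typescript
interface Size {
  width: number;
  height: number;
}

type PreDiff =
  | { comparable: true }
  | { comparable: false; reason: string; scaleX: number; scaleY: number };

// Decide whether a coordinate-by-coordinate pixel diff is meaningful at all.
function checkDimensions(truth: Size, candidate: Size): PreDiff {
  if (truth.width === candidate.width && truth.height === candidate.height) {
    return { comparable: true };
  }
  return {
    comparable: false,
    reason:
      `size mismatch: truth ${truth.width}x${truth.height}, ` +
      `candidate ${candidate.width}x${candidate.height}`,
    // Ratios hint at whether the two captures came from different device classes.
    scaleX: truth.width / candidate.width,
    scaleY: truth.height / candidate.height,
  };
}

const sizeCheck = checkDimensions(
  { width: 1290, height: 2796 },
  { width: 1170, height: 2532 }
);
// sizeCheck.comparable is false, so the pixel diff should not run on this pair
```

The check still fails the oracle, which keeps it fail-closed, but the reason it reports is one a reader can act on.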

But I had the three images sitting next to each other on screen — truth, candidate, diff — and within five minutes of just looking, I had the gaps named. Chart too tall. Lane labels on the wrong side. Score badge unstyled. Chevron missing on the Duration row. The semantic visual comparison odiff couldn't do, I was doing in my head as fast as I could scroll.

Out of curiosity I pasted the same three images into Claude and asked what it saw. Same answer. Same gaps. Same speed.

That's when it clicked. I was doing the visual diff better than the tooling. So was Claude. And the tools — Maestro, odiff — weren't wrong. They were incomplete. They couldn't express the layer humans actually use when reviewing UI: semantic visual judgment. "Same data, different layout" isn't a pixel count and it isn't a text assertion. It sits between them, and the established tooling doesn't have a place for it.

The bigger point, the one I'm still sitting with: old tools aren't necessarily wrong tools, but they can be missing a layer that wasn't available when they were built. Maestro is still the tool I need for what it does — it drives the simulator, takes the screenshots, runs the assertions, produces the artifacts no other tool produces. odiff is fine at what it does. AI doesn't replace either of them. AI sits between Maestro's output and the next code change, using the artifacts as context for the next iteration. The pipeline isn't "four oracles in parallel." It's Maestro as the substrate, AI as the cognition layer that turns Maestro's artifacts into code-ready guidance. Better context in, better code out.

And the further part: this isn't a one-off. It belongs in the loop. Every Maestro run already produces structured outputs — screenshots, JUnit results, hierarchy data, assertion failures. Today, all of that lands in result.json and stops there, waiting for a human to interpret it. The natural next step for this harness is to bake that interpretation into the loop itself. Same loop, but the analysis step that used to be human-only becomes a real layer in the harness.

If I'd just read pass: false, mismatch: 99.32% and moved on, I'd have missed it. The insight came from sitting with the artifacts.

The Wrong Optimization

That shift — AI as the cognition layer between substrate and code — is the thing the agent paradigm tries to flatten.

Most of the published work in this space is about agents — autonomous AI systems trying to do whole jobs end-to-end. Multi-agent setups with planners, coders, critics, evaluators, all trying to remove the human from the loop. Google's TensorFlow-to-JAX migration paper built one and explicitly declined to validate numerical parity because it was "too slow." Microsoft's Oracle-to-PostgreSQL agent runs a self-healing critic loop on a single oracle. The literature is about cranking the autonomy dial up.

I think that's the wrong optimization.

AI iteration is expensive. Tokens cost money. Retries cost time. Bad directions cost both. A multi-agent system trying to simulate human judgment with more AI is paying for it in tokens. Meanwhile the human in the loop is the cheapest, fastest reasoning resource in the system. Human judgment costs nothing per use. No bill, no rate limit, no retries. And the human can think while the AI is running — parallel cognition, not serial. One smart redirect saves ten expensive iterations.

The numbers back this up in ways that surprised me. Claude Code Max at the $200/month tier gives you something on the order of 240 to 480 hours of Sonnet activity per week, plus 24 to 40 hours of Opus. That's flat. No per-use bill. A developer on that plan pays one number and works for a month.

Now look at the other side. In April 2026 Anthropic banned third-party agent frameworks (OpenClaw and similar) from running on consumer Claude subscriptions. Their reason: a single autonomous agent running unattended for a day was consuming somewhere between $1,000 and $5,000 in equivalent API costs. Anthropic's own estimate, in their own announcement. Boris Cherny, who runs Claude Code at Anthropic, put it plainly: "Anthropic's subscriptions weren't built for the usage patterns of these third-party tools."

That's the gap. A human-in-the-loop using Claude Code Max pays $200 a month flat, which is under $7 a day. An autonomous agent on the same underlying model can burn $1,000 to $5,000 in a single day. That is roughly 150 to 750 times the daily cost, more than two orders of magnitude, on the same model. Anthropic, who builds the model and prices it, concluded the agent economics didn't pencil at a flat subscription price. They banned the workflow. They did the math.

There are similar reports from the field. A financial services team described $47,000 in token costs over three days after 23 subagents kept analyzing code unattended. Replit bills failed checkpoints. Devin bills ACUs whether the work was useful or not. The pattern repeats: agents pay full retail for every wrong turn. Humans pay nothing extra for redirects.

The harness pattern keeps the human in the loop because the human is doing the cheap part — the reasoning — and the AI is doing the expensive part — the implementation. The harness exists to make the human's review trustable, not to replace it. That's the inversion the agent literature misses.

There's also a humbler reason. We're still better at reasoning than AI today. That gap is closing. It isn't closed. Token cost isn't going to close it this year. If AI gets sharply better at reasoning, the harness pattern still works — it just makes the human's job easier, not optional.

And there's the part I think we don't talk about enough. The agent paradigm is, in practice, an effort to take humans out of the equation. The companies driving it have profit incentives that don't line up with the social ones. Layoffs are real. Government hasn't worked out how to handle the transition. Most engineers I know don't want to be replaced. We want better tools. Tools that make our reasoning more leveraged, not optional.

I think the AI-as-developer paradigm — human directs, AI implements, harness verifies — is the one most engineers will actually live in. Not because we can't push autonomy further. Because we shouldn't.

I hit this on a real port in a real codebase, with a real production iOS app on the other side of the parity check. The harness didn't just catch six bugs. It changed how I think about what "tests passing" means when the AI is the one writing the code.

The Pattern Transfers

Same four oracles work on a server-side stack, with different inputs:

  • Visual — drop it or replace with API contract testing
  • Behavioral — replace the accessibility tree with the Datadog/OpenTelemetry trace tree
  • Data — same as mobile (response and DB write signatures)
  • Golden — same as mobile (pure-logic input/output fixtures)

Trace-shape parity is, I think, the killer oracle for server-side AI-assisted dev. Tests catch correctness. Trace shape catches operational equivalence. The AI refactors a controller, every unit test passes, but the trace tree gained an extra database call — that's the regression class invisible to tests but loud in production.
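Since I haven't built this yet, the following is only a sketch of what a trace-shape comparison could look like. It reduces each trace to a count of spans per name, which is a coarse notion of shape; the `Span` type, the span names, and `traceShapeDiff` are all hypothetical and not tied to any tracing vendor's API.

```typescript
// Minimal stand-in for an exported trace tree.
interface Span {
  name: string;
  children: Span[];
}

// Count how many spans of each name appear anywhere in the tree.
function countSpans(root: Span, counts: Record<string, number> = {}): Record<string, number> {
  counts[root.name] = (counts[root.name] ?? 0) + 1;
  for (const child of root.children) countSpans(child, counts);
  return counts;
}

// List every span name whose count changed between baseline and candidate.
function traceShapeDiff(baseline: Span, candidate: Span): string[] {
  const before = countSpans(baseline);
  const after = countSpans(candidate);
  const names = Object.keys(before);
  for (const n of Object.keys(after)) {
    if (names.indexOf(n) === -1) names.push(n);
  }
  const changes: string[] = [];
  for (const n of names.sort()) {
    const b = before[n] ?? 0;
    const a = after[n] ?? 0;
    if (a !== b) changes.push(`${n}: ${b} -> ${a}`);
  }
  return changes;
}

const dbCall: Span = { name: "db.query", children: [] };
const baselineTrace: Span = { name: "GET /sleep", children: [dbCall] };
const refactoredTrace: Span = { name: "GET /sleep", children: [dbCall, dbCall] };

const shapeChanges = traceShapeDiff(baselineTrace, refactoredTrace);
// ["db.query: 1 -> 2"]
```

A fuller version would likely compare parent-child structure as well as counts, but even this coarse diff surfaces the extra database call that unit tests would not.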

I haven't built the server-side version yet. It's the next harness on my list. If you've built one, I'd love to compare notes.

The thesis I keep coming back to is simple. When the model writes the code, the structure around the model is the engineering. The harness is where the judgment lives. Six bugs in a day is what that looked like for one port. I think it scales.

— Bill John Tran

© 2026 Bill John Tran. All rights reserved.
