AI QA on a legacy codebase: why we picked Playwright over our own MCP
We inherited a real estate portal that thousands of brokers used every day and a test suite that covered roughly nothing the user actually touched. A year later the gate is Playwright, the suite runs in roughly half the time it used to, and the browser-driving MCP I co-wrote isn't in it. Here's why.
The starting state
If you've ever opened an MLS portal you know what we were working with: a decade of forms, listing wizards, search filters, modals that open modals, and one specific dropdown that everyone in the office had a story about. Unit coverage hovered around 12%. Most of it sat on utility functions nobody had ever broken. The flows that mattered (listing creation, broker invite, search-to-offer) had no automated coverage at all. QA happened on Tuesdays, by hand.
When we started looking at agent-driven QA, the pitch was hard to argue with. Hand the agent a running build, tell it what a listing should look like, let it click around. We'd just shipped Glance, an MCP server that gives Claude Code a real browser, and the demos were impressive: it'd log in, find the broken filter, screenshot the diff, write up the bug. For exploratory testing it felt like magic.
Where browser-driving agents stop being magic
The problem with using an MCP to run regressions is that the loop is non-deterministic exactly where regressions need to be boring: selector strategy, retry behavior, and what counts as a pass. The same prompt against the same build can take a different path through the UI depending on what the model decides looks "right." That's fine when you're exploring. It's a problem when you're gating a deploy.
Three concrete failures kept showing up in our first month:
- A flaky selector strategy. The agent would click a button by accessible name on one run, then by index on the next. When the team renamed a label, half the runs noticed and half didn't.
- Runaway token cost. It scaled with the number of regression cases, not the size of the change. Re-running a couple hundred scenarios after a one-line CSS fix made nobody happy.
- No useful failure artifact. When a run failed, we had a transcript. What we wanted was a trace, a video, a network log, and a deterministic seed.
An MCP-driven agent is great the first time you do something and worse than a script every time after.
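To put rough numbers on the cost point, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: `estimateAgentRunUsd` is a hypothetical helper, and the tokens per scenario, retry rate, and price are placeholders, not figures we measured.

```typescript
// Hypothetical inputs for one agent-driven regression run.
interface AgentRunEstimate {
  scenarios: number;           // regression cases replayed per run
  tokensPerScenario: number;   // average tokens one scenario consumes (placeholder)
  attemptsPerScenario: number; // 1 means no retries; 1.5 means half the cases retry once
  usdPerMillionTokens: number; // placeholder price, not a real rate card
}

// Cost is scenarios x tokens x attempts. The size of the diff is not an input.
function estimateAgentRunUsd(e: AgentRunEstimate): number {
  const totalTokens = e.scenarios * e.tokensPerScenario * e.attemptsPerScenario;
  return (totalTokens / 1_000_000) * e.usdPerMillionTokens;
}

// A one-line CSS fix still replays every scenario.
const oneLineCssFixUsd = estimateAgentRunUsd({
  scenarios: 200,
  tokensPerScenario: 50_000,
  attemptsPerScenario: 1.5,
  usdPerMillionTokens: 5,
});

console.log(oneLineCssFixUsd);
```

Nothing in the function knows how big the change was, which is the whole complaint: the bill is the same for a one-line fix and a rewrite.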
White-box tests in the middle
Before we touched the browser layer again we did something boring: we wrote unit tests. Not for the whole app, just for the four or five modules the agent kept tripping over. The listing-form reducer. The search query builder. The permission resolver. Each one had a clear input and output and a small enough surface that unit tests were obvious.
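As a sketch of what "clear input and output" means for a reducer, here is a framework-free, table-driven example. The state shape, action names, and validation rules are hypothetical stand-ins, not our real listing-form reducer, and in a real suite the cases would live in whatever test runner the project already uses.

```typescript
// Hypothetical state and actions; the real reducer is not shown in this post.
type ListingFormState = {
  address: string;
  price: number | null;
  errors: string[];
};

type ListingFormAction =
  | { type: 'setAddress'; address: string }
  | { type: 'setPrice'; price: number }
  | { type: 'validate' };

const initialState: ListingFormState = { address: '', price: null, errors: [] };

function listingFormReducer(
  state: ListingFormState,
  action: ListingFormAction,
): ListingFormState {
  switch (action.type) {
    case 'setAddress':
      return { ...state, address: action.address.trim() };
    case 'setPrice':
      return { ...state, price: action.price };
    case 'validate': {
      const errors: string[] = [];
      if (state.address === '') errors.push('address is required');
      if (state.price === null || state.price <= 0) errors.push('price must be positive');
      return { ...state, errors };
    }
  }
}

// Each case replays actions from the initial state and names the expected result.
const cases: { name: string; actions: ListingFormAction[]; expected: ListingFormState }[] = [
  {
    name: 'trims the address',
    actions: [{ type: 'setAddress', address: '  12 Example St ' }],
    expected: { address: '12 Example St', price: null, errors: [] },
  },
  {
    name: 'flags a missing price',
    actions: [{ type: 'setAddress', address: '12 Example St' }, { type: 'validate' }],
    expected: { address: '12 Example St', price: null, errors: ['price must be positive'] },
  },
  {
    name: 'accepts a complete form',
    actions: [
      { type: 'setAddress', address: '12 Example St' },
      { type: 'setPrice', price: 450000 },
      { type: 'validate' },
    ],
    expected: { address: '12 Example St', price: 450000, errors: [] },
  },
];

// Returns the names of failing cases; an empty array means every case passed.
// JSON comparison is enough here because the reducer preserves key order.
function runCases(): string[] {
  return cases
    .filter((c) => {
      const actual = c.actions.reduce<ListingFormState>(listingFormReducer, initialState);
      return JSON.stringify(actual) !== JSON.stringify(c.expected);
    })
    .map((c) => c.name);
}

console.log(runCases());
```

The table is the useful part. When we asked the agent to enumerate behaviors, a list of named cases like this is the shape that was easiest to review.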
The unit layer is what made the end-to-end layer cheap. Once the reducer had unit tests, the Playwright spec for "agent creates a listing" stopped needing to assert on twenty intermediate states. It could assert on the final API call and trust the unit suite to catch the rest.
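A spec in that style might look roughly like the sketch below, assuming Playwright Test. The route, field labels, button name, `/api/listings` endpoint, and payload shape are all hypothetical; only the pattern (wait for the final request, assert on its body) is the point.

```typescript
import { test, expect } from '@playwright/test';

test('agent creates a listing', async ({ page }) => {
  // Hypothetical route; a relative URL relies on baseURL being set in the config.
  await page.goto('/listings/new');

  // Hypothetical field labels.
  await page.getByLabel('Address').fill('12 Example St');
  await page.getByLabel('Price').fill('450000');

  // Start waiting before the click so the request cannot slip past us.
  const createRequest = page.waitForRequest(
    (req) => req.url().endsWith('/api/listings') && req.method() === 'POST',
  );
  await page.getByRole('button', { name: 'Publish listing' }).click();

  // One assertion on the final payload; intermediate form states are the
  // unit suite's job.
  const body = (await createRequest).postDataJSON();
  expect(body).toMatchObject({ address: '12 Example St', price: 450000 });
});
```

Locating by label and role pins the selector strategy in the spec itself, so a renamed label fails every run instead of half of them.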
We used the AI agent here, but as an author, not a runner. We pointed it at a module, asked it to enumerate the behaviors it could see in the code, and reviewed what it proposed. Some of the suggestions were wrong. Still, the ratio of "good first draft" to "I'd have been faster writing this myself" was high enough that the agent paid for itself by the end of the first week.
Why Playwright won the bottom layer
For the end-to-end layer we tried three things in parallel: Glance driving Chrome via MCP, a Cypress suite, and a Playwright suite. Cypress fell out first for unrelated reasons (cross-origin in our auth flow). The real comparison was Glance vs Playwright.
Playwright won on four axes:
- Determinism. A spec is a spec. The same input gives the same output. A regression failure points at a line of code, not a transcript.
- Speed. Parallel workers, each sharing one browser process across cheap, isolated per-test contexts, replaced our agent loop's serial click-and-wait. Our full regression run dropped by roughly half.
- Artifacts. When a spec fails in CI you get a trace you can open in the trace viewer, a video, a network log, and a DOM snapshot. The first time an on-call engineer opened a trace and found the bug without rerunning anything, the pushback stopped.
- Cost. A Playwright run costs CPU minutes. An MCP run costs tokens times retries. The math isn't close once you're running on every PR.
None of that is a knock on Glance. It's a knock on using a non-deterministic tool for a deterministic job.
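For reference, most of the speed and artifact behavior described above comes down to a few lines of configuration. This is a minimal sketch of a `playwright.config.ts`, not our actual file: the test directory, worker count, retry count, and base URL are placeholder assumptions.

```typescript
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  testDir: './e2e',                         // hypothetical location of the specs
  fullyParallel: true,                      // run tests within a file in parallel too
  workers: process.env.CI ? 4 : undefined,  // placeholder; undefined uses Playwright's default
  retries: process.env.CI ? 1 : 0,          // placeholder retry policy
  use: {
    baseURL: process.env.BASE_URL ?? 'http://localhost:3000', // hypothetical
    trace: 'retain-on-failure',             // trace holds network activity and DOM snapshots
    video: 'retain-on-failure',
    screenshot: 'only-on-failure',
  },
});
```

Keeping artifacts only on failure is a trade-off: passing runs stay light, and a failed run still leaves the on-call engineer something to open.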
Where the agent still earns its keep
We didn't throw the MCP layer away. Glance lives in two places now:
- Exploratory passes on new features. Before we write Playwright specs for a flow, the agent walks it. It finds the surprises a human tester would find, and we write the specs against the surprises instead of against the happy path.
- Bug reproduction. A customer report comes in. The agent reads the report, tries to reproduce it against staging, and either pastes the steps or tells us it can't. The Playwright spec, if one is warranted, comes after.
The order that actually worked
If you're staring at a legacy codebase with low coverage and an agent prompt that can sort of do QA, the order that worked for us was: white-box the modules that hurt, write deterministic E2E specs for the flows that gate revenue, and keep the agent around for the parts of testing that are still creative.
The Playwright suite halved our regression runtime. The part that changed the team was the unit layer in the middle. Boring, and the thing I'd do first next time.