Currently splitting my time between a master's in AI at Georgia Tech and shipping software at dynaConnections, where we build MLS tools, the databases and portals real estate agents use to list and sell homes.
Before that I prototyped blockchain settlement at PayPal. Earlier, I did embedded work at Fox Factory, designing PCBs and writing C firmware for Live Valve Neo, the wireless bike suspension reimagining what mountain biking looks like.
Lately I've been pretty deep in AI-assisted development. Write a spec, let the agents take a rough pass, then review what comes back. My home server is the playground for this, where any silly idea for a microservice goes from concept to deployed in a couple prompts. At work it's an AI agent that runs QA on a legacy MLS platform — reading the ticket, driving the browser, and checking the database and feed underneath.
In my free time I'm usually failing to lift weights, getting humbled by V3s, collecting artisan keycaps, or theorycrafting in tft.
A quant lab for normal people. Describe a trading idea in plain English and get a real backtest — the LLM emits a structured strategy DSL instead of code, and a pure-TypeScript engine runs it against free market data, benchmarked against buy-and-hold. BYO LLM key.
Failure hands you a jolt of motivation, and mistaking that jolt for progress is the trap. On why habits beat motivation, and the one good use for the surge before it fades.
Putting an AI agent in charge of QA for a legacy MLS platform. The hard part wasn't the agent — it was how it drives the browser (Glance, then Claude's built-in Chrome, then Playwright MCP) and making sure it never gets to decide what "correct" means.
Failure is a great motivator, and that's exactly why it's a trap. Every time something I care about falls through, I get a surge of resolve, and every time it's gone before it changes anything. The mistake was thinking the surge was the point. It isn't — but spent fast, it's good for exactly one thing.
After the email
This week it was an interview. I walked out sure I'd nailed it; the rejection landed before I got home. That happens. What's predictable is what came after — the urge to overhaul everything at once. Train harder, study more, finally read the stack of books I keep stepping around. I've been in this exact spot after every failure that stung, and the resolve has never once survived an ordinary Tuesday.
Bad fuel
The trap is treating that surge as the prize, as if the point of bombing the interview was the resolve that came after it. If that's your engine, your growth is outsourced to bad luck: you only move when something hurts, and you reset to zero the moment it stops. It's the least reliable fuel there is, because it drains exactly when the pain does — right when you'd need it to start working. (Cal Newport has made a version of this point for years: the routine runs whether the feeling shows up or not, which is why the people who don't wait to feel inspired out-ship everyone who does.)
What it's for
So the surge can't carry the work. It can carry the setup, though, and that's the entirety of its job: spend it, right now, on building the system that has to run once it's gone. You need some push to go make the habits in the first place — that's what the push is for, and nothing else.
So I'm not using this week's jolt to feel fired up. I'm using it to install structure while I still have the energy: training on the calendar, work hours set, entertainment decided in advance, the books moved somewhere I can't ignore them. Convert the feeling into a system before the feeling expires. The motivation pays for the construction, then it's free to leave.
The rules
When I got up I wrote the rules I'm installing on a note and stuck it where I can't avoid it. Not affirmations — rules, the kind that decide for me at the moments I won't decide well on my own.
Don't eat or drink alone. There's a time and a place for both, and on my own at the wrong hour isn't it.
Work hard first, then play hard. I can't have it all, but I can have the best of it.
Earn entertainment. It's a reward for when I'm too tired to work, not an escape, and it belongs with people, not with hiding.
Read the books. They aren't decoration; they're distilled, high-value input. Use what's already on the desk.
Make the change now, not the next time something hurts.
Past day one
Day one is easy, because day one still has the jolt in it. I called this "past day one" because the only day that proves anything is the first one where I don't feel like it and the system has to carry me anyway. That handoff, from motivation to habit, is the exact point where every previous version of this note got quietly ignored.
So I'm not tracking whether I feel motivated. That number's going to crash, and I already know it. I'm tracking whether the habit is still standing on the day the feeling isn't. Let's see if I make it past day one.
We spent the last stretch putting an AI agent in charge of QA for an MLS platform that thousands of brokers log into every day — a decade-old app of listing wizards, search filters, and modals that open modals. The agent does the work now: it reads the ticket, figures out what the change touches, writes the test cases, drives the browser, and checks the database and the feed underneath. The interesting part wasn't the agent. It was the unglamorous question every phase leaned on — how does the agent actually click through the app? — and we answered that one three times before it stuck.
What we inherited
If you've ever opened an MLS portal you know the shape of it: forms stacked on forms, a listing wizard with a dozen steps, search filters nobody fully remembers, and one specific dropdown everyone in the office has a story about. QA happened by hand, on a schedule. The flows that actually mattered — creating a listing, inviting a broker, pushing a record out to the data feed — had close to no automated coverage, and what coverage existed sat on the safe, boring utility code that never broke anyway.
So the goal wasn't "write more tests." It was to hand an agent a release ticket and have it do the QA the way a careful engineer would, end to end, and leave behind evidence a human could sign off on.
The pipeline
What we built isn't a regression suite. It's a pipeline the agent runs once per ticket. Give it a ticket key and it pulls the ticket into a working dossier, diffs the code change against our local clones to figure out the blast radius — which surfaces and fields the change can actually reach — writes a test plan and a set of cases, executes them against a running build, verifies the backend the UI was supposed to have changed (the database rows, the records on the feed), and writes a sign-off back to the ticket. A QA engineer approves at two gates: once on the plan, once on the results. Nothing moves on the agent's say-so alone.
One run per ticket, top to bottom. Steps 1–5 are the agent's; the two diamonds are human gates — nothing reaches the ticket until a QA engineer signs off on the plan, then on the results.
Underneath it sits the piece that took the most thought: a separate repo of behavioral docs — what each surface is supposed to do, which fields exist, who's allowed to see them, the business rules. We call it the vault, and it's the agent's reference for what "correct" means. I'll come back to why it's a separate thing and not just the code, because that turns out to be the whole game.
Two repos do the work. The vault is the spec — the independent source of what "correct" means; the engine reads it, drives the live app, and writes the plan, cases, and sign-off back to the ticket. Whatever it learns about a surface gets staged back into the vault, so the spec grows with every run.
Driving the browser
Every phase above bottoms out in the same primitive: the agent has to operate a real browser — log in, walk the wizard, read what's on the page, fill fields, and come back with proof of what happened. We tried three ways to give it that.
We started with Glance, an MCP server a friend and I had just shipped that hands Claude Code a real browser. For exploratory testing it's genuinely good — point it at staging, let it wander, and it'll turn up the broken filter and screenshot it for you. But a QA pipeline needs the same run to mean the same thing twice, and to leave behind something a human can audit without re-running it. What Glance left behind was a transcript of what the model decided to do. That's the right artifact for exploring and the wrong one for a gate. No knock on Glance — it's built for the open-ended case, and a per-ticket QA run is the opposite of open-ended.
The next thing to try was the one already in the box: Claude's built-in browser control. No extra server to run, no config — the agent can just drive Chrome. It didn't make the cut, and the reasons were all the ways QA differs from casual browsing. It works off what it can see rendered on screen, so on a dense legacy form it's slow and it misses things that aren't visually obvious. There's no structured snapshot of the page to assert against or to diff a baseline from — you get pixels, not a tree. We couldn't reliably hand it a saved login and have it start authenticated, couldn't capture the network traffic a feed test depends on, and couldn't pin it to a specific local build with managed test credentials. For "go look at this website" it's fine. For "produce the same evidence-backed result on every release," it wasn't the tool.
Where we landed was Playwright MCP — the Playwright browser exposed to the agent as a server, with its testing, storage, and network capabilities turned on. The difference that mattered: it gives the agent the accessibility tree, not a screenshot. The agent targets a field by its role and name the way a screen reader would, which holds up far better across a sprawling old UI, and that same tree is a structured artifact we can diff a page against later to catch a surface drifting out from under its spec. Saved auth means a run starts already logged in, with no credentials anywhere near the prompt. Network capture and screenshots come for free on every run, so a failure ships with a trace instead of a story. And it's the same engine we'd later promote a throwaway agent walk into — a durable, checked-in spec — so there's a path from "the agent did this once" to "this is covered for good."
The correction worth making, because the easy version of this story is wrong: we didn't replace the agent with a static test suite. The agent is still the one running the tests. Playwright MCP is just the substrate it drives. What changed across the three attempts wasn't whether an agent was in the loop — it was whether the thing under the agent could give a QA gate three boring necessities: determinism, real evidence, and a login.
Where "correct" comes from
The failure mode with an agent grading software is the obvious one: it reads the code, decides what the code does, then "verifies" that the code does that. It's a tautology, and it passes even when the code is wrong. So the rule the whole thing is built around is that the expected answer never comes from the implementation under test.
The live app is only ever "what is" — never "what's correct." Correct lives a layer up: the ticket's acceptance criteria, a product decision, or the vault. The agent doesn't get a vote on it.
Around that we layered the usual defense-in-depth idea, with one specific demand: every class of bug has to be catchable by at least two layers that fail for different reasons, so one blind spot doesn't reach production. A wrong fix gets caught both by a coverage check — did we even write a case for the path that changed? — and by asserting the same behavior in the UI and the database and the feed at once. A fix that quietly breaks something else gets caught by the code-derived blast radius and by diffing the page against its last known baseline. And when something slips through anyway, the job isn't just to fix it — it's to add the check that would have caught it, so the same bug can't escape twice. The vault grows every time the agent touches a surface, and a bug it finds gets written back as a permanent note rather than left in a ticket comment that evaporates.
Where it stands
I'll be honest about the state of it: some of those layers are solid and in daily use, and some are still scaffolding with a TODO on them. The design is further along than the build. But the spine — agent runs the ticket, Playwright MCP drives the browser, the oracle for "correct" lives above the code — is load-bearing and real, and it's run against actual release tickets, not a demo.
If I were starting this over, the part I'd protect from day one isn't the browser layer. That took three tries and the third one stuck; it's the part people ask about and it's the easy part. The hard part — the one that decides whether agent-driven QA is real or theater — is making sure the agent is never the one who gets to decide what "correct" means.