Code generation was the easy part

Contents

Why does verifying AI code cost more than writing it?
How do you point AI at a billion-line codebase?
Why did AI turn code review into a bottleneck?
How should teams budget for AI token spend?
How do you keep AI-generated code accurate at scale?
The thread running through all five

Key takeaways

Verification now costs more than writing the code. Teams spend more to check what AI produces than to produce it. The fix is to build the tests with AI, then take the model out of the loop.
You can’t fit a real codebase into a context window. The skill is choosing what to feed the model, not buying a bigger window.
Code review is where the work piled up. AI writes more code than people can review the old way, and senior engineers are stuck with the backlog.
Token budgets are forcing the discipline that adoption goals never did. Once spend is capped, teams finally ask whether every task needs a frontier model.
Accuracy comes from engineering rigor, not the model. Teams with strong CI/CD are pulling ahead. The rest are just making tech debt faster.

Every year I host this panel at Code Remix Summit, and every year I open with the same pitch: we’re here for the hard stuff. Not the demos that work on the first try, but the pioneering work nobody has fully figured out yet. The whole point is to compare notes, share what’s working and what isn’t, and learn from each other. The best part of running it is that these leaders actually do that. They’re honest about what’s still broken.

Last year, this same panel was about technical debt: whether AI could help teams dig out from years of upgrades and migrations. We landed on a careful yes. It helps, but it isn’t hands-free, and it helps only when you pair it with deterministic tools. This year, nobody wanted that debate. For teams like these, AI writing the code is a given. In a regulated shop with thirty years of legacy, it still isn’t, and that’s a different panel. Either way, the harder question is what breaks after the code gets written.

One number set the tone before we started. My buddy at Databricks told me his team now spends more on verifying AI-written code than on generating it. Verification costs them more than their token bill. The model was the cheap thing to buy.

So I got six people who live this every day in the same room, and we dug into the five places it actually hurts: verification, scale, speed, cost, and accuracy. Here’s what stuck with me.

Why does verifying AI code cost more than writing it?

The moment you let AI write code freely, you get a lot more of it, and all of it still has to be checked. That work doesn’t go away. It moves from writing to verifying, and verifying is the expensive part.

The panel’s answer was a little backwards: when AI causes the problem, use less of it to fix it. Dov Katz at Morgan Stanley builds his verification with AI up front, then takes the model out of the day-to-day. When his team runs an automated change across the codebase, the first check just confirms the change did what it promised, and more checks follow. The model helped design the safety net, but it isn’t holding the net every time code falls through. Those checks are cheap and predictable, and they don’t need a model running.

Rachel Laycock at Thoughtworks has a name for this: harness engineering. You build the guardrails into the agent’s environment so it gets things right on the way in, instead of catching mistakes on the way out. Brian Houck made the point I repeat most often: none of this is new. We’ve always needed good static analysis, linters that aren’t noisy, and tests that catch what a customer would. AI didn’t rewrite the rules. It just produced enough code, fast enough, that the gaps we used to tolerate started to hurt.

Jonathan Schneider described a sharper version, a kind of ratchet. The agent writes tests for every case it can think of, then a recipe (a scripted, repeatable change) to make those tests pass. It runs on one repo, then a wider set, and each new failure becomes another test. Nothing loosens. It only tightens. Run that same recipe across hundreds of repositories and the tests in one repo quietly cover for the thin spots in another.

There’s a sort of herd immunity of imperfect test coverage across many repositories that helps ensure the change is safer in any one of them.
— Jonathan Schneider, Moderne
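
To make the ratchet concrete, here is a minimal Python sketch. It is not Moderne’s or OpenRewrite’s implementation: the `Ratchet` class, the two toy recipes, and the `old_api`/`new_api` names are all hypothetical, and a "recipe" is reduced to a function from source text to source text. The only point it illustrates is that cases get added and never removed, so a recipe that passed yesterday can fail today.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stand-in for a recipe: source text in, rewritten text out.
Recipe = Callable[[str], str]


@dataclass
class Ratchet:
    """Accumulates (input, expected) cases. Cases are only ever added."""
    cases: list[tuple[str, str]] = field(default_factory=list)

    def add_case(self, source: str, expected: str) -> None:
        if (source, expected) not in self.cases:
            self.cases.append((source, expected))

    def failures(self, recipe: Recipe) -> list[tuple[str, str]]:
        """Return every recorded case the recipe gets wrong."""
        return [(s, e) for s, e in self.cases if recipe(s) != e]


def recipe_v1(source: str) -> str:
    # First attempt: plain string replacement.
    return source.replace("old_api(", "new_api(")


def recipe_v2(source: str) -> str:
    # Second attempt: also handles whitespace before the parenthesis.
    return re.sub(r"\bold_api\s*\(", "new_api(", source)


ratchet = Ratchet()

# Step 1: the case the agent thought of, checked against one repo.
ratchet.add_case("x = old_api(1)", "x = new_api(1)")
first_repo_failures = ratchet.failures(recipe_v1)

# Step 2: a wider rollout surfaces a spelling nobody anticipated.
# The failure becomes a permanent case.
ratchet.add_case("y = old_api (2)", "y = new_api(2)")
v1_failures = ratchet.failures(recipe_v1)  # v1 no longer passes
v2_failures = ratchet.failures(recipe_v2)  # v2 must pass old and new cases
```

In this toy run, `recipe_v1` passes until the second case is recorded, and `recipe_v2` has to satisfy both cases before it can replace it.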

Not everyone was selling the upside. Anshu Chadha at Uber pointed straight at the bill. All that speed burns compute: more code, more generated tests, more builds, and a real choice between buying more cores or changing how the work runs. Uber’s move is to stop treating every agent like a person who needs an answer right now. Work a developer is waiting on still has to come back fast, in twenty minutes or less. Work an agent is waiting on inside an orchestration can be pushed to 2 a.m., when the cores sit idle anyway. Jonathan had the right name for that: mainframe batch jobs, reinvented. Harald Aamot from SAP put it plainly. Point agents at a codebase with under 85% test coverage, and you’re asking for problems.
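
The split Anshu describes, fast turnaround when a person is waiting and off-peak batching when an agent is, can be sketched in a few lines of Python. This is an assumption-laden illustration, not Uber’s scheduler: the `Job` type, the `waiter` field, and the choice to model "off-peak" as a single 2 a.m. start time are all invented here.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta

# "2 a.m." comes from the panel; treating it as a fixed start time is an assumption.
OFF_PEAK_START = time(2, 0)


@dataclass(frozen=True)
class Job:
    name: str
    waiter: str  # hypothetical field: "developer" or "agent"


def schedule(job: Job, now: datetime) -> datetime:
    """Return the time at which the job should start."""
    if job.waiter == "developer":
        # A person is blocked on the result, so run it right away.
        return now
    # Nobody is watching the clock: defer to the next off-peak start.
    start = datetime.combine(now.date(), OFF_PEAK_START)
    if start < now:
        start += timedelta(days=1)
    return start


afternoon = datetime(2025, 1, 6, 14, 30)
dev_start = schedule(Job("pr-build", "developer"), afternoon)
agent_start = schedule(Job("bulk-test-gen", "agent"), afternoon)
```

A real system would also need queue limits and a way to promote a deferred job when a person starts waiting on it; the sketch leaves both out.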

How do you point AI at a billion-line codebase?

You don’t point it at the whole thing. The skill is deciding what slice of the code the model needs, and handing it only that.

Brian was blunt about why: none of us has a billion-token brain, and neither does the model, so a bigger context window was never the answer. Dov pushed for deterministic tools to find the right slice, like solid search and language-aware indexing, so the model isn’t burning tokens rediscovering structure it could have looked up.

Anshu reframed the whole thing as a people problem. Drop a strong engineer into an unfamiliar codebase and they narrow down fast: the right team, the right file, the right problem, before they touch anything. The people who write off AI tend to do the reverse. They aim a model at a huge problem with almost no context, then act surprised when it falls over. Uber had to teach its own engineers to break a problem into pieces and feed the model only what it needs, and that turns out to be the same skill whether a person or an agent is doing the work. Harald traced it back to Jez Humble and Dave Farley’s old rule from Continuous Delivery, divide and conquer, which is also why small, stackable OpenRewrite recipes hold up so well.

Jonathan’s was the version I keep chewing on. He worries less about sizing the task and more about sizing the data. Take the tangled structure of a codebase and flatten it, the way a painter puts a 3D street scene onto flat paper. Lately that means dumping the code into a plain table and loading it into a database, so the agent asks its questions in SQL instead of reading a wall of text. Rachel summed it up: what’s good for agents is usually good for humans. The knowledge-graph tricks her teams use to make sense of old mainframe code are years old. The tools are faster now, but a person still has to read what comes back and decide what it means, because an agent drowns in too much context the same way we do.
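
Here is a small Python sketch of the "flatten it into a table" idea, using only the standard library. It is an illustration under stated assumptions, not the tooling Jonathan described: the two source files are made up, the `calls` table schema is invented, and Python’s `ast` module stands in for whatever language-aware parser a real system would use.

```python
import ast
import sqlite3

# Made-up sources standing in for a real codebase.
SOURCES = {
    "billing.py": "def charge(user):\n    return audit(user)\n\ndef audit(user):\n    return True\n",
    "search.py": "def query(q):\n    return rank(q)\n\ndef rank(q):\n    return sorted(q)\n",
}


def call_rows(path: str, source: str):
    """Yield one (path, caller, callee, line) row per direct call by name."""
    for fn in ast.walk(ast.parse(source)):
        if not isinstance(fn, ast.FunctionDef):
            continue
        # Note: calls inside nested functions are also attributed to the outer one.
        for node in ast.walk(fn):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                yield (path, fn.name, node.func.id, node.lineno)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (path TEXT, caller TEXT, callee TEXT, line INTEGER)")
for path, source in SOURCES.items():
    conn.executemany("INSERT INTO calls VALUES (?, ?, ?, ?)", call_rows(path, source))

# The agent asks a narrow question instead of reading every file.
callers_of_audit = conn.execute(
    "SELECT path, caller, line FROM calls WHERE callee = ? ORDER BY path, caller",
    ("audit",),
).fetchall()
```

The query returns a handful of rows the model can reason about, which is the point: the structure was looked up deterministically rather than rediscovered from raw text.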

Why did AI turn code review into a bottleneck?

Review broke for a simple reason. AI multiplied the number of commits and pull requests without changing how teams approve them, so the pile landed on the most senior engineers, the ones who already had the least time.

Anshu put a hard number on it. Over a year at Uber, 66% of code reviews were approved without a single comment. That doesn’t prove nobody read the code, but it’s a bad sign, and it suggests a lot of review is theater for changes that were never risky. His fix is to stop pretending. Let low-risk changes land once a machine has cleared them, the way Spotify ships some deploys with no human in the loop, and spend senior engineers’ attention on the changes that can actually hurt you.

Brian came at it from cost. Human time is worth more than tokens, and he watched one team get a bigger gain from deleting bad meetings off the calendar than they got from AI on their pull requests. Rachel has argued against reviewing everything for years. She wants risk scoring to route the scarce, experienced people to the complex and dangerous parts of the system, not to rubber-stamp the routine stuff. The disagreement got good when Brian admitted he loves code review, not for catching bugs, but for the culture and the knowledge it spreads. Rachel’s answer was to ask whether he’d heard of pair programming, since the design talk and the mentoring he values belong up front, while the code is being written, not bolted on at the end. She wants code design, not code review.
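
Risk-based routing of the kind Anshu and Rachel argued for might look something like the Python sketch below. Every number and name in it is an assumption made for illustration: the sensitive path prefixes, the score weights, the thresholds, and the three queue names are not from the panel, and a real scorer would be tuned against an organization’s own incident history.

```python
from dataclasses import dataclass

# Hypothetical: directories where a mistake is expensive.
SENSITIVE_PREFIXES = ("payments/", "auth/")


@dataclass(frozen=True)
class Change:
    files: tuple[str, ...]
    lines_changed: int
    checks_passed: bool  # automated tests, linters, static analysis


def risk_score(change: Change) -> int:
    """Add up illustrative risk points; the weights are arbitrary."""
    score = 0
    if any(f.startswith(SENSITIVE_PREFIXES) for f in change.files):
        score += 5
    if change.lines_changed > 200:
        score += 3
    elif change.lines_changed > 20:
        score += 1
    if len(change.files) > 10:
        score += 2
    return score


def route(change: Change) -> str:
    """Decide who, if anyone, needs to look at the change."""
    if not change.checks_passed:
        return "blocked"  # nothing lands until the machine has cleared it
    score = risk_score(change)
    if score >= 5:
        return "senior-review"
    if score >= 1:
        return "peer-review"
    return "auto-merge"


docs_fix = route(Change(("docs/readme.md",), 4, True))
ledger_fix = route(Change(("payments/ledger.py",), 4, True))
```

The design choice worth noticing is that the automated checks gate everything, and the score only decides how much human attention sits on top of them.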

Dov offered something you could build today. Your version history holds years of expert review comments, and the useful signal isn’t only the comments, it’s the outcomes attached to them: which changes got approved after that feedback and which got rejected. Train a model on that and you get a passable first reviewer before any human looks. He also named a quieter win: good background agents let a senior engineer step away and come back without losing the thread, so they don’t have to sit heads-down for six hours to get anything done. Jonathan flagged the flip side, an instinct he sees people brag about, which is running eight agent sessions at once. Usually that’s not a win. It means the first seven were too slow to hand you anything to review, and now you have eight pull requests stacked up. Make one session fast enough and you’d only want two.

How should teams budget for AI token spend?

The short answer the panel kept circling: treat tokens like compute, give each developer a budget, and tie it to business results. Nobody claimed to have the forecasting figured out. I opened this part with a number that gets people’s attention, from a large US financial firm where a single engineer, not even a top performer, burned through ten billion tokens in a month.

Anshu spends most of his time on this now, and he was honest that it’s unsolved. He offered to buy beers for anyone who has cracked the forecast, and pointed out that the old all-you-can-eat enterprise deals are gone. His team treats spend like cloud capacity. A heavy user runs around $2,000 a month, and going past that needs a reason, because his CFO and CEO keep asking whether the bill is turning into revenue.

Brian felt the change firsthand.

We’re twelve days into the month and I’m already looking at my monthly quota ticking down. How in the world does the world operate like this?
— Brian Houck, DX

He’d gone from unlimited tokens at Microsoft to a hard budget at a smaller company, and the shock stuck. His read is that the people who come out ahead will be the ones who know exactly what they want to build and can describe it clearly, handing the model only the context it needs. Dov made the same point personal. If the money came out of your own pocket, you’d hurry to stop spending it, which is why his instinct is to build a reusable tool once and pull the repeatable work out of the token loop for good. Harald drew the line between building faster and building the right thing, and asked teams to justify the tokens they burn against real customer value. Anshu turned it around: spend too little and you can watch your lead shrink against startups with sharp engineers and no fear of the bill.

Jonathan zoomed out the furthest. Software has always been limited by the number of engineers you have, and given a budget, a developer spends it on what they want to build, not on maintenance. That’s why organizations end up with hundreds of repositories sitting on dead dependency versions nobody wants to touch. The only way that work gets done at scale is to move it onto deterministic systems, because no one is spending their own tokens on it.

Rachel had the sharpest version. Last year her whole job was getting people to use the tools at all. This year she blew through her token budget by the middle of Q2 and had the CFO conversation everyone is about to have. Her response wasn’t to clamp down on usage. It was to demand discipline: does this really need eight agents at once, does all of it have to be nondeterministic, does every job need a frontier model? When there’s no limit, you reach for the frontier model for everything. When there is one, you have to decide where that power is worth paying for, which is the same harness engineering that ran through the verification conversation. Constraints, as she put it, are the mother of invention.

How do you keep AI-generated code accurate at scale?

Accuracy doesn’t come from a better model. It comes from engineering rigor and from picking the right tool for the job. Code quality is a board-level topic now.

Dov makes accuracy something the model can’t skip. Write the requirements down, share them across teams, and turn them into tests the code has to pass, so quality isn’t a polite suggestion the model forgets halfway through. Brian’s reminder was simpler: measure outcomes, not activity, and keep the pre-AI fundamentals (guardrails, isolation, and staged rollouts), because none of this changed them.

Rachel was blunt. The teams getting both speed and trust from AI already had strong CI/CD and had already done the unglamorous work on their legacy code, and bolting harness engineering onto a shaky foundation won’t save anyone. Her advice was almost old-fashioned, to go reread the field’s own foundations like Continuous Delivery and Working Effectively with Legacy Code, because without rigor all AI does is ship risk and technical debt faster.

Anshu gave the most concrete rule of the day. Stop reaching for the LLM hammer on every task. When Uber couldn’t tolerate a small hallucination or a confident-but-wrong answer, they moved that work back to deterministic tools, AST-based transforms for code and pixel-by-pixel checks in Selenium for tests, because the job needed a sure thing, not a good guess. Harald widened it to the part everyone forgets. You don’t make money building an app in a few days, you make it running that app for ten years, which is why SAP is investing in OpenTelemetry to watch what the agents built long after they’ve built it.
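
For readers who haven’t seen one, this is roughly what an AST-based transform looks like, sketched in Python with the standard `ast` module. It is not Uber’s tooling or an OpenRewrite recipe, and the `old_api`/`new_api` names are invented. What it shows is why the approach is a sure thing: the rewrite matches call nodes in the syntax tree, so a string that merely mentions the old name is left alone, and the same input always gives the same output.

```python
import ast


class RenameCall(ast.NodeTransformer):
    """Rename direct calls to one function; touch nothing else."""

    def __init__(self, old: str, new: str) -> None:
        self.old, self.new = old, new

    def visit_Call(self, node: ast.Call) -> ast.Call:
        self.generic_visit(node)  # rewrite nested calls first
        if isinstance(node.func, ast.Name) and node.func.id == self.old:
            node.func = ast.copy_location(
                ast.Name(id=self.new, ctx=ast.Load()), node.func
            )
        return node


def rename_calls(source: str, old: str, new: str) -> str:
    tree = RenameCall(old, new).visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    # ast.unparse (Python 3.9+) regenerates source and drops comments and formatting.
    return ast.unparse(tree)


before = "note = 'call old_api() later'\nvalue = old_api(old_api(1))\n"
after = rename_calls(before, "old_api", "new_api")
```

One limitation to keep in mind: `ast.unparse` discards comments and original formatting, so a transform meant for real repositories would need a parser that preserves them.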

The thread running through all five

Step back and the same shape shows up in every topic. Generation is cheap and judgment is scarce, so the teams pulling ahead are the ones who let a deterministic layer carry the repetitive work, which frees their people to spend judgment where the real risk is. Deterministic tools for the work that should be repeatable, visibility for the work that still needs a human. Between them, they cover most of the hard problems this panel kept circling, and none of them require a better model to get started.