Why AI Code Needs the Same Rigor We Should've Been Using All Along
On the gap between what we ask for and what gets built
This came out of a discussion on “Slop is not necessarily the future”. I commented that technical debt from sloppy code shows up too late to fix. Someone replied: “Humans also write sloppy code.” That’s absolutely right, but it got me thinking about what’s actually different when AI is involved.
The debate about “AI writes sloppy code” versus “humans write sloppy code too” misses the point. It’s not about who writes worse code. It’s about how well AI’s interpretation and human intention actually align.
I’ve been using AI to generate code pretty heavily. The problems I keep running into aren’t that different from the problems I’ve caused myself over the years. The difference is speed and volume. But there’s something specific that keeps nagging at me: when AI misunderstands what you want, it commits fully to the wrong interpretation. No clarifying questions. Just goes.
Two things I keep coming back to: the gap between what you meant and what got built, and the fact that you can’t predict which code will stick around.
The Gap: When AI Picks What It Thinks Is Reasonable (But Not What You Meant)
AI has extremely wide understanding. Ask it to solve a problem, it knows dozens of valid approaches. When your prompt is vague, it just picks one and runs with it.
Some examples I’ve hit:
- “Add error handling,” and it wraps everything in try-catch with console.log. I wanted typed error propagation so the caller could decide.
- “Make this faster,” and it rewrites the hot path with a clever optimization. Benchmarks look great. Two weeks later there’s corrupted data in edge cases I didn’t mention.
- “Add validation,” and it puts input checks at the API boundary when I meant the domain layer. Now validation is in the wrong place and the domain model still accepts invalid state.
Humans do this too. But humans usually ask clarifying questions first. AI just commits.
At any given moment, your understanding of what you need and AI’s interpretation of what you asked for are two different things. The code that gets written lives in that gap.
The Survival Problem: You Don’t Know What Will Stick Around
A Google engineer in the thread mentioned something that stuck with me:
“I think I calculated the half-life of my code written at my first stint of Google (15 years ago) as 1 year. Within 1 year, half of the code I’d written was deprecated, deleted, or replaced, and it continued to decay exponentially like that throughout my 6-year tenure there.
Interestingly, I still have some code in the codebase... I submitted about 680K LOC and 2^15 is 32768, so I’d expect to have about 20 lines left, which is actually surprisingly close to accurate (I didn’t precisely count, but a quick glance at what I recognized suggested about 200 non-deprecated lines remain in prod).”
680,000 lines down to ~200 in 15 years. But here’s the key: the author expected 20 lines based on exponential decay, got 200. 10x off. Even with a mathematical model, you can’t predict which code survives. And those 200 lines? Probably not the ones he’d have chosen to keep.
You write a quick fix to ship something. Three years later it’s still there, load-bearing infrastructure. The placeholder variable name is part of the public API.
AI makes this worse. You can generate a thousand lines of “just get it working” code in ten seconds. How much of that will still be running three years from now? No idea.
What I’ve Settled On: Test Everything, Then Test It Again
So you’ve got two problems: AI might not understand what you meant, and you can’t predict which code becomes permanent.
What I’ve settled on is 100% test coverage at every level. Yeah, that sounds extreme, and in practice you never actually get there. But treating it as the goal changes how you work.
Not just “write some tests.” Unit tests (does each piece do what it’s supposed to?), integration tests (do the pieces work together?), business logic tests (does it actually solve the business problem?), and system tests end-to-end. The unit tests catch “AI picked the wrong algorithm.” The integration tests catch “AI put validation in the wrong layer.” The system tests catch edge cases you didn’t know existed.
What took me a while to realize: tests aren’t just for catching bugs here. They’re for verifying that what got built is actually what you had in your head. The whole chain from your mental model to a natural language prompt to AI’s interpretation to generated code, every step is lossy. Tests are how you check whether the signal survived.
Where It Gets Iterative
Even with all that coverage, you’re only testing against what you currently understand. There are always gaps.
First pass: you write tests based on your understanding, AI generates code, tests pass, you think you’re done. Then you start poking at corner cases. What if the input is empty? What about two operations at once? You find gaps, add tests, some fail, code gets fixed.
Then you do something that feels weird: you ask AI to find the edge cases you missed. “What am I not testing?” Turns out AI is actually good at this, because it’s seen thousands of similar systems fail. It suggests scenarios you hadn’t considered. More tests. More failures. More fixes.
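The loop looks like this in miniature. `normalize_tags` is a stand-in helper I made up; what matters is the progression from happy path to the cases the questions surface:

```python
def normalize_tags(tags: list[str]) -> list[str]:
    # hypothetical helper: strip, lowercase, de-duplicate, keep order
    seen: set[str] = set()
    out: list[str] = []
    for t in tags:
        t = t.strip().lower()
        if t and t not in seen:
            seen.add(t)
            out.append(t)
    return out

# First pass: happy path, looks done.
assert normalize_tags(["Python", "AI"]) == ["python", "ai"]

# Second pass: the corner cases poked at later.
assert normalize_tags([]) == []                    # empty input
assert normalize_tags(["  ", ""]) == []            # whitespace only
assert normalize_tags(["a", "A", " a "]) == ["a"]  # dupes after normalizing
```

Each new assertion is a gap that the first pass quietly left open.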
I had this happen with a data processing pipeline. Happy path tests all passed. Then I started asking about mid-record stream failures, malformed data that passes validation but breaks downstream, concurrent workers hitting the same data. Half the new tests failed. Asked AI what else could go wrong. It came back with memory exhaustion, unavailable output destinations, crash recovery. Hadn’t thought about any of those. By the end I had a system that was genuinely solid, not because AI wrote perfect code, but because the back-and-forth kept closing gaps.
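The nastiest category from that pipeline was data that passes validation but breaks downstream. A toy version of the failure mode (both functions are invented for illustration):

```python
# Shallow validation: only checks that the field exists.
def validate(record: dict) -> bool:
    return "amount" in record

# Downstream step: silently assumes the field is numeric.
def process(record: dict) -> float:
    return record["amount"] * 1.1

bad = {"amount": "12.50"}  # a string, not a number
assert validate(bad)       # sails through validation...

try:
    process(bad)
    raise AssertionError("expected a downstream failure")
except TypeError:
    pass  # ...and blows up two stages later, where no test was looking
```

Happy-path tests exercise `validate` and `process` separately with clean data, so neither ever sees this record.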
Each iteration, you clarify what you actually need, AI understands better, and the tests protect code that might survive years.
Semantic Drift (Where Both Problems Hit at Once)
One specific thing that burned me: AI optimized a hot path in a system I maintain. Benchmarks looked great. Tests passed. Two weeks later, corrupted output in edge cases.
The optimization changed the semantics in a way my tests didn’t verify. Still a pure function in the common case, but not in the rare one. Code looked correct at the time. Passed everything. Hidden semantic shift just waiting to bite.
After that I added a rule: any AI-generated change needs tests that verify the semantics didn’t drift. If it’s supposed to be a pure function, write a property test that proves it. Idempotent? Run it twice and check. This isn’t about who or what wrote the code. It’s about having a process that verifies the code actually matches what you meant, and holds up over time.
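A sketch of what such a semantics guard can look like, with a made-up `dedupe_sorted` that is supposed to be both pure and idempotent. A real project would likely reach for a property-testing library like Hypothesis; this hand-rolls the idea with random inputs and a fixed seed:

```python
import random

# Hypothetical function under guard: supposed to be pure and idempotent.
def dedupe_sorted(xs: list[int]) -> list[int]:
    out: list[int] = []
    for x in sorted(xs):
        if not out or out[-1] != x:
            out.append(x)
    return out

rng = random.Random(42)  # fixed seed so failures are reproducible
for _ in range(200):
    xs = [rng.randint(-5, 5) for _ in range(rng.randint(0, 20))]
    snapshot = list(xs)
    once = dedupe_sorted(xs)
    # idempotent: applying it again changes nothing
    assert dedupe_sorted(once) == once, f"drifted on {snapshot}"
    # pure: the input list wasn't mutated
    assert xs == snapshot, f"mutated input {snapshot}"
```

An optimization that quietly starts mutating its input, or stops being idempotent in some rare shape of data, fails here even though every example-based test still passes.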
What Else Changed
Testing is the core, but other stuff had to tighten up too.
CI gates that don’t bend. Every AI-generated PR hits the same pipeline: tests pass, coverage stays above 90%, the build succeeds. We used to let things slide when rushing. When code gets generated this fast, the pipeline is what keeps everything else from sliding with it.
Code review changed focus. Used to be about catching mistakes. Now it’s: “Are these tests comprehensive enough? Did we verify the edge cases? Is this even the right approach?” The assumption is the code works. Review is about whether we’re solving the right problem.
One thing that surprised me: bug density for AI code vs human code, when both have the same test coverage? Basically no difference. The problem was never AI. It was misaligned requirements and untested processes. Maybe it always was.
The Hard Part
None of this is technically difficult. It’s cultural.
For years we treated tests as “nice to have” or “we’ll add them later.” Shipped fast, cut corners, celebrated velocity. AI makes that unsustainable. When code is cheap, the bottleneck moves. Writing code isn’t the expensive part anymore. Figuring out what you actually need, making sure what got built matches that, making sure it holds up over time. That’s the expensive part now.
nocman had a comment on HN about treating code as craft: it’s not optional, it’s how you build things that last. I agree, but not in the way most people mean it. Craft isn’t about hand-writing every line. It’s about knowing exactly what’s in your system and why. It doesn’t matter who wrote it.
If you’re using AI to generate code but not investing in this kind of iterative verification, you’re building on quicksand. Some of that code will be fine. Some will survive for years. You won’t know which is which until it’s too late.
The answer isn’t “use AI less.” It’s: build the process around it. Tests at every level. Iterative gap-closing. CI that actually enforces things. Review focused on approach, not syntax. Not because of who or what writes the code. Because you need a process that makes sure the code matches what you meant, and survives what comes next. That’s not something you can wing.
This came out of a HN discussion where someone pointed out that humans write sloppy code too. They’re right. The question isn’t who writes sloppier code. It’s how you manage quality when code gets generated 10x or 20x faster, maybe more, and that speed amplifies both the gap between what you ask for and what gets built, and the uncertainty about what will survive.
The “craft” argument (shout-out to nocman) is right, but people implement it wrong. Craft isn’t writing every line by hand. It’s going through the iterative process of verification and alignment, whether you wrote it or AI did.