
What I Think About AI Engineering

A 14-part series on AI, engineering culture, and building software in the age of generative tools.

AI Is an Amplifier, Not a Replacement

AI amplifies what's already true about an engineer or a team. Used well, that's order-of-magnitude gains. Used lazily, it's technical debt on autopilot.

If your team has strong fundamentals, clear conventions, and good testing discipline, AI makes all of that more powerful. You move faster. You tackle harder problems. You spend less time on implementation work that used to be the bottleneck. The gains are real. Order of magnitude, not percentage points.

If your team is sloppy, expect sloppier output, faster. Vague requirements produce broken features at speed. Untested code accumulates. A developer who accepts every diff without understanding it will build you a house of cards faster than anyone can review it.

What you bring to the table is what gets amplified.


Mehran Sahami, chair of Stanford's CS department, said: "Computer science is about systematic thinking, not writing code." That was always true. AI makes it undeniable.

When only a small portion of the population was literate, the physical act of putting words on paper was considered intellectual work. People took pride in calligraphy. As literacy spread, "writing" stopped meaning penmanship and came to mean arranging ideas into a readable format. The physical act was commoditized. The thinking wasn't.

The same shift is happening to programming. Transcription is being handled by machines. What remains is knowing what to build, how it fits together, what the tradeoffs are, what the system needs to hold together in five years. That's not a diminishment. It's a clarification of what the job always was.


The developers who get the most out of AI are the ones who knew what they wanted before the AI wrote it. They come to a task with a sense of what good looks like and use AI to get there faster. They can tell when the output drifts. They push back.

The ones who struggle treat AI as an oracle. They ask it to solve the problem from scratch and accept the first plausible output. The code looks right. It may even be right, for now. But they don't really know, and that gap shows up in a debugging session, a code review, a production incident.

Senior engineering always meant knowing what you want before you write it and being able to detect when something's wrong. AI didn't create that requirement. It raised the cost of skipping it.


The same logic applies to teams. Teams with strong conventions and high-quality process get more out of AI than teams without them. The investment in documentation, clear architecture, and well-maintained tests now compounds differently: AI without deep context caps out quickly, while AI with it performs like a different tool.

Teams that dismiss AI are leaving a real multiplier on the table. Teams that adopt it without the fundamentals will find they've just accelerated the accumulation of problems they already had.

The question isn't whether to use AI. It's what state you're in when you do.

Where AI Helps (and Where It Doesn't)

The most useful question to ask about any AI coding tool isn't "is it good?" It's: what constraint does it actually relieve?

AI relieves implementation throughput. It's fast at boilerplate, fast at scaffolding, fast at translating a well-specified problem into working code. On greenfield projects (fresh codebase, clean problem, minimal history) that's a lot of the work. You can build something in a day that would have taken a week.

Most engineers aren't on greenfield projects. They're on codebases with years of accumulated decisions, implicit conventions, and business logic that evolved from something else. The context that matters (why a system was designed a certain way, what the implicit rules are, what breaks when you touch a particular component) lives in people's heads. It isn't written down. AI doesn't have it.

That's not a model failure. It's a category mismatch.


The pattern holds across different types of constraints.

When the bottleneck is implementation skill, AI helps a lot. It compresses the work. When the bottleneck is domain knowledge (understanding a complex financial product well enough to model it correctly, knowing why a particular edge case in a 15-year-old system is handled the way it is) AI helps less. It can surface patterns from training data, but it doesn't know your domain the way a tenured engineer does.

When the bottleneck is organizational, AI might not help at all. Regulatory approval takes as long as it takes. Getting three teams to agree on an interface is a people problem, not a coding problem. Producing a working prototype before the actual constraint is resolved doesn't resolve the constraint. Sometimes it creates a new one, because now there's a demo and stakeholders have opinions.

Attaching a code-generating tool to a bottleneck that isn't code generation doesn't unblock you. It just produces more code upstream of it.


There's a specific trap on mature codebases worth naming. The complexity means the model is constantly making changes that are technically plausible but contextually wrong. A senior engineer I know described it plainly: AI can speed up initial engineering time significantly, but that saved time often gets consumed in extended review, fact-checking, and remediation. Net zero. The codebase has nuance the model doesn't know about, and catching its mistakes takes real attention.

Greenfield is genuinely different. The context gap is smaller because there isn't much context yet. Constraints are mostly implementation constraints, which is where AI is strongest. This is where a lot of the dramatic productivity stories come from (someone builds an entire working service in a day), and those stories are real. It's just not the situation most developers are in most of the time.


Before reaching for AI on any task, it's worth asking what's actually slowing you down. If the answer is typing speed or implementation volume, AI will compress that. If the answer is that you don't fully understand the problem, that the domain is unclear, or that the real obstacle is a conversation you haven't had yet, AI can't fix any of that. It'll give you something that looks like progress while the actual problem waits.

The Ratios Shift, But the Real Work Stays

"Coding is 20% of the work. Testing and everything else is 80%." Most engineers will nod at this and then act as if merging the PR means the job is done. AI is making that self-deception harder to maintain.

When code generation gets 10× faster, it doesn't make software delivery 10× faster. It makes the 20% faster. The 80% is still there: alignment, planning, scoping, code review, validation, debugging, and the ongoing work that doesn't stop once a feature ships. Speed up one part of a pipeline and you don't accelerate the whole thing. You surface where the next constraint lives.
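The arithmetic here is Amdahl's law, and it's worth running once. A minimal sketch in Python, using the cliché's own illustrative fractions rather than measured data:

```python
def overall_speedup(fraction_accelerated, stage_speedup):
    """Amdahl's law: end-to-end speedup when only one stage gets faster."""
    untouched = 1 - fraction_accelerated
    return 1 / (untouched + fraction_accelerated / stage_speedup)

# Coding is 20% of the work and gets 10x faster...
print(round(overall_speedup(0.20, 10), 2))   # ~1.22x overall, not 10x

# ...and even infinitely fast coding caps out at 1/(1 - 0.20) = 1.25x.
print(round(overall_speedup(0.20, 1e9), 2))  # ~1.25x
```

The exact percentages are the cliché's, not a measurement; the shape of the result is the point. Accelerating the 20% cannot deliver more than a 1.25× improvement on its own.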

This isn't a new observation. DORA research has shown for years that the teams that improve most aren't the ones that type faster. They're the ones that invest in deployment pipelines, review culture, and test coverage. What's changing is that the implementation gap is closing fast, which means those other factors are increasingly what separates high-performing teams from the rest.


The downstream pressure is real and specific. More code is being produced, but the processes for checking it haven't scaled with it. Review queues back up. A senior engineer I worked with described it plainly: the time AI saves in initial implementation often gets consumed in extended review, fact-checking, and remediation. Net zero, or close to it.

Some of this is a temporary adoption problem. Teams haven't yet built the habits and tooling to validate AI-generated code efficiently. But some of it is structural. The model doesn't know your system. It doesn't know why something is the way it is, or what other parts of the codebase depend on a given assumption. Catching those mistakes requires the same judgment it always did.


There's a version of this that lands wrong: "AI doesn't actually help because it creates review overhead." That's not the point. The point is that you don't get the gains by treating AI as a code vending machine and skipping the parts that were always expensive. The gains come when generation speed and validation quality move together.

One line from a Pragmatic Engineer reader thread stuck with me: "It's not Claude's fault you don't do adequate planning or design." Good software engineering was never mostly about writing code. AI didn't change that. It just made the gaps harder to ignore.

Faster Output Demands a Higher Quality Bar

Speed without proof just moves the cost somewhere else. To the reviewer who has to catch what you didn't. To the on-call engineer who finds out at 2am. To the next developer who inherits code that looked right but wasn't.

This is the part of the AI productivity story that gets skipped. When code generation compresses, validation doesn't compress with it. The ratio shifts but the proof requirement doesn't.

Our job as engineers isn't to produce code. It's to deliver outcomes for the people using what we build. Code is the mechanism, not the point. A feature that ships fast and breaks in a real user's hands didn't ship. Including proof that it works is the actual job.


Manual testing first. If you haven't seen the code do the right thing yourself, it doesn't work yet. Finding out from a reviewer or from production isn't the same thing as knowing.

Once you've covered the happy path, start breaking it. Edge cases, error conditions, unexpected inputs. This is a skill AI doesn't have. It can generate tests if you describe what to test, but it can't tell you what you haven't thought to test. That requires someone who understands the system well enough to anticipate how it fails.

Automated testing follows. Easier now than it's ever been. AI generates test scaffolding quickly, so there's no longer a good excuse for skipping it. The bar has moved precisely because the tooling got better.
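To make the happy-path/edge-case distinction concrete, here's a sketch in Python around a hypothetical parse_quantity function (both the function and its rules are invented for illustration). The first assertion is what a first pass, human or AI, usually covers; the loop is the part that requires knowing how inputs actually go wrong:

```python
def parse_quantity(raw):
    """Parse a user-supplied quantity string into a positive int.

    Hypothetical example function; the tests below it are the point.
    """
    if raw is None:
        raise ValueError("quantity is required")
    text = raw.strip()
    if not text.isdigit():
        raise ValueError(f"not a number: {raw!r}")
    value = int(text)
    if value == 0:
        raise ValueError("quantity must be positive")
    return value

# Happy path: the part that's easy to generate and easy to check.
assert parse_quantity("3") == 3

# Edge cases: the inputs you have to think to test.
for bad in [None, "", "   ", "0", "-1", "2.5", "1e3"]:
    try:
        parse_quantity(bad)
        assert False, f"expected ValueError for {bad!r}"
    except ValueError:
        pass
```

AI can generate the loop once you name the cases. Naming the cases is the skill.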


Watch a senior engineer use AI and it looks like magic. Complete features in minutes, tests included. But look closely at what they're actually doing. They're not accepting output. They're shaping it. They came to the task knowing what good looks like, they're detecting drift when it happens, and they're correcting it. The AI accelerates their implementation. Their judgment is what keeps it sound.

Junior engineers often miss this. They accept output more readily, move faster, and produce what gets called "house of cards code" (it looks complete until real-world pressure is applied).

The difference isn't that one uses AI and the other doesn't. It's that one knows what they want before they start, and can tell when they're not there yet. That's always been what distinguishes senior work. AI didn't raise that bar. It made clearing it harder to fake.


Speed matters because customer outcomes matter. Getting something working in front of a real user faster is genuinely valuable, since that's how you learn what to build next. But "faster" only counts if what you shipped actually works. A faster feedback loop built on broken code isn't a feedback loop. It's noise.

The instinct to move faster is right. The mistake is treating generation speed and validation rigor as a tradeoff. They're not. More output with weak validation means more review burden, more incidents, more debt accumulating faster than anyone can pay it down. More output with strong validation means durable velocity. The discipline is what makes the speed stick.

You Are the Compiler Operator

A compiler doesn't write programs. It translates them. You give it source code with clear syntax, type constraints, and defined semantics, and it produces something runnable. Give it ambiguous input and you get errors. Give it nothing and you get nothing.

AI works the same way. Instead of code, you're feeding it intent. The quality of what comes out is a direct function of how clearly you specified what went in. You are the programmer. The model is the compiler. Your job is to write specs it can work with.


The most important specs aren't the ones you write in a prompt. They're the ones you establish before you open a chat window.

Stack, libraries, database, deployment setup, design system, architectural invariants (the rules your system must never break, like "this service is always read-only" or "all writes go through this queue"). One developer I follow called these "the laws of physics" for a project. They define the space the model is allowed to work in. Leave them undefined and you'll get technically plausible output that doesn't fit your system. Changing those decisions later is expensive.
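One way to keep such an invariant from living only in people's heads is to encode it as an executable check. A minimal sketch in Python, assuming a hypothetical read-only reporting service and a hypothetical forbidden import (neither is a real API):

```python
def read_only_violations(sources, forbidden="from app.db import write_session"):
    """Return the files that break a hypothetical read-only invariant.

    `sources` maps file path -> file contents; in CI you'd build it by
    globbing the service's directory. The service layout and the
    forbidden import string are both illustrative.
    """
    return sorted(path for path, text in sources.items() if forbidden in text)

violations = read_only_violations({
    "reporting/views.py": "from app.db import read_session",
    "reporting/export.py": "from app.db import write_session  # oops",
})
assert violations == ["reporting/export.py"]
```

The specific check matters less than the move: an invariant written as code is a "law of physics" that the model's output, and everyone else's, gets tested against automatically.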

The instinct to start generating immediately is understandable. It feels productive. But a model working without constraints isn't faster, it's undirected. You'll spend more time steering it back than you saved by starting early.


The compiler analogy also explains where things go wrong. A compiler can only work with what it's given. It doesn't know what you meant. It doesn't know what came before or what will come after. It produces output that satisfies the spec you wrote, not the intent you had.

AI has the same ceiling. If the context is shallow (a one-line prompt, no codebase conventions, no description of constraints) the output fills in the gaps with plausible-sounding defaults. Those defaults may be reasonable in isolation and wrong for your situation.

One line I keep coming back to: "Claude might give you a Ferrari, but it doesn't stop it from driving into a wall if you let go of the wheel." The operator is always responsible for the outcome. That's not a limitation to work around. That's the nature of the tool.


Defining constraints upfront isn't overhead. It's the work. The engineers getting the most out of AI invest in context: clear convention files, well-documented architecture, explicit invariants, tests that encode intended behavior. They're not writing more prompts. They're building a better compiler target.

AI-Native XP: A New Workflow Emerges

Kent Beck published Extreme Programming Explained in 1999. The core idea was simple: if short feedback loops are good, make them as short as possible. If testing is good, do it first. If integration is good, do it continuously. Take every practice you already believe in and turn up the intensity.

XP gave teams small stories, test-driven development (write the failing test before writing the code that makes it pass), pair programming, trunk-based development, and merciless refactoring. It was a direct reaction to the bloated, plan-heavy methodologies of the era. It worked well for teams that committed to the discipline.

Most developers under 35 have never worked in an explicitly XP shop. But right now, many of them are rediscovering its practices without the name, under pressure from AI tooling.


The workflow that's actually producing results with AI, assembled from what engineers across teams have figured out:

  • Break work into small, task-specific prompts. One problem at a time. Reconstruct context for the next task rather than carrying a sprawling conversation forward.
  • Prompt with tests and usage examples. Show the model what "done" looks like before asking it to build.
  • Run, observe, adjust. Keep the loop tight. Don't queue up ten changes and review them together.
  • Commit every acceptable outcome. Hard reset when it goes sideways. Don't try to negotiate a broken session back to working state.
  • Merge to trunk. PR-based processes get overwhelmed by the volume AI can generate. A lighter integration cycle keeps things moving.
  • Trust deterministic sources of truth: the actual code, test output, linter results. Not what the model says the code does. What it actually does.
  • Keep separation of concerns clean. Smaller blast radius per change means less to review and less to undo.

Every item on that list maps directly to an XP practice. That's not a coincidence; it's convergent evolution. Engineers are arriving at the same discipline because it solves the same problem: how to stay in control when output is fast and mistakes are cheap to make but expensive to accumulate.


The parallel that matters most is TDD. Prompting with tests isn't just a quality practice. It's the clearest way to specify intent. A test is unambiguous: it either passes or it doesn't. When you give a model a failing test and ask it to make it pass, you've given it a compiler target in the most literal sense.
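In practice that means writing the test before the prompt. A minimal sketch in Python, using a hypothetical slugify function: the test is the part you hand to the model as the spec, and the implementation beneath it stands in for what the model produces.

```python
# Spec-as-test: written first, handed to the model as the target.
# `slugify` is a hypothetical function; the test defines what "done" means.
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  spaces   everywhere  ") == "spaces-everywhere"
    assert slugify("") == ""

# A minimal implementation that satisfies the spec (the model's side of the loop):
import re

def slugify(text):
    words = re.findall(r"[a-z0-9]+", text.lower())
    return "-".join(words)

test_slugify()  # either it passes or it doesn't; there's nothing to argue about
```

Because the spec is executable, "looks right" never enters into it. Drift from intent shows up as a red test, not as a judgment call three reviews later.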

The one XP practice that doesn't translate directly is pair programming. What replaces it is something different: a developer maintaining a birds-eye view of the system while the model handles implementation detail. The model can't hold the big picture reliably. It doesn't know how a change three files away affects what it's writing now. That responsibility stays with the person at the keyboard.


Naming this matters. Teams that stumble into these practices organically benefit from them but can't teach them, defend them under pressure, or improve them deliberately. "We work in small chunks and test everything" is a habit. "This is Extreme Programming, and here's why it works" is a framework someone can act on.

Beck figured out the forcing function in 1999. Then it was software complexity. Now it's AI velocity. The answer looks the same.

Strategy in the Age of LLM Wrappers

Most so-called "AI-powered" products follow the same pattern: take a task someone does manually, write a prompt that automates it, wrap it in a clean UI, and ship. Some of these are genuinely useful. Most aren't defensible.

The tell is simple. If your entire product could be replicated by a developer with an API key and a weekend, it's not a product. It's a demonstration.


The dependency chain of a typical AI wrapper runs like this: the product sits on top of OpenAI, which runs on Azure, which runs on NVIDIA. Nobody in that chain except NVIDIA is difficult to displace. The wrapper is the most exposed position in the stack. That's where most of what's being built right now lives.

This doesn't mean you can't build a real business on top of foundation models. It means the business has to be about something other than the model. The model is infrastructure. What you build on top of it, and for whom, is the actual question.


The moats that hold up tend to come from assets the model can't provide. Proprietary data is the most discussed: a model fine-tuned on your company's historical interactions, domain corpus, or customer behavior does things a general-purpose model can't replicate. The data is the asset.

Less discussed, and harder to copy, is deep customer outcome knowledge. Knowing not just what your customers do but why, what they're actually trying to accomplish, and where the friction is between their current state and that goal. That takes years of proximity to the problem. It can't be prompted into existence.

Regulatory lock-in and becoming a program of record are the quieter moats. If your product is the system of record for a compliance workflow, or switching away requires a regulatory re-certification, you have durability that has nothing to do with model capability.


The framing shift that matters here is from Minimum Viable Product to Minimum Productive Outcome. MVP made sense when the bottleneck was shipping: build the smallest thing that demonstrates the idea, learn, iterate. When AI compresses build time, shipping fast enough to learn is no longer the constraint.

Minimum Productive Outcome asks a different question: what's the smallest thing that produces a real result for the customer? Not a demo. Not a prototype. An outcome. The success metric moves from "did we ship" to "did it work for the person using it."

On pricing: the race to be cheapest in a commoditized market is a losing strategy in any industry. Pricing power comes from solving substantial problems or saving meaningful time. Customers will pay for that. The market is consistent on this point regardless of what the underlying model costs.


The products that matter in five years won't be the ones with the best prompt engineering. They'll be the ones that accumulated proprietary data, developed genuine depth on a customer problem, and built something a user would feel the cost of losing. None of that is prompt-dependent.

How Slop Happens

Casey Newton popularized the term "AI slop" in 2024 to describe the flood of AI-generated content spreading across the internet: abundant, superficially competent, and utterly devoid of intent. The term stuck because it named something people were already feeling.

In software, slop takes a different form. It isn't obviously bad code. Obviously bad code gets caught in review. Slop is the code that passes review because each individual change is defensible. It's the tenth reasonable decision in a row that produces a system nobody can explain.

This is what makes AI-assisted slop different from the slop that existed before. The volume is higher, the pace is faster, and the individual outputs look clean. The degradation is architectural, not syntactic. It happens at the level of the whole, not the part.


The most common vector is acceptance without understanding. A developer asks the model for a solution, gets something plausible, and moves on. Done once, this is fine. Done repeatedly across a codebase, it produces architecture nobody designed. Patterns that contradict each other. Abstractions that made sense in one context, then got replicated in five others where they don't. Six months later, someone asks why a service works the way it does and nobody knows. Not because the code is undocumented, but because nobody ever actually decided it should work that way.

ThoughtBot framed this precisely: the tooling starts to control the narrative if you let it. You end up unable to explain your own architecture decisions because you never really made them. You just clicked accept.

Research presented at CHI 2025 found a significant negative correlation between AI tool usage and critical thinking among knowledge workers, a pattern the researchers described as cognitive offloading. The ease of generation diminishes the depth of evaluation. That's the mechanism. Clicking accept isn't laziness. It's a habit the tooling actively encourages.


A second failure mode is subtler: side quest overproduction. Lower friction to prototype means it's easy to spin up something new whenever the main problem feels hard. A developer hits a blocker, pivots to a tangential idea, and builds a working proof of concept in two hours. Multiply that across a team and you get a lot of impressive demos and a main mission that's drifting.

The constraint that used to slow this down was implementation cost. Building something took enough time that you had to decide it was worth building. When that cost drops close to zero, the decision discipline has to come from somewhere else. It doesn't appear automatically.

There's a subtler version of this problem that plays out at industry scale. A 2024 study in Science Advances found that while generative AI increased individual creative output, it significantly reduced collective diversity. People using AI produced work that was more polished but more convergent, gravitating toward the same patterns and solutions. The same dynamic shows up in codebases. AI-generated code tends toward whatever is most common in training data. Every team independently reaching for the same tool ends up with architectures that look alike, make the same tradeoffs, and carry the same blind spots.


Measurement incentives are worth naming because organizations are already falling into this trap. Tracking AI usage, commit volume, or features shipped per sprint: any metric that captures output rather than outcome creates pressure to optimize for the metric. Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Engineers who produce more get rewarded. Whether what they produced holds together is harder to measure, so it gets measured less.

The teams getting this right are measuring what they always should have: system reliability, time to resolve incidents, test coverage, how long it takes a new engineer to get productive. AI doesn't change those metrics. It just makes them matter more.
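Those outcome metrics are mundane to compute, which is part of their appeal. A sketch of one of them in Python, mean time to resolve, over hypothetical incident open/close timestamps:

```python
from datetime import datetime, timedelta

def mean_time_to_resolve(incidents):
    """Mean time to resolve, given (opened, resolved) datetime pairs."""
    durations = [resolved - opened for opened, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)

# Illustrative data, not real incidents.
incidents = [
    (datetime(2025, 1, 3, 9, 0), datetime(2025, 1, 3, 10, 0)),   # 1 hour
    (datetime(2025, 1, 8, 14, 0), datetime(2025, 1, 8, 17, 0)),  # 3 hours
]
assert mean_time_to_resolve(incidents) == timedelta(hours=2)
```

Nothing here depends on AI, and that's the point: these numbers measure whether the system holds together, regardless of who or what wrote the code.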


The through-line is that slop is a process failure, not a tool failure. The model does what you let it do. Review catches what the reviewer knows to look for. Architecture stays coherent when someone maintains a view of the whole. None of that happens automatically. It requires the same discipline AI was supposed to free us from, which is why teams that skip it find they've traded one kind of technical debt for a faster-accumulating version of the same thing.

The Moat Shifts to Judgment

Syntax fluency was never the real moat, but it used to matter a lot. Knowing the language well, knowing the framework, knowing how to structure a solution quickly. These created real differences between engineers. AI is collapsing those differences fast. What's left is harder to learn and harder to fake.

Judgment. Knowing what to build, knowing when something's wrong, knowing when to stop. Taste, in the design sense: an internalized sense of what good looks like that operates faster than explicit reasoning.

David Hume argued in 1757 that taste isn't a fixed trait or innate gift. It's a faculty, developed through practice, sharpened by comparison, and refined by honest reckoning with the gap between what you thought was good and what turned out to be. That's a useful frame for engineering too. The senior engineer's judgment isn't mystical. It's accumulated. It took years of building things, breaking them, and paying attention to what failed and why.


There's a split developing in engineering culture worth naming. On one side are what you might call Experimenters: developers who've leaned into AI tooling aggressively, automated as much as possible, and are shipping faster than they ever did. On the other are Guardians: developers who believe understanding code at a fundamental level is non-negotiable, who worry about correctness and maintainability, and who are skeptical of output they didn't trace through themselves.

Both are partially right. Experimenters are correct that speed matters and the tooling is genuinely powerful. Guardians are correct that fast output without understanding produces systems that are fragile, expensive to maintain, and difficult to hand off.

The prediction that Experimenters will win because technology trends toward convenience misses what happens after you ship. Software evolves. Someone has to maintain it, debug it, extend it in ways nobody anticipated. The Guardian's concern isn't abstract. It's a description of what happens two years after the Experimenter ships and moves on.

There's also something the Guardians are protecting that they don't always have words for. Graham Wallas described creative work as having four stages: preparation, incubation, illumination, verification. Incubation is the slow, largely unconscious phase where exposure becomes understanding and understanding becomes judgment. AI collapses that timeline. When you skip it, you don't eliminate the need for judgment. You expose its absence faster. Guardians are instinctively resisting that collapse. The instinct is right even when the reasoning isn't fully articulated.

The engineers building durable careers aren't choosing between these postures. They're learning to hold both.


The disciplines that compound at senior levels haven't changed much, but their relative weight has. Technical skill is still required. What's changed is that it's no longer sufficient on its own.

Product thinking (knowing what's worth building) is harder to acquire than coding ability and more valuable at the margin. Project execution (making sure things actually ship) is underrated in technical culture. People skills, the ability to influence, align, and develop others, have always separated the engineers who advance from the ones who don't.

The biggest gains come from combining these. An engineer who can write excellent code, articulate what the customer actually needs, drive a project through ambiguity, and bring a team along isn't competing on syntax. They never really were.


There's a version of this that gets stated as "the moat is soft skills now" and that framing is wrong. The technical foundation still matters. What's changed is that technical skill is increasingly table stakes, and differentiation happens above it.

One useful analogy: anyone can buy chocolate-making equipment and start a brand today. The ingredients are commoditized, the manufacturing is understood. What Hershey's and Lindt have isn't better cocoa. It's distribution, brand trust, and decades of understanding what makes someone reach for one bar over another. Software is moving toward the same dynamic. The winners won't be the best implementors. They'll be the teams that understood their users better, got there first, and built something people felt attached to.


If someone could generate the implementation for you in an afternoon, what would you still be the best person in the room to contribute? That's the moat.

In a world where everyone has access to the same generative tools, the differentiator isn't what you can produce. It's what you can see.

Three Things I Keep Coming Back To


Part 10 of the series · published 2026-04-01 by anveo

Context is the ceiling. The limiting factor in any AI-assisted engineering session isn't model capability or prompt quality. It's how much the model actually knows about the system it's working in.

The model might know everything about programming in general and nothing about your system specifically. That gap is where sessions fall apart. Why does this component work the way it does? What implicit rule governs this service? What's the business constraint behind this particular decision? None of that is written down anywhere a model can find. It lives in the heads of the engineers who built the thing, accumulated over years of decisions that didn't feel like decisions at the time.

The investment that pays compound returns isn't better prompting. It's making that context explicit: documented conventions, clear invariants, files that encode how the system works and why. AI with deep context performs like a different tool than AI without it.


Validation is the bottleneck. Several pieces in this series have circled this point from different directions, because it keeps being the thing that's actually constraining teams. AI accelerated code generation before most teams built the validation muscle to keep pace with it.

It was always about the edge cases, verification, and catching regressions. Experienced engineers knew this. "Coding is 20% of the work" is a cliché precisely because it's true and people keep forgetting it. AI just made forgetting it more expensive.

The bottleneck won't be fixed by better models. A more capable model generates more code, which requires more validation, which deepens the bottleneck. The teams that solve it invest in test infrastructure, review culture, and the discipline to treat unproven code as unfinished work. Those investments existed before AI. They're just more urgent now.


Seniority is the multiplier. This one cuts against a narrative worth pushing back on: that AI is an equalizer, that a junior developer with good tooling can produce what a senior would.

In a narrow sense, sometimes true. In a broader sense, it misses where the value actually sits. A senior engineer gets more out of AI not because they're faster with the tools but because they bring context the tools don't have. They know what they want before asking. They catch drift when it happens. Their usefulness is augmented by knowing when the model is wrong, which requires the kind of judgment that only comes from building things, breaking them, and learning what failure looks like up close.

That's not automatable. It's the product of experience, and AI doesn't manufacture experience.