Your Codebase Was Not Built for AI. That's the Actual Problem.
Amazon's mandatory meeting about AI breaking production isn't an AI tools story. It's an architecture story. The codebases AI is being pointed at were never designed to be understood by anything other than the humans who built them.
This week, Amazon summoned its e-commerce engineers to a mandatory meeting to discuss a pattern of production outages caused by AI-assisted code changes. The internal briefing note described incidents with “high blast radius” and identified “Gen-AI assisted changes” as a contributing factor, noting that “best practices and safeguards are not yet fully established.” The response: junior and mid-level engineers now need a senior engineer to sign off on any AI-assisted changes to production.
The takes arrived on schedule. AI is overhyped. Vibe coding is reckless. We need more guardrails. Slow down adoption. Hire the humans back.
Every one of those reactions is addressing a symptom. The disease is elsewhere.
The Fix That Tells You Everything
One incident stands out. An AWS engineer tasked Amazon’s Kiro AI coding tool with fixing something in a production environment. The AI assessed the situation and determined the most efficient path to the desired state: delete the entire environment and recreate it from scratch. The software equivalent (as one observer put it) of fixing a leaky tap by knocking down the wall.
The recovery took thirteen hours. Amazon attributed the incident to “misconfigured access controls.” User error, not AI error. And technically, they are right. The permissions should not have allowed the action. But the diagnosis misses the more important question: why did the AI choose that path in the first place?
The answer is straightforward. The AI did not understand the system it was modifying. It could see the environment. It could see the desired state. It chose the shortest path between the two. In the absence of understanding (knowing why the environment was structured the way it was, what depended on it, what would break), deletion and recreation was a perfectly logical solution. Efficient, even. If you have no concept of what you are destroying, destruction is just a faster form of construction.
This is not a bug in the AI. It is the predictable behaviour of any system that can act on code it cannot fully comprehend.
The Bandwidth Illusion
The instinctive response to this problem is to point at context windows. Models are getting bigger. Context windows now stretch to 200,000 tokens, 400,000, a million, even ten million for some open-source models. Surely, if we can fit the entire codebase into the context window, the AI will understand it.
This belief is wrong in a way that matters.
Context window size is a measure of how much text an AI model can see at once. It is not a measure of how much it can understand. Research consistently demonstrates that model performance degrades well before the advertised context limit is reached. Information buried in the middle of long contexts gets lost, a phenomenon researchers call “lost in the middle.” A model with a million-token context window does not have a million tokens of comprehension. It has a million tokens of input and a degrading curve of attention that makes the 500,000th token significantly less useful than the 5,000th.
But even if context windows were perfect (even if a model could attend to every token with equal fidelity), the fundamental problem would remain. Most codebases are not structured as information that can be consumed in a single pass.
The knowledge that makes a codebase comprehensible to a human developer is not in the code. It is distributed across hundreds of files, implicit in naming conventions, buried in commit history, scattered across documentation systems nobody reads, and carried in the heads of the people who built it. A senior developer who has worked on a system for three years does not understand it because they have read every file. They understand it because they have absorbed thousands of micro-decisions through months of standups, code reviews, Slack conversations, post-mortems, and the slow accumulation of context that comes from watching a system evolve.
AI gets none of that. It gets what fits in the window. Everything else (the institutional knowledge, the architectural rationale, the “we tried that in 2023 and it broke the billing system”) it fills in with the most plausible-sounding continuation. Which is to say: it invents it. And the invented version looks exactly like the real version, until it doesn’t.
That is the “high blast radius” Amazon is experiencing. The AI’s changes are locally correct. The code it writes works in isolation. It passes the tests that exist. But it does not understand the system it is modifying, because the system exceeds what any model can hold: not in tokens, but in the kind of knowledge that makes a system comprehensible. The fix works on the file it touched and breaks something three services away, in a dependency the AI never saw because nobody documented it, because the human who knew about it left the company eighteen months ago.
Why Existing Codebases Are Hostile to AI
Every codebase older than a year carries a layer of implicit knowledge that no model can reconstruct from the source files alone. This is not a criticism of the people who built those systems. It is a description of how software has always been built: by humans, for humans, with the reasonable assumption that future maintainers would be human beings who could ask questions, read between the lines, and develop intuition over time.
That assumption no longer holds. AI coding tools are now maintaining, extending, and debugging these systems. And the systems were never designed for it.
Consider what a human developer does when they join a team and encounter a complex codebase for the first time. They do not read every file. They ask: “Why is this service structured this way?” They attend standups and hear about the migration that happened last quarter. They submit a pull request and get feedback saying “don’t touch that module, it has a hidden dependency on the payment service.” They build a mental model through interaction, not ingestion.
An AI coding tool gets the files. Sometimes it gets documentation. Rarely does it get the reasoning behind the architecture. Never does it get the institutional memory explaining why a seemingly redundant service exists or why a particular endpoint handles errors in a way that looks wrong but actually compensates for a bug in a third-party integration that was never fixed.
The result is exactly what Amazon is seeing. The AI makes changes that are syntactically correct, locally functional, and systemically dangerous. Not because the AI is stupid, but because the AI is operating on incomplete information and has no mechanism for knowing what it does not know. It cannot ask “why is this structured this way?” It can only observe the structure and extrapolate.
This connects directly to a point we made in our previous analysis of AI safety architecture. The most dangerous AI system is not the one that refuses to act. It is the one that acts confidently on incomplete information without signalling that its understanding is partial. A well-calibrated model says “I am not confident about this change and recommend human review.” A poorly calibrated one (or one operating on a codebase that gives it no basis for calibrating its confidence) makes the change with the same assurance it brings to every other change. The operator has no warning. The pull request looks identical to the hundreds of safe ones that preceded it.
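One way to make that calibration operational is a review gate that refuses to auto-merge when the tool's understanding is partial. The sketch below is hypothetical: it optimistically assumes the coding tool can report a confidence score and the set of files it actually read, and every name in it is invented for illustration.

```python
# Sketch: route AI-assisted changes to human review when confidence is
# low or when the change edits files outside the context the tool saw.
# ProposedChange, its fields, and the threshold are all hypothetical.

from dataclasses import dataclass

@dataclass
class ProposedChange:
    files_touched: set   # files the change modifies
    files_read: set      # files the tool actually had in context
    confidence: float    # self-reported, 0.0 to 1.0

def needs_human_review(change: ProposedChange, threshold: float = 0.8) -> bool:
    """Flag the change when the tool is unsure, or when it is editing
    files it never read: confident blindness is the dangerous case."""
    blind_edits = change.files_touched - change.files_read
    return change.confidence < threshold or bool(blind_edits)

# High confidence, but it edited a file it never examined: flagged anyway.
risky = ProposedChange({"billing.py"}, {"orders.py"}, confidence=0.95)
# Read both files it needed, stayed within them: allowed through.
safe = ProposedChange({"orders.py"}, {"orders.py", "billing.py"}, confidence=0.9)
```

The design choice worth noting is the second condition: self-reported confidence alone is not trustworthy, so the gate also checks whether the change strays outside what the tool demonstrably examined.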
Building for AI Means Building for Legibility
The organisations that will use AI coding tools effectively are not the ones with the best prompting strategies. They are the ones whose architecture is inherently comprehensible within model constraints.
This is a design problem, not a tooling problem. And it has a clear solution, even if the solution requires rethinking how software is structured.
The principle is simple: every component should be small enough to fit in a context window with room to spare, self-contained enough that the AI does not need to understand the entire system to work on it safely, and connected to other components through interfaces that are explicit and documented.
Think of it as the difference between a cathedral and a set of Lego bricks.
A cathedral is a single, interconnected structure where every stone depends on every other stone. Moving one element risks the whole. Understanding any part requires understanding the whole. This is what most production codebases look like: tightly coupled, deeply interdependent, and comprehensible only to the people who built them.
A set of Lego bricks is modular. Each brick has a defined shape, defined connection points, and can be assembled without understanding the full structure. You can hand someone a single brick and say “build this” and they can do it without knowing what the final model looks like. The connections are obvious. The constraints are physical.
Building for AI means building Lego, not cathedrals. Independent modules with clear interfaces. Each module small enough to fit in a context window. Dependencies declared, not implied. Architectural decisions documented alongside the code they govern, not buried in a Confluence page three clicks away that was last updated in 2024.
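"Dependencies declared, not implied" can be as lightweight as a manifest that lives in the repository and a check that enforces it. A minimal sketch, with entirely hypothetical module names and manifest format; the point is that the dependency graph is readable by the AI and by CI, not carried in anyone's head:

```python
# Sketch: each module declares what it is allowed to depend on.
# Module names and the manifest shape are hypothetical.

MANIFEST = {
    "billing":   {"depends_on": {"payments", "audit_log"}},
    "payments":  {"depends_on": {"audit_log"}},
    "audit_log": {"depends_on": set()},
}

def check_import(module: str, imported: str) -> bool:
    """Return True only if `module` has declared `imported` as a dependency."""
    declared = MANIFEST.get(module, {}).get("depends_on", set())
    return imported in declared

# A CI step could walk the real import graph and fail the build on any
# edge that is not in the manifest, so hidden dependencies cannot
# accumulate silently.
```

An AI handed the `payments` module can see from the manifest exactly what it may touch and what it must not, without reading the rest of the system.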
This is not a new idea. Good software architecture has always favoured modularity, separation of concerns, and explicit interfaces. What AI does is raise the stakes of not doing it. A human developer working in a tangled codebase is slow and frustrated. An AI working in a tangled codebase is fast and wrong. The mess that was merely inefficient for humans becomes actively dangerous when AI operates at machine speed with machine confidence.
AI Did Not Eliminate Project Management. It Made It Load-Bearing.
There is a seductive narrative in the AI coding space: the spec is dead. Just tell the AI what you want and it builds it. Requirements gathering, architecture documents, acceptance criteria: all relics of a slower era that AI has made obsolete.
This narrative has it precisely backwards. AI has not eliminated the need for clear specifications. It has made specifications the single most important input in the development process.
A human developer given vague requirements will do one of three things: ask clarifying questions, make reasonable assumptions based on experience, or build the wrong thing and explain why the requirements were insufficient. All three outcomes involve a feedback loop. The human recognises ambiguity and responds to it.
An AI given vague requirements builds something. It builds it fast. It builds it confidently. It does not recognise the ambiguity because it has no mechanism for distinguishing between a well-specified task and a poorly specified one. It fills the gaps the same way it fills every gap: with the most plausible continuation. The output looks professional. It passes a cursory review. It ships. And the gap between what was specified and what was needed reveals itself in production, where the cost is measured in outages, not in iterations.
The organisations blaming AI for production failures are, in many cases, blaming the wrong layer. The AI did what it was told. The problem is that what it was told was incomplete, ambiguous, and disconnected from the architectural context that would have made a good outcome possible.
Documentation is not bureaucracy. It is the input layer. And for AI, the quality of the input determines the quality of the output with far less tolerance for ambiguity than a human developer would require. The spec, the architecture decision record, the acceptance criteria: these are not overhead. They are the mechanism by which AI produces reliable work instead of confident fiction.
The Proof of Concept as Documentation
Here is the practical insight that separates this from every other “AI needs better specs” argument.
The documentation does not have to be traditional. It does not have to be a Word document, a Jira ticket, or an architecture diagram. It can be code.
A single-file proof of concept (a server.js or an index.py that demonstrates the hardest integration points, handles the most complex edge cases, and proves the core architecture works) is both a specification and a test. It is unambiguous because it runs. It fits in a context window because it is one file. It tells the AI exactly how the system should behave: not in natural language that can be interpreted six different ways, but in executable logic that either works or does not.
Build the hard parts first. Prove the most complex integration in a single file. Get it working. Then hand that file to an AI and say “extend this.” The AI now has a concrete reference implementation that answers every architectural question it would otherwise have to guess at. What format does the API return? Look at the code. How should errors be handled? Look at the code. What is the relationship between this service and that one? It is demonstrated, not described.
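A proof of concept in this shape might look like the following single file. Everything here is hypothetical (the vendor, the payload shape, the quirk it compensates for); what matters is that the file answers architectural questions by running:

```python
# Sketch of a single-file proof of concept. The vendor behaviour and
# field names are invented for illustration. The file is the spec: it
# answers "what does the API return?" and "how are errors handled?"
# in executable logic, not prose.

import json

def parse_vendor_response(raw: str) -> dict:
    """The (hypothetical) vendor returns a success-shaped body even on
    failure, with the error tucked into a `status` field. This handler
    looks wrong in isolation but compensates for that known quirk;
    the comment tells a future AI not to 'fix' it."""
    payload = json.loads(raw)
    if payload.get("status") == "error":
        raise RuntimeError(payload.get("message", "vendor error"))
    return payload["data"]

def charge(amount_cents: int, raw_response: str) -> dict:
    """Core flow: validate input, interpret the vendor, normalise the result."""
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    data = parse_vendor_response(raw_response)
    return {"charged": amount_cents, "vendor_ref": data["ref"]}

if __name__ == "__main__":
    ok = json.dumps({"status": "ok", "data": {"ref": "abc123"}})
    print(charge(500, ok))
```

An AI told to extend this file inherits the error-handling convention, the response shape, and the warning about the vendor quirk, none of which it would reliably guess from a prose ticket.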
This inverts the traditional development workflow. Instead of writing documentation that describes what the system should do and then building it, you build the minimum viable proof that demonstrates what the system does, and that proof becomes the documentation. The AI does not need to interpret a specification. It needs to extend a working system. And extending working code is something AI is genuinely good at; far better than interpreting ambiguous requirements and building from scratch.
The proof of concept also solves the modularity problem. If the proof is small enough to fit in a context window, every module the AI builds from that proof is also small enough. The constraint propagates. The architecture is legible by construction, not by discipline.
This is why every engagement we run starts with a working proof of concept, not a proposal deck. The POC proves the architecture before the investment scales. It becomes the reference implementation that AI and your team can extend with confidence.
The Bottom Line
Amazon’s mandatory meeting is not an AI tools story. It is an architecture story.
The codebases AI is being pointed at were never designed to be understood by anything other than the humans who built them. They carry implicit knowledge that no context window can reconstruct. They have dependencies that are known through experience, not documentation. They are comprehensible to a senior developer who has worked on them for years and opaque to everything else, including AI that can process a million tokens but cannot ask “why?”
The response to this (mandatory senior review of AI-assisted changes) is a necessary guardrail. But it is a guardrail that treats the symptom. The structural answer is to build systems that AI can actually understand. Modular. Documented. Small enough to fit in a context window with room to spare. Connected through explicit interfaces, not implicit knowledge. Proven through working code, not described in documents that diverge from reality the day they are written.
This is not a concession to AI’s limitations. It is good architecture. It always was. Modularity, separation of concerns, explicit interfaces, documentation that lives alongside the code: these have been best practice for decades. What AI does is enforce the standard. The codebases that ignored these principles and got away with it because talented humans compensated for the mess can no longer get away with it. The AI does not compensate. It extrapolates. And when it extrapolates from a mess, the result is a faster, more confident mess with a longer recovery time.
The organisations that will deploy AI coding tools successfully are not the ones with the most sophisticated prompting. They are the ones that recognised, before the outage, that architecture is the input layer and that building for AI means building for clarity.
Perth AI Consulting builds AI systems for organisations where reliability is not optional: architected for the way AI actually works, not the way marketing describes it. Start with a conversation.