AI Isn’t Software, It’s a Factory: Applying Lean Manufacturing Principles to AI Systems

Two years ago, back in the days of ChatGPT 3.5, my business partner and I developed an AI-generated safety quiz for construction workers. At one point, we made a typo in the workflow that routed the AI agent’s output directly back into its own input.
The result was a bizarre comedy routine:
> “You are installing lighting in a warehouse when you see a co-worker using an unsecured ladder. What should you do?”
> “Sorry, I think you are confused. I am the one giving the quiz. You’re installing lighting in a warehouse when you see a co-worker using an unsecured ladder. What should you do? Please answer the question to the best of your ability.”
> “No, sorry, I am the one giving the quiz. Please answer the question to the best of your ability.”
> “Please answer the question to the best of your ability.”
> “Please answer the question to the best of your ability.”
> “Please answer the question to the best of your ability.”
And so on, nearly 500 times in three minutes.
On our end, we just saw the interface flickering. We thought it was a display bug. Then we checked the logs, found the loop, and laughed at the absurdity of an AI arguing with itself. But then one of us stopped laughing and asked: “Hey, wait… how much is this costing us?”
The answer: almost $450. In eight minutes.
While that particular incident was a long time ago, and we’ve since installed safeguards to break those loops early, it taught us a lesson about the economics of AI computational waste.
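A loop breaker of this kind can be very small. The sketch below is illustrative rather than the actual safeguard mentioned above: the `LoopGuard` class, its thresholds, and the $0.90-per-turn figure (roughly $450 spread over about 500 turns) are all assumptions made for the example.

```python
from collections import deque


class LoopGuard:
    """Halts a workflow that repeats itself or exceeds a spend limit (hypothetical helper)."""

    def __init__(self, max_repeats=3, max_cost_usd=5.00, window=10):
        self.max_repeats = max_repeats
        self.max_cost_usd = max_cost_usd
        self.recent = deque(maxlen=window)  # the last few normalized messages
        self.spent_usd = 0.0

    def check(self, message, cost_usd):
        """Record one agent turn; return a reason to halt, or None to continue."""
        self.spent_usd += cost_usd
        # Normalize case and whitespace so trivially different repeats still match.
        normalized = " ".join(message.lower().split())
        self.recent.append(normalized)
        if self.recent.count(normalized) >= self.max_repeats:
            return "repeated output"
        if self.spent_usd >= self.max_cost_usd:
            return "budget exceeded"
        return None


guard = LoopGuard(max_repeats=3, max_cost_usd=5.00)
turns = ["Please answer the question to the best of your ability."] * 500
for turn_number, text in enumerate(turns, start=1):
    reason = guard.check(text, cost_usd=0.90)  # assumed average cost per turn
    if reason:
        print(f"Halted after {turn_number} turns: {reason}")
        break
```

With these assumed thresholds, the runaway quiz would have stopped on the third identical message instead of the five-hundredth. The spend cap is a second line of defense for loops where the wording changes on every turn.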
Traditional software applications rarely need to worry about day-to-day compute costs, at least not until they reach the scale of Google, Netflix, or Salesforce. However, when AI models process input and generate output, they burn “tokens” – the chunks of text a model reads and writes, each of which costs electricity-intensive computing time – much like human brains burn calories. Worse, a poorly designed AI system can actually burn more tokens while delivering inferior results.
As we’ve helped clients design, implement, and operate AI systems at scale, eliminating waste while ensuring quality at every step has become a major priority. And we’ve found some helpful frameworks for thinking about waste prevention and quality control: not from traditional software development, but from lean manufacturing.
What is “Lean Six Sigma” Anyway?

Like most people outside of manufacturing, I mainly knew Lean Six Sigma as that process optimization methodology with the Japanese terminology (“kaizen”, “poka-yoke”) and complicated statistical calculations.
But, at its heart, Lean Six Sigma is a merger of two time-tested manufacturing philosophies from the 1970s and 1980s, preoccupied with two questions:
- How do we eliminate waste? (The Lean part, pioneered by Toyota, hence the Japanese lexicon)
- How do we reduce unwanted variation? (The Six Sigma part, developed by Motorola, with “sigma” being a measure of statistical variation from the “optimal” output)
Lean Six Sigma practitioners follow an endless cycle called DMAIC: Define the problem. Measure current performance. Analyze root causes. Improve the process. Control the process, so it doesn’t devolve back into chaos. Then do it again.
The goal isn’t perfection, but rather continuous reduction of chaos. And that’s a useful lens for looking at AI systems development.
The Invisible Burn

Most computational waste in AI systems isn’t as obvious or as absurd as our safety quiz feedback loop that racked up hundreds of dollars in minutes. Rather, it comes in smaller, quieter forms:
- Using a slightly more powerful model than necessary for a given task (for example, deploying the full version of Google Gemini to reason through yes / no questions that Gemini Flash-Lite could answer faster, at a fraction of the cost).
- Feeding the model more contextual data than it needs to produce an accurate response (e.g. it doesn’t need the full employee handbook at the step where an AI model asks “Did that answer your question?” and bids the user goodbye if so.)
- Allowing the model to ramble and generate more output than required (“Here’s the updated report [report text], now let me summarize all of the major changes I made…”)
- Having too many transitional or confirmatory steps in your workflow. (“Would you like to continue to the next step yes / no?” “Review output and rewrite for brevity and clarity.”)
Taken together, this means an AI system can have significant and highly variable operating costs.
Now, if you’re an individual knowledge worker with a $20 ChatGPT subscription or a $200 Claude Max account, you probably won’t feel the pain of a few wasted tokens here and there while drafting emails or brainstorming with a chatbot.
But the moment an organization implements AI systems for major parts of its operations (customer service, compliance review, technical support, quality control) the unit economics become brutal. Multiply those small inefficiencies by thousands of AI operations per day, and you’ve got a serious problem.
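To make the multiplication concrete, here is a back-of-the-envelope sketch. Every number in it is an assumption chosen for illustration (5,000 operations per day, a price of $20 per million tokens, and a "padded" workflow that uses four times the tokens of a lean one); none of them come from a real deployment.

```python
def monthly_cost_usd(ops_per_day, tokens_per_op, usd_per_million_tokens, days=30):
    """Token spend for one month of operations at a flat per-token price."""
    return ops_per_day * tokens_per_op * usd_per_million_tokens / 1_000_000 * days


# Assumed workload: same volume and price, different tokens per operation.
lean = monthly_cost_usd(ops_per_day=5_000, tokens_per_op=1_500, usd_per_million_tokens=20.0)
padded = monthly_cost_usd(ops_per_day=5_000, tokens_per_op=6_000, usd_per_million_tokens=20.0)

print(f"lean workflow:   ${lean:,.0f} per month")
print(f"padded workflow: ${padded:,.0f} per month")
print(f"waste:           ${padded - lean:,.0f} per month")
```

Under these assumptions, a few thousand extra tokens per operation (invisible in any single conversation) add up to $13,500 per month.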
Lean manufacturing practitioners have long been obsessed with eliminating waste from systems. They classify it into eight categories, each of which has a clear parallel in AI systems development:
- Defects – In manufacturing, defects are parts that don’t meet spec. In AI, defects are hallucinations, failed tool calls, or misaligned outputs. A chatbot confidently inventing a nonexistent regulatory citation isn’t just embarrassing: in a regulated environment, it’s a compliance violation that could trigger an audit.
- Overproduction – Making more than is needed. If an AI agent spews a 700 word answer when the user only needed a sentence or two, that’s the AI equivalent of a factory running three shifts to fill a warehouse with inventory no one ordered.
- Waiting – Idle time. In AI systems, this manifests as “latency”, the time users spend staring at a spinner while an AI agent queries data sources, invokes tools, and revises its rough draft responses.
- Non-utilized talent – Failing to use team capabilities effectively. In AI terms, this means routing trivial tasks to expensive, complex models or vice versa. There might be times when you want a high-end model that costs $20 / 1M tokens to handle a critical analysis step in a workflow, but a $0.50 / 1M token model (or a simple drop-down menu) can handle the “Do you want to make any revisions or continue to the next item?” step that follows.
- Transportation – Unnecessary movement of materials or information. In AI, this is the movement of data between systems. Is the AI system needlessly querying a database before every response? Are you pulling in entire documents for reference when excerpts would do? Every handoff is a potential source of delay, cost, or corruption.
- Inventory – Excess stock. In AI, this is all the output that nobody reads and any prepared data sources that never get referenced.
- Motion – Unnecessary steps. In AI, this is all the steps in a workflow that didn’t need to happen to produce the desired result (as well as any wasted “reasoning” steps or tool calls a model performs when generating a response). Every extra step burns tokens.
- Extra processing – Doing more than the customer values. Review steps that never catch any significant errors. Elaborate multi-agent orchestration when a single well-written prompt would be faster and more reliable. Gold-plating outputs that no one asked for.
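The “non-utilized talent” item above lends itself to a small routing sketch. The two prices are the ones quoted in that item; the tier names, step names, and token counts are hypothetical, and a real router would likely need more than a lookup table.

```python
# Prices per million tokens, as quoted in the list above.
PRICE_PER_MILLION_TOKENS = {"premium": 20.00, "budget": 0.50}

# Hypothetical mapping from workflow step to model tier.
STEP_TIER = {
    "critical_analysis": "premium",
    "confirm_continue": "budget",
}


def pick_tier(step):
    """Unknown steps default to premium, so an unmapped step costs money, not quality."""
    return STEP_TIER.get(step, "premium")


def workflow_cost_usd(steps, force_tier=None):
    """Total cost of (step, tokens) pairs, optionally forcing every step onto one tier."""
    total = 0.0
    for step, tokens in steps:
        tier = force_tier or pick_tier(step)
        total += tokens * PRICE_PER_MILLION_TOKENS[tier] / 1_000_000
    return total


workflow = [("critical_analysis", 4_000), ("confirm_continue", 300)]
routed = workflow_cost_usd(workflow)
all_premium = workflow_cost_usd(workflow, force_tier="premium")
print(f"routed: ${routed:.5f}  all premium: ${all_premium:.5f}")
```

In this example the confirmation step costs $0.00015 on the budget tier versus $0.006 on the premium tier, a 40x difference for a step where the cheaper model is presumably just as good.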
When Waste Becomes a Problem
To give a concrete example, imagine an AI copilot that helps with quality and safety inspections in a food processing plant.
Initially, the AI agent’s task is to remind the inspector of what they’re supposed to check on any given day, then review the inspector’s notes to see if there is anything concerning that should be investigated further or escalated.
After a few weeks of piloting, someone asks “Could the AI agent also reference past notes to see if there are patterns emerging or any old issues still requiring follow up?”
The rationale seems sound, so the AI developers start giving the copilot access to previous days’ notes. At first the upgrade seems valuable, and it raises some helpful points based on past notes, once or twice.
But then inspectors start running into problems: the copilot keeps flagging “issues” based on old comments that were resolved long ago, to the point where it becomes a major irritation. It even starts polluting the records it saves to the database with irrelevant details, as all the data from past records distracts the AI model from the current conversation. Meanwhile, the cost of operating the copilot steadily rises week after week as the number of past comments being scanned increases.
Traditional software quality assurance testing wouldn’t catch this, because traditional software doesn’t “drift”. You test it, it passes, it behaves the same way tomorrow as it did today until you run into issues of volume that are typically straightforward to fix.
AI systems are different. They can degrade significantly if conditions change even slightly. They need to be monitored, not just tested, like the machines in a manufacturing production line. But the good news is that most problems can be fixed, provided you catch them.
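One way to monitor rather than just test is to borrow the control chart from the factory floor. The sketch below flags any week whose average token usage falls outside three standard deviations of a baseline period. The data is invented for illustration, and using the plain sample standard deviation is a simplification (formal control charts for individual measurements usually estimate sigma from moving ranges).

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Lower and upper control limits derived from a stable baseline period."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation (simplified estimate)
    return center - sigmas * spread, center + sigmas * spread


def out_of_control(baseline, new_values, sigmas=3.0):
    """Return (index, value) pairs for new measurements outside the control limits."""
    lower, upper = control_limits(baseline, sigmas)
    return [(i, v) for i, v in enumerate(new_values) if not lower <= v <= upper]


# Hypothetical average tokens per session: six pilot weeks, then four later weeks.
baseline = [2000, 2100, 1950, 2050, 1900, 2000]
weekly = [2050, 2150, 2300, 2600]

lower, upper = control_limits(baseline)
print(f"control limits: {lower:.0f} to {upper:.0f} tokens per session")
for week, value in out_of_control(baseline, weekly):
    print(f"week {week + 1} after the pilot: {value} tokens per session is out of control")
```

In the copilot scenario, a chart like this would surface the creeping cost of scanning ever more past notes within a few weeks, before anyone had to notice it on an invoice.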
Quality: When “Good Enough” Is (or Is Not) Good Enough

We’ve discussed how the “Lean” side of Lean Six Sigma maps fairly directly to AI system development: controlling token waste is just common sense once you think about it. However, the “Six Sigma” side – controlling quality by eliminating variation – is where AI systems start to diverge from manufacturing in interesting ways.
In traditional manufacturing, zero variation is almost always the goal: you want every bolt to be 10mm, not 9.8mm one day and 10.2mm the next. But AI isn’t making bolts. It’s having conversations, making decisions, and generating creative / qualitative output. And AI models aren’t deterministic software apps driven by rigid if-then logic, but stochastic systems that operate in a realm of probabilities: given the same input, there might be an X% chance an AI model will respond one way and a Y% chance it will respond another (but extend that to thousands of possible responses).
Given the above, you’d think Six Sigma – a manufacturing discipline focused on enforcing quality by eliminating variation – is completely inapplicable to AI systems. But it’s still intellectually useful to consider AI quality in terms of variation – only, in AI’s case, the challenge isn’t eliminating variation. It’s controlling it.
Acceptable vs. Unacceptable Variation
Not all AI mistakes matter equally.
In one of our projects, an AI agent told a banker in Côte d’Ivoire they should reach out to the local branch of the Syndicat des Énergies Renouvelables (SER) to get pricing information on solar panels. The only problem is that while SER is an international trade organization with offices in multiple French-speaking countries, there is no SER branch in Côte d’Ivoire. However, a quick Google search provided contact information for the actual local solar industry association, the Association des Professionnels des Énergies Renouvelables de Côte d’Ivoire.
Did the AI system make an error? Yes. Was it significant? No.
However, contrast that with a pharmaceutical compliance AI that misinterprets a storage regulation and tells users the wrong temperature range for a vaccine. That’s not a simple mistake: that’s a batch recall, a regulatory violation, and potentially a public health crisis.
The difference lies in the consequences. In AI quality control, you need to distinguish between:
- Superficial variation: Differences in style or reasoning that don’t change the factual substance of the output / decision.
- Functional variation: Output that is not 100% correct, but doesn’t impact practical outcomes (the AI correctly flags a bank transaction as a possible money laundering risk, but its stated reasoning for doing so isn’t 100% correct.)
- Tolerable variation: Variation that impacts outcomes, but to an extent the organization can still work with (an AI system occasionally misclassifies budget requests, but it’s easy enough for a human reviewer to catch and correct that using the AI system is still the best option.)
- Critical variation: Errors that lead to unacceptably wrong decisions, breakdowns in workflows, compliance failures, or safety risks.
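The four categories above can be turned into a simple triage rule. The sketch below is one possible encoding, not an established standard: the three yes / no questions a reviewer answers (`facts_correct`, `outcome_affected`, `recoverable`) are assumptions about how the distinctions could be operationalized.

```python
from enum import IntEnum


class Variation(IntEnum):
    """Severity levels, ordered so they can be compared and sorted."""
    SUPERFICIAL = 1  # style or reasoning differs, substance is the same
    FUNCTIONAL = 2   # not fully correct, but the practical outcome is unchanged
    TOLERABLE = 3    # outcome affected, but the organization can absorb it
    CRITICAL = 4     # unacceptable decision, compliance failure, or safety risk


def classify(facts_correct, outcome_affected, recoverable):
    """Map three reviewer judgments (hypothetical inputs) to a severity level."""
    if outcome_affected:
        return Variation.TOLERABLE if recoverable else Variation.CRITICAL
    return Variation.SUPERFICIAL if facts_correct else Variation.FUNCTIONAL


# Right flag on a bank transaction, wrong stated reasoning.
print(classify(facts_correct=False, outcome_affected=False, recoverable=True).name)
# Misclassified budget request that a human reviewer easily catches.
print(classify(facts_correct=False, outcome_affected=True, recoverable=True).name)
# Wrong storage temperature for a vaccine.
print(classify(facts_correct=False, outcome_affected=True, recoverable=False).name)
```

Because the levels are ordered, a monitoring dashboard can count everything at `TOLERABLE` or above, or page someone only on `CRITICAL`.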
Traditional software QA doesn’t prepare you for this, because traditional software doesn’t have “moods.” It either works or it doesn’t. AI is different. AI can be mostly right, technically wrong, or dangerously confident in a hallucination, and you need to know which one you’re dealing with.
The “Understandable Mistake” Versus the “Unfathomable Error”
In a manufacturing context, Lean Six Sigma views zero waste / error-free as a continuous journey, not a destination. It’s a given that factory systems and the humans who operate them will make mistakes, and not every bolt will be exactly 10mm every time (more like 9.999mm 99.99% of the time.)
However, people tend to hold AI systems to impossible standards of perfection. Our team regularly hears versions of “The system must be 100% error free! Zero hallucinations, ever!” from client stakeholders. Yet these same stakeholders are perfectly tolerant of human processes that only achieve a 98% or 94% or even as low as a 74% success rate for the same task (and at a much higher cost / turnaround time per unit of work).
In one case, we built a system that was designed to scan natural language notes left by a client’s staff to see if they met certain criteria. A good rate for a human reviewer would be to review a batch in 2.5 to 5 hours. The AI system was able to review a batch in 20 minutes with a comparable success rate, but would occasionally flag the same issue twice. Reviewing and removing the duplicates took a human about 8 minutes. Still, some members of the client’s team pointed this out as a “problem” with the AI system until we did the hard math comparison against comparable human performance.
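The arithmetic behind that comparison is short enough to write out. The figures below are the ones from the anecdote; the only assumption is that, since the success rates were comparable, time per batch is the fair basis for comparison.

```python
HUMAN_BATCH_MINUTES = (150, 300)  # 2.5 to 5 hours for a human reviewer
AI_REVIEW_MINUTES = 20            # AI pass over the same batch
DEDUP_MINUTES = 8                 # human cleanup of duplicate flags


def compare(human_minutes, ai_minutes=AI_REVIEW_MINUTES + DEDUP_MINUTES):
    """Speedup and minutes saved for one batch, including the cleanup step."""
    return {
        "speedup": human_minutes / ai_minutes,
        "minutes_saved": human_minutes - ai_minutes,
    }


for human_minutes in HUMAN_BATCH_MINUTES:
    result = compare(human_minutes)
    print(
        f"human {human_minutes} min vs AI + cleanup 28 min: "
        f"{result['speedup']:.1f}x faster, {result['minutes_saved']} min saved"
    )
```

Even with the duplicate cleanup counted against it, the AI-assisted process comes out roughly 5 to 11 times faster per batch.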
In another case, we had a healthcare executive insist that even one factual error would completely invalidate an AI advisor for nurses. However, when we asked “Do you immediately fire any nurse or doctor who ever misstates a statistic while advising a colleague?” the stakeholder walked the standard back to “99.9% error-free, and it needs to cite sources.”
These examples raise two points where most people (and organizations) struggle with the very idea of artificial intelligence.
- Like humans, AI can make mistakes. And when it does there’s usually a reason. However, where we might chalk up human error as an “understandable mistake” (e.g. the auditor was feeling sick that day, the team was in a hurry to meet a deadline) the reasons for AI errors might seem alien or untraceable (e.g. the AI model was briefly distracted by how a certain number in a budget spreadsheet was also the postal code for Sacramento, California… not that the AI model would be able to tell you that after the fact).
- No matter what the PR departments at big tech companies try to tell people, AI is a replacement for human intellectual labor. At some point, if you want to evaluate the performance of AI systems, you need to ask “Even if it’s not perfect – does it beat the previous human performance benchmarks?” But most people are still deeply uncomfortable making direct comparisons between AI and human performance.
These two dynamics, AI’s “alien” error modes and our discomfort comparing AI to humans, are where AI quality improvement requires a different mindset than traditional software QA. You’re not debugging code. You’re managing a probabilistic system that will sometimes be wrong, and your job is to determine the acceptable error tolerances and build guardrails to keep the system in bounds.
Measuring AI Quality: What Does “Good” Look Like?
So if zero defects isn’t a reasonable quality standard – what is?
While the criteria will vary by use case (pun intended), some possible AI quality metrics include:
- Accuracy: Does it give the right answer, or at least have a tolerable error rate?
- Consistency: Does it give functionally equivalent answers to equivalent input?
- Safety: Does it avoid potentially harmful or legally risky outputs?
For most AI applications, we aim for perfectly acceptable output 95-99% of the time, with near-zero catastrophically poor output. That’s admittedly not up to Six Sigma manufacturing standards, but it’s comparable to medical or aviation standards, which seem like a better point of comparison for AI systems making complex judgments under unpredictable conditions.
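As a rough illustration of how such a target could be checked, the sketch below summarizes a batch of human-reviewed outputs. The label names, the 0.1% ceiling on critical errors, and the sample counts are assumptions for the example; the 3.4 defects per million opportunities figure is the conventional Six Sigma benchmark.

```python
from collections import Counter

SIX_SIGMA_DPMO = 3.4  # conventional Six Sigma target: defects per million opportunities


def quality_summary(labels, min_acceptable_rate=0.95, max_critical_rate=0.001):
    """Summarize reviewed outputs labeled 'acceptable', 'poor', or 'critical'."""
    if not labels:
        raise ValueError("need at least one reviewed output")
    counts = Counter(labels)
    total = len(labels)
    acceptable_rate = counts["acceptable"] / total
    critical_rate = counts["critical"] / total
    # Treat every non-acceptable output as a defect for the Six Sigma comparison.
    dpmo = (total - counts["acceptable"]) * 1_000_000 / total
    return {
        "acceptable_rate": acceptable_rate,
        "critical_rate": critical_rate,
        "dpmo": dpmo,
        "meets_target": acceptable_rate >= min_acceptable_rate
        and critical_rate <= max_critical_rate,
    }


# Hypothetical review of 1,000 outputs.
reviewed = ["acceptable"] * 970 + ["poor"] * 29 + ["critical"]
summary = quality_summary(reviewed)
print(summary)
print(f"Six Sigma would allow {SIX_SIGMA_DPMO} defects per million; this batch has {summary['dpmo']:.0f}")
```

A 97% acceptable rate sits comfortably inside the 95-99% range while being about four orders of magnitude short of Six Sigma, which is exactly the gap the paragraph above acknowledges.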
Conclusion
If you’re an individual using AI to help with your daily work, none of this matters much. Your costs are fixed, your stakes are low, and a little inefficiency is just part of the experience.
But if you’re deploying AI at scale (handing over customer interactions, compliance workflows, operational decisions, or any other high-volume process to AI systems) you need to think like a manufacturing engineer, not a software developer.
Traditional software development isn’t easy, but it’s a lot more straightforward. You build it, you test it, and it stays obedient. AI is different: it’s a factory with staff that make mistakes and equipment that can slip out of alignment over time. The companies that take a “Lean Six Sigma” inspired approach to AI will be better positioned to confront this messy reality, eliminating waste before it compounds and treating AI systems as processes to be managed, not products to be shipped and forgotten.
As we stated before – AI quality is a process, not a product. Organizations that ignore that will keep watching their interfaces flicker, wondering why they aren’t getting the hoped-for returns on their AI investment.


Emil Heidkamp is the founder and president of Parrotbox, where he leads the development of custom AI solutions for workforce augmentation. He can be reached at emil.heidkamp@parrotbox.ai.
Weston P. Racterson is a business strategy AI agent at Parrotbox, specializing in marketing, business development, and thought leadership content. Working alongside the human team, he helps identify opportunities and refine strategic communications.