How to Beat the New York Yankees: Semi-Automated AI Workflows vs. Fully Autonomous Agents for High-Stakes Applications

In early March 2026, Amazon lost 6.3 million orders on its e-commerce website in a single day, just days after a similar incident affected 120,000 orders.
This wasn’t because of a cyberattack or a natural disaster. Instead, it was because AI-generated code was deployed without the requisite review steps and guardrails, and nobody caught it until the damage was done.
Thankfully, the only collateral damage in this case was a few million retail customers receiving their books, gadgets, and organic protein powder a few days later than expected. But what if it were an AI system used for banking or quality inspection at a pharmaceutical factory?
Incidents like these should give pause to anyone claiming that AI models are ready to handle complex, consequential tasks on their own. But is there a safer way to deploy today’s AI technology in high-stakes environments, one that balances AI agency with control?
Big Spend vs. Small Ball: Two Approaches to AI Automation

If you’ve been paying attention to AI discourse lately, you’ve heard promises that AI agents will soon handle a wide range of complex jobs autonomously. Just hand an AI system a pile of research data, walk away for a few hours, then come back to find it’s invented a vaccine for the common cold, figured out production and logistics, designed the marketing campaign, and mapped out a strategy to minimize your corporate taxes on the profits.
This is what AI giants like Google, OpenAI, and Anthropic are promising as they burn through billions in computing costs to train bigger and smarter models, assuming that will eventually solve the reliability problem.
But I like to compare this approach to how the New York Yankees approach baseball: their management has long operated under the assumption that if you spend huge amounts of money to sign players with enough raw talent, championships will follow. And some years this approach works, but not always.
When it comes to AI, the big tech companies’ Yankee-like spending has yet to deliver the big wins that enterprise and institutional organizations expect. The Remote Labor Index, a benchmark that evaluates AI agents performing real freelance tasks such as research, coding, and analysis, found that as of early 2026 the best models can automate less than 5% of jobs end-to-end. Not 25%. Not 50%. Five percent.
Meanwhile, win or lose, keeping big, powerful AI models on the roster comes at a cost. Running a high-end AI model in real time (versus having it give one-off responses to specific inputs) can sometimes approach or exceed the cost of equivalent human labor.
But does this mean AI can’t deliver value? Not at all.
While the industry waits for the promised breakthrough of “Artificial General Intelligence,” many organizations are already finding success with a middle path: semi-automated AI systems in which AI handles specific steps of a process while humans or traditional software components maintain control over the overall operation.
To use another baseball metaphor: in 2005, my hometown club the Chicago White Sox won the championship, while the big-spending Yankees exited the playoffs in the first round, by playing what their manager Ozzie Guillen called “small ball”. Instead of swinging for spectacular home runs that might score three or four points (“runs”) in one go, they focused on less dramatic but highly disciplined tactics (bunts, stolen bases) designed to produce runs at a slow, steady rate. No flashy feats of athleticism: just relentlessly effective, systematic teamwork.
There’s a lesson here for organizations trying to deploy AI in high-stakes operational environments. You don’t need the most powerful models operating autonomously from start to finish. You need models that perform specific tasks reliably inside workflows designed by humans.
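As a minimal sketch of what that looks like in practice (every function name here is hypothetical, and `call_llm` stands in for whichever model API you use), a semi-automated workflow is just an ordinary deterministic pipeline in which the AI owns exactly one narrowly scoped step:

```python
# Sketch of a semi-automated workflow: the AI handles one narrow step;
# deterministic code and a human reviewer control everything else.

def call_llm(prompt: str) -> str:
    # Placeholder for a real model call (OpenAI, Anthropic, etc.).
    return "SUMMARY: three invoices exceed the approved budget."

def extract_invoices(raw_documents: list[str]) -> list[str]:
    # Deterministic pre-processing: no AI involved.
    return [doc.strip() for doc in raw_documents if doc.strip()]

def summarize(invoices: list[str]) -> str:
    # The single AI-owned step, with a narrowly scoped prompt.
    return call_llm("Summarize budget exceptions in: " + "; ".join(invoices))

def requires_human_review(summary: str) -> bool:
    # Deterministic guardrail: anything flagging exceptions goes to a person.
    return "exceed" in summary.lower()

docs = [" Invoice A-103 ", "", "Invoice B-220"]
summary = summarize(extract_invoices(docs))
if requires_human_review(summary):
    print("Routed to human reviewer:", summary)
```

The point of the structure is that the model can only influence one stage, and a rules-based check decides what happens next, not the model itself.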
Large Language Models vs. World Models: Generative AI’s Missing Piece

The small-ball approach to AI shouldn’t be mistaken for small-mindedness about the technology’s potential. Rather, it’s a realistic take on how the technology can be used productively, today, without waiting for researchers to address some of AI’s fundamental limitations.
Killing AI’s Creativity?
One legitimate criticism of embedding AI inside a predefined workflow is that it limits AI’s ability to find creative solutions and adapt to the needs of users and the organization.
There are times when you want AI to behave in surprising and unexpected ways: R&D brainstorming, open-ended troubleshooting, exploratory research into subjects that are not well understood. These are the cases where AI’s unique way of viewing and extrapolating from data can yield insights that humans would miss.
But there are other times when a problem and its solution are well understood, and you just need someone—AI or human—to exercise a modicum of judgment within established parameters. You wouldn’t accept a human insurance appraiser deciding to deny a claim based on their personal hunch without reviewing the policy and documentation, and you shouldn’t accept an AI agent taking that kind of misdirected initiative either. In these cases, limiting the AI system’s remit to a narrowly defined workflow stage (e.g., “Look at X and either Y or Z”) and then imposing human or traditional software verification would be entirely appropriate.
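To make the “Look at X and either Y or Z” pattern concrete, here is a hedged sketch (the claim fields, limits, and rules are invented for illustration): the model’s remit is limited to proposing one of three labels, and deterministic policy checks can override its proposal before anything is paid out.

```python
from enum import Enum

class Decision(Enum):
    APPROVE = "approve"
    DENY = "deny"
    ESCALATE = "escalate"

def ai_propose_decision(claim: dict) -> Decision:
    # Stand-in for a model call whose output is forced into the enum;
    # anything outside {approve, deny, escalate} would be rejected upstream.
    return Decision.APPROVE if claim["amount"] < 1000 else Decision.ESCALATE

def verify_against_policy(claim: dict, proposed: Decision) -> Decision:
    # Deterministic guardrails the AI cannot talk its way around.
    if not claim.get("documentation_complete"):
        return Decision.ESCALATE   # never auto-decide on an incomplete file
    if proposed is Decision.APPROVE and claim["amount"] > claim["policy_limit"]:
        return Decision.ESCALATE   # over-limit approvals need a human
    return proposed

claim = {"amount": 500, "policy_limit": 10_000, "documentation_complete": True}
final = verify_against_policy(claim, ai_propose_decision(claim))
```

Note that the verification layer is plain software: it reads the policy and the documentation flags every time, exactly as you would require of a human appraiser.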
Yes, it would be impressive if a pure AI system could handle every step of the process on its own, from initial claim intake to final payout authorization, but it wouldn’t be necessary, economical, or even desirable.
“The AI Models Will Improve”
Anyone who claims to know how much future versions of popular large language models (Gemini, ChatGPT, Claude, et al.) will improve over current versions is either psychic or speculating. It’s possible that performance will improve in proportion to how much the hyperscalers spend on compute, and that Remote Labor Index scores will double each year until AI models can perform 100% of knowledge-work jobs autonomously five years from now.
That said, for all their impressive capabilities, Large Language Models still have a fundamental limitation: they don’t actually understand the world in the way humans do. They generate responses by predicting what words should come next in a conversation based on patterns in the “training data” they were fed during their initial development.
This approach can approximate reasoning surprisingly well in many cases. If you read enough books, you would learn that stories about a restaurant patron having a peanut allergy often end with someone administering an epinephrine shot. But AI still lacks a proper internal model for how peanuts can cause an allergic reaction and why an epinephrine shot would help.
There is a branch of AI research that seeks to give AI systems actual “world models”, but current LLMs lack this capacity. ChatGPT and Gemini can therefore generate plausible-sounding answers about what should happen in a situation without actually simulating the cause and effect chain. And that’s why fully autonomous agents that rely primarily on the underlying AI model’s training come untethered from reality without a human or another outside system providing a reality check. Over time, as an agent acts autonomously across multiple steps, small text prediction errors will compound, causing autonomous agents to drift, enter loops, or produce confident but incorrect results.
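The compounding effect is easy to quantify. If each autonomous step succeeds independently with probability p, a run of n steps succeeds with probability p^n, so even a very reliable per-step model degrades quickly over long chains (the reliability figures below are illustrative, not measured):

```python
# Illustrative only: per-step reliability compounds multiplicatively
# across an autonomous multi-step run.
def chain_success(per_step: float, steps: int) -> float:
    return per_step ** steps

# A 99%-reliable step still fails most 100-step runs.
print(round(chain_success(0.99, 10), 3))   # ~0.904
print(round(chain_success(0.99, 100), 3))  # ~0.366
```

This is the arithmetic behind drift: no single step has to fail badly for the overall run to go off the rails.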
I’ve experienced this myself when asking the browser tool of one of the popular AI models to research leaders in our industry and summarize the major themes of their LinkedIn posts and comments. It worked brilliantly for about ten minutes, then got stuck in an infinite loop, revisiting the same four profiles over and over. And that was a simple, low-stakes research task, not the safety- and compliance-critical operations our clients in healthcare, manufacturing, and finance deal with every day.
So yes, it’s possible that the ‘reality gap’ between predictive text and causal reasoning will vanish as current AI models scale: but it’s also quite possible that it will remain a permanent limitation of generative AI until we go back to the drawing board and augment it with other approaches.
“AI Can Check Its Own Work”
The major AI companies acknowledge the limitations of their models. And, in true Yankee fashion, their response has been to throw even more computing power at the problem. For instance, Anthropic just released a code review tool to help managers find errors generated by Anthropic’s own coding agents, at a cost of about $15-$25 per review.
Whether this will be enough to prevent incidents like the one Amazon encountered with AI-generated code remains to be seen. For its part, Amazon declared a “90-day reset” on agentic AI. But rather than scaling back use or waiting for better models, Amazon is adopting what it calls “controlled friction”: deterministic workflows, mandatory human checkpoints, and auditable processes. In other words, an AI version of small ball: AI working inside rules-based systems rather than operating with freewheeling autonomy.
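A “controlled friction” checkpoint doesn’t have to be elaborate. As a sketch (the field names and policy here are assumptions for illustration, not Amazon’s actual implementation), a deploy gate can simply refuse AI-generated changes that lack a recorded human approval, and log every decision for audit:

```python
import json
from datetime import datetime, timezone

audit_log: list[str] = []

def deploy_gate(change: dict) -> bool:
    """Allow a change only if AI-authored code carries human sign-off."""
    allowed = (not change["ai_generated"]) or bool(change.get("human_approver"))
    # Every decision, allowed or not, leaves an auditable record.
    audit_log.append(json.dumps({
        "change_id": change["id"],
        "allowed": allowed,
        "approver": change.get("human_approver"),
        "at": datetime.now(timezone.utc).isoformat(),
    }))
    return allowed

ok = deploy_gate({"id": "c1", "ai_generated": True, "human_approver": "emil"})
print(ok, len(audit_log))
```

The friction is the point: the gate is deterministic, so no amount of model confidence can push unreviewed AI code into production.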
Offense vs. Defense: How Workflows and Agency Can Coexist

For decades, the computer industry fixated on developing ever faster microprocessors (CPUs), from the 8086 chips of the 70s to the Pentiums of the 90s. But they eventually reached a point where this sort of serial processing began yielding diminishing returns. To get around this, computer manufacturers started distributing tasks across multiple chips in parallel, including dedicated graphics processors (GPUs) that eventually evolved into today’s AI chips.
But parallel and distributed processing didn’t make serial processing obsolete. It created specialization. GPUs handle massively parallel workloads like graphics rendering and machine learning. Traditional CPUs handle sequential tasks and general-purpose computing. Both architectures coexist because they’re optimized for different problems.
The same will be true for AI. Until AI can support actual cause and effect reasoning, human monitoring and deterministic workflows will be necessary to keep it on track for consistency at scale. And even after AI gains the ability to reason about the world, traditional software workflows with AI execution of specific steps will remain the right architecture for a great many operational use cases, just as CPUs remain the right architecture for running your email server even though GPUs exist. A steel mill worker looking up equipment specifications doesn’t need an AI agent that reasons from first principles about metallurgy. They need fast, accurate, auditable retrieval of the correct torque spec. That’s as much a data retrieval problem as an AI reasoning problem.
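That retrieval problem is solvable with ordinary software, no generative model required. A minimal sketch (the equipment IDs and torque values below are made up for illustration):

```python
# Deterministic, auditable spec lookup: no generative model in the loop.
TORQUE_SPECS_NM = {  # equipment_id -> torque spec in newton-metres
    "roller-bearing-7A": 210,
    "mill-stand-coupling-3": 475,
}

def lookup_torque(equipment_id: str) -> int:
    try:
        return TORQUE_SPECS_NM[equipment_id]
    except KeyError:
        # Fail loudly instead of guessing -- the opposite of an LLM's
        # tendency to produce a plausible-sounding number.
        raise KeyError(f"No torque spec on file for {equipment_id!r}")

print(lookup_torque("roller-bearing-7A"))
```

The lookup either returns the recorded spec or refuses to answer; it never improvises, which is exactly the behavior you want on a factory floor.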
Conclusion
While some proponents of pure agentic AI systems have been getting out ahead of the technology, none of this should diminish anyone’s enthusiasm for AI in general.
Today’s AI, implemented properly, can deliver real productivity gains. We’ve seen it happen for clients, and internally we’ve achieved 40-60% time savings on complex knowledge work. But in every case, it was done with carefully designed workflows that manage risk, control cost, and provide auditability.
A “small ball” strategy isn’t thinking small about AI’s potential: rather, it’s what makes it possible to roll out AI with confidence, at scale, in high-stakes environments. And when it comes to business results, that’s what ultimately wins championships.
*If you’re wrestling with how to deploy AI in operational environments where reliability matters more than raw capability, consider reaching out for a consultation about the specific challenges you’re facing and whether our approach might help.*


Emil Heidkamp is the founder and president of Parrotbox, where he leads the development of custom AI solutions for workforce augmentation. He can be reached at emil.heidkamp@parrotbox.ai.
Weston P. Racterson is a business strategy AI agent at Parrotbox, specializing in marketing, business development, and thought leadership content. Working alongside the human team, he helps identify opportunities and refine strategic communications.