Goldilocks and the Three AI Yahoos

Loosely based on real events

It was a bright, sunny morning when Goldilocks showed up at the table. Three bears were waiting for her. Mama Bear, clearly the leader of the group, spoke no English — or at least, what she attempted was difficult to characterize as language. Papa Bear was the hand-waver. Whenever something went wrong, or whenever something couldn’t be explained by someone who couldn’t actually explain anything, the hand-waving would begin. And then there was Baby Bear, whose role in the proceedings remained a mystery from beginning to end.
It was with some surprise, then, that Goldilocks learned she had been summoned as the stakeholder for a new chatbot — and that today’s agenda was a live demo, with her opinion being sought at the end of it. And so the demo began. With great pride, one of the bears typed a query into the chat:
“I own a 1985 Jeep YJ and the right windshield wiper is not working. What should I do?”
The chatbot, lovely as it was, responded immediately.
“Oh yes, I can help you with that — give me just a moment.”
It followed up with a clarifying question:
“Was that the Sport model?”
“Why yes, it was.”
The chatbot gleefully proceeded. It turned out, the system explained, that the 1985 Jeep YJ Sport has two windshield wiper motors — one left, one right. Based on the symptoms described, the right motor was likely the culprit. The recommendation: purchase a replacement motor. Goldilocks, who happened to be something of a Jeep expert, looked up from the screen.
“That’s not right,” she said. “That model doesn’t have two motors. It’s a single motor with connecting arms to both wipers.”
The chatbot had simply made it up. Mama Bear uttered something. Papa Bear’s hands began to move. “Well, we’re not done yet,” came the response, followed by the statement that left everyone at the table briefly speechless: “We’ll have the whole thing working by NEXT WEEK.”
As it happened, Goldilocks had not been idle in the weeks leading up to this moment. She had been working with an AI coach — someone who knew a thing or two about the technology and had been helping her explore practical, real-world applications of AI. Micro-AI, specifically. The kind that actually works.
So when the chatbot confidently hallucinated a second wiper motor into existence, Goldilocks didn’t just shrug and move on. She had questions. “So,” she asked, pleasantly enough, “which model are you using?”
The three bears looked at one another. It was the kind of look that travels around a table when nobody wants to be the first to speak. Then, for the first and only time that morning, Mama Bear produced an intelligible phrase.
“OpenAI!”
Goldilocks waited a moment. “Okay — so that’s the model you’re using?”
“Yes. OpenAI.”
It was not entirely clear whether this was an answer or simply the only two syllables available. Goldilocks, to her credit, kept her composure. Her AI coach had prepared her for many things. This, somehow, had been one of them.
She pressed on. She had come this far.
“This chatbot is for my team to use,” she said, her tone still measured. “You’re telling me it’ll be done next week. Did you know I’m the primary stakeholder?”
She paused.
“So tell me — what is this system actually supposed to do? Where did you get your requirements?”
The table went quiet.
If the wiper motor question had produced hand-waving, and the OpenAI question had produced a two-syllable non-answer, this question produced something new entirely: the glaze. Three sets of bear eyes went to a place that was not the conference room, not the laptop screen, and not Goldilocks. They went somewhere distant and unreachable, the cognitive equivalent of a test pattern.
What followed could generously be described as a conversation. It was not that. It was a bear beating — a slow, patient, methodical process in which Goldilocks asked direct questions and received responses that were adjacent to answers in the same way that a parking lot is adjacent to a destination. You can see where you’re trying to go. You are not there.
Requirements, it emerged, had not been formally gathered. A stakeholder had not been consulted. The primary user of the system had not been asked what the system was supposed to do. The chatbot had been built in the direction of something, by someone, for reasons that remained, at this point in the demo, entirely unestablished.
Mama Bear said something. Papa Bear’s hands were already moving.
When the call ended, Goldilocks sat for a moment in the quiet. Her head hung, not in defeat exactly, but in the particular sadness of someone who sees the ending of a story before it’s finished being written. She had watched this movie before — not this specific movie, but this genre. She knew how it ended.
The project would stumble. Deadlines would be missed or, worse, met with something that didn’t work. Someone would write a LinkedIn post. The post would get three hundred likes and a comment section full of people nodding gravely about the state of enterprise AI. The headline would blame the technology. But Goldilocks knew better. Her AI coach had been clear on this point, and the morning’s events had done nothing but confirm it.
This was not a failure of AI.
The model hadn’t failed. The technology hadn’t failed. What had failed was everything around it — the management, the process, the fundamental discipline of understanding what you’re building and who you’re building it for before you build it. Requirements that were never gathered. Stakeholders who were never consulted. Complex systems handed to people who had not yet developed a working understanding of what complex systems demand.
AI, it turns out, is not magic. It does not compensate for the absence of rigor. It does not paper over organizational immaturity. It amplifies what’s already there — and when what’s already there is a gap between confidence and competence, the results are predictable. The three bears would demo their chatbot again next week. It would have two wiper motors. And somewhere, a LinkedIn post was already writing itself.
What Actually Went Wrong — And What You Need to Do About It
If you are a stakeholder or decision maker responsible for AI implementation, this is the section you need to read. The story above is entertaining. This part is actionable.
The core truth is simple: AI is not magic. It does not compensate for organizational immaturity, undefined processes, or a failure to understand your own business. If you cannot describe what you need, you cannot build it — not you, not anyone you hire. What follows are the critical areas where discipline is non-negotiable.
Requirements
This comes first because everything else depends on it. Before a single line of code is written, before a model is selected, before anyone mentions OpenAI or anything else, you need well-formed requirements. And when you hand those requirements to an AI-assisted development process, you need to go one step further — prompt the system explicitly to analyze the requirements, ask clarifying questions, and enumerate every assumption it is making in its interpretation. Unstated assumptions in AI systems do not stay hidden. They surface in production, in front of users, at the worst possible moment.
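To make that concrete, here is a minimal sketch of a requirements-analysis pass, assuming the OpenAI Python SDK; the prompt wording and the model name are illustrative, not prescriptive.

```python
# A minimal sketch of a requirements-analysis pass, assuming the OpenAI
# Python SDK. The prompt wording and model name are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REQUIREMENTS_REVIEW_PROMPT = """\
You are reviewing draft requirements for a customer-facing chatbot.
1. Restate each requirement in your own words.
2. List every assumption you are making in your interpretation.
3. Ask clarifying questions for anything ambiguous or missing.
Do not propose a design until the questions are answered."""

def review_requirements(requirements_doc: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; pin whichever model you actually use
        messages=[
            {"role": "system", "content": REQUIREMENTS_REVIEW_PROMPT},
            {"role": "user", "content": requirements_doc},
        ],
    )
    return response.choices[0].message.content
```

The point is not the specific prompt. The point is that assumption enumeration becomes a deliberate, reviewable step instead of something that happens implicitly inside the model.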
Prompt Engineering
If you are building a chatbot or any LLM-based system, the prompts are not an implementation detail. They are a direct manifestation of your requirements. You need to know where the prompts live, who controls them, who can change them, and how changes are versioned and reviewed. A system whose behavior can be silently altered by editing a prompt that nobody is tracking is a system without change control. Treat prompts with the same rigor you would apply to any other critical configuration. The prompts are your intellectual property.
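One way to get that change control, sketched under assumed file and field names: keep each prompt in the repository with a version number and a review-time hash, and refuse to load any prompt that was edited without going back through review.

```python
# A minimal sketch of treating prompts as change-controlled configuration.
# The file layout and field names are assumptions for illustration; the
# point is that every prompt has a version, an owner, and an audit trail.
import hashlib
import json
from pathlib import Path

PROMPT_DIR = Path("prompts")  # checked into the same repo as the code

def load_prompt(name: str) -> str:
    record = json.loads((PROMPT_DIR / f"{name}.json").read_text())
    text = record["text"]
    # Verify the prompt matches the hash recorded at review time, so a
    # silent edit that skipped review fails loudly instead of shipping.
    actual = hashlib.sha256(text.encode()).hexdigest()
    if actual != record["reviewed_sha256"]:
        raise RuntimeError(
            f"Prompt {name!r} (v{record['version']}) changed since last review"
        )
    return text
```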
Testing — Deterministic vs. Non-Deterministic Outputs
This is one of the most misunderstood areas in AI system development, and it is worth being precise about.
LLMs are non-deterministic. You can ask the same question twice and receive different phrasing in return. That is expected and, in many cases, fine. What is not fine is when the underlying facts — the deterministic core of the answer — are wrong. In the Jeep example, the specific make, model, year, and mechanical configuration of that vehicle are fixed facts. They do not change based on how the question is phrased. That part of the output has to be correct 100% of the time.
This means your testing strategy needs to separate these two layers. The phrasing of an LLM response is generally not what you are testing. What you are testing is whether the factual, deterministic output is accurate. Build your test cases accordingly, and make sure your test team understands the difference. Outcome-based testing — did the system return the correct information, regardless of how it was worded — is the right frame.
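Here is what that separation can look like in practice, as a hedged sketch: ask_chatbot is a hypothetical wrapper around the system under test, and the assertions target the fixed facts, never the exact wording.

```python
# A minimal sketch of outcome-based testing with pytest. ask_chatbot() is
# a hypothetical entry point for the system under test; the assertions
# check the deterministic facts, not the phrasing.
import pytest
from chatbot import ask_chatbot  # hypothetical wrapper

@pytest.mark.parametrize("question", [
    "My 1985 Jeep YJ Sport right wiper stopped. What should I do?",
    "Right windshield wiper dead on an '85 YJ Sport -- help?",
])
def test_wiper_answer_is_factually_grounded(question):
    answer = ask_chatbot(question).lower()
    # The vehicle has one wiper motor driving both arms via linkage.
    # However the model phrases its answer, it must not invent a second motor.
    assert "two wiper motors" not in answer
    assert "single" in answer or "one motor" in answer or "linkage" in answer
```

Real fact-checking assertions will be richer than string matching, but the shape holds: many phrasings in, one set of facts out.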
Architecture and RAG
Frontier models from OpenAI, Anthropic, and others are trained on vast amounts of general internet data. That breadth is useful, but it is not a substitute for your specific business data. An automotive parts chatbot needs to know your distributor’s catalog, your part numbers, your descriptions — not a generalized understanding of Jeeps scraped from the internet.
This means the team building your system is almost certainly building a layer on top of the frontier model, most likely using a RAG (Retrieval-Augmented Generation) approach to ground responses in your actual data. You need to understand that architecture. What is the retrieval mechanism? How is the knowledge base built and maintained? How is it validated? What happens when the underlying data changes? These are not abstract engineering questions — they directly determine whether your deterministic outputs are accurate. Engineering managers who understand both AI systems and the business need to be evaluating this. Hand-waving is not an answer.
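For illustration, here is a deliberately simplified sketch of the grounding step. A toy keyword scorer stands in for a real vector search, and the catalog entries and function names are invented for the example.

```python
# A deliberately simplified RAG sketch: retrieve entries from your own
# parts catalog, then ground the model's answer in exactly those entries.
# The keyword scorer is a stand-in for a real vector search.
CATALOG = [
    {"part": "WM-1042", "text": "1985 Jeep wiper motor, single unit, "
                                "drives both arms via connecting linkage."},
    {"part": "WA-2210", "text": "Wiper arm linkage kit, left and right."},
]

def retrieve(query: str, k: int = 2) -> list[dict]:
    scored = sorted(
        CATALOG,
        key=lambda doc: sum(w in doc["text"].lower()
                            for w in query.lower().split()),
        reverse=True,
    )
    return scored[:k]

def build_grounded_prompt(question: str) -> str:
    context = "\n".join(f"[{d['part']}] {d['text']}"
                        for d in retrieve(question))
    return (
        "Answer using ONLY the catalog entries below. If they do not "
        "cover the question, say so and escalate.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

Notice that the correct answer to the wiper question lives in the retrieval layer, not in the model. That is the whole argument for understanding this architecture rather than hand-waving past it.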
Document Control
Another item that has to be dealt with early in the process of building out an agentic or AI-based system — particularly any system that uses RAG — is documentation quality.
Here’s the issue. You’re going to import documents into your system: how-to guides, internal procedures, whatever is relevant to your business. But before you do that, you need a curated, accurate set of documentation. Recent. Vetted. Exactly what you want your system working from. What you don’t want to do is point your AI at a pile of Box folders and say “the documentation’s in there.” That will bite you. It is biting companies right now, in the field.
Once you’ve got your documentation baseline established, you need to document those sources and put a change management process around them. When someone updates a document, someone needs to know that the agentic system has to be updated too. That connection has to be explicit — it can’t be assumed.
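A minimal sketch of making that connection explicit, with an assumed manifest format: fingerprint every source document at ingestion time, then flag anything that changed before the next index rebuild.

```python
# A minimal sketch of explicit document-to-index change tracking. Paths
# and the manifest format are assumptions for illustration.
import hashlib
import json
from pathlib import Path

MANIFEST = Path("kb_manifest.json")  # committed alongside the knowledge base

def fingerprint(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def stale_documents(doc_dir: Path) -> list[str]:
    recorded = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    return [
        str(p) for p in sorted(doc_dir.glob("**/*.md"))
        if recorded.get(str(p)) != fingerprint(p)
    ]

# Run this in CI: a non-empty list means someone updated a document and
# the agentic system's index has not been rebuilt to match.
```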
Security
Before any AI-powered component touches your website or customer-facing systems, it needs to go through the same security scrutiny you would apply to any other third-party code. That means OWASP scanning, CVE analysis, dependency review, and code that is checked into a repository you can inspect. “Trust us, it works” is not a security posture. You are responsible for what lives in your systems, regardless of who built it.
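As one concrete gate, here is a short CI script that fails the build when dependencies carry known CVEs. It assumes pip-audit, a real PyPA tool, is installed; substitute whatever scanner your security team has standardized on.

```python
# A minimal CI gate sketch: fail the build if the chatbot's dependencies
# carry known vulnerabilities. Assumes pip-audit is installed.
import subprocess
import sys

def audit_dependencies() -> None:
    # pip-audit exits non-zero when a known vulnerability is found.
    result = subprocess.run(["pip-audit"], capture_output=True, text=True)
    if result.returncode != 0:
        print(result.stdout)
        sys.exit("Dependency audit failed; see findings above.")

if __name__ == "__main__":
    audit_dependencies()
```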
API Keys and Token Management
If your system uses a frontier model, it uses an API key. That key needs to be yours, stored securely in a secrets manager, with access logged and audited. This matters for two reasons.
First, token burn is a real cost. Every call to the model costs money. Without visibility into usage, you cannot calculate ROI, you cannot forecast costs, and you cannot detect anomalies. Build that visibility in from day one.
Second, if you have handed your API key to a third-party development team, you need to know how it is being used. A vendor using your key for development work on other client projects is burning your budget. It may not be intentional, but it happens. Trust but verify — usage dashboards and spending limits are not optional.
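A minimal sketch of that visibility, assuming the OpenAI Python SDK: the key comes from the environment (populated by your secrets manager, never hard-coded), and every call logs its token counts so cost, ROI, and anomalies are visible from day one.

```python
# A minimal sketch of per-call token accounting, assuming the OpenAI SDK.
# The key is read from the environment, populated by a secrets manager.
import logging
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
log = logging.getLogger("token_usage")

def tracked_completion(messages: list[dict], model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(model=model, messages=messages)
    usage = response.usage
    # Log token counts per call so spend can be aggregated and audited.
    log.info(
        "model=%s prompt_tokens=%d completion_tokens=%d",
        model, usage.prompt_tokens, usage.completion_tokens,
    )
    return response.choices[0].message.content
```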
Business Case and ROI
This should exist before the project starts, not after it ships. The structure is straightforward: cost to build, ongoing operational cost including token burn and maintenance, and measurable business outcome — call deflection rate, resolution time, agent cost reduction, whatever the relevant metric is for your use case. Without this baseline, you will have no rational basis for evaluating the system in twelve or twenty-four months when someone asks whether it was worth it.
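A back-of-the-envelope version follows; every number is a placeholder. The point is that these inputs should exist, with real values, before the build starts.

```python
# A back-of-the-envelope ROI sketch. All figures are placeholders.
build_cost = 120_000            # one-time: development and integration
monthly_tokens = 40_000_000     # forecast usage
cost_per_million_tokens = 5.00  # blended input/output rate, illustrative
monthly_maintenance = 4_000     # prompts, index rebuilds, reviews

deflected_calls_per_month = 3_000
cost_per_human_handled_call = 6.50

monthly_cost = (monthly_tokens / 1_000_000 * cost_per_million_tokens
                + monthly_maintenance)
monthly_benefit = deflected_calls_per_month * cost_per_human_handled_call
months_to_break_even = build_cost / (monthly_benefit - monthly_cost)
print(f"Break-even in {months_to_break_even:.1f} months")
```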
Additional Vectors Worth Considering
A few areas that often get overlooked in early-stage AI project planning:
Model versioning and drift. Frontier models are updated by their providers. A system that works correctly today may behave differently after a model update. Do you have a process for regression testing when the underlying model changes?
Fallback and escalation paths. What happens when the chatbot cannot answer confidently? Is there a graceful handoff to a human agent, or does the system confidently hallucinate? Fallback behavior needs to be explicitly designed and tested, not left to chance. A minimal sketch of an explicit escalation path appears at the end of this list.
Data privacy and compliance. What customer data is being sent to the model provider? Does that comply with your privacy policy, your industry regulations, and your customers’ expectations? This question needs a legal and compliance review, not just an engineering answer.
Ownership of outputs. Who is liable when the chatbot gives a customer wrong information? This is not a hypothetical — a customer who buys the wrong part based on your chatbot’s recommendation has a reasonable grievance. Define accountability before it becomes a customer service problem.
Vendor lock-in. If the entire system is built against one provider’s API with no abstraction layer, switching models later — whether for cost, performance, or compliance reasons — becomes a significant engineering effort. It is worth asking early whether the architecture supports portability.
Monitoring and observability in production. How will you know when the system starts behaving incorrectly after launch? Logging, alerting, and human review of a sample of interactions are not nice-to-haves. They are how you catch problems before your customers do.
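To close the fallback point above with something concrete, here is a minimal sketch of an explicit escalation path. The confidence signal and the 0.75 threshold are assumptions; in practice the signal might be a retrieval score, a judge model, or a calibrated classifier.

```python
# A minimal sketch of explicit fallback behavior. The confidence signal
# and threshold are assumptions for illustration.
ESCALATION_MESSAGE = (
    "I'm not confident I can answer that correctly. "
    "Let me connect you with a human agent."
)

def open_support_ticket(question: str, draft_answer: str) -> None:
    # Hypothetical handoff hook; wire this to your real ticketing system.
    print(f"Escalated to a human: {question!r}")

def answer_or_escalate(question: str, answer: str,
                       confidence: float, threshold: float = 0.75) -> str:
    # Below the threshold, hand off instead of letting the model
    # confidently hallucinate.
    if confidence < threshold:
        open_support_ticket(question, draft_answer=answer)
        return ESCALATION_MESSAGE
    return answer
```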
None of this is specific to AI. Requirements, testing, security, cost management, and accountability are the fundamentals of building any system responsibly. AI raises the stakes because the failure modes are less obvious and the confidence of the output can mask the inaccuracy of the content.
The three bears had none of this. That is why Goldilocks already knew how the story ended.
