The Most Expensive Part of AI Is the Prompt You Didn’t Bother To Write

Feb 24, 2026

The best writing rooms don’t happen in writing rooms. They happen at a bar at 11pm when five opinionated writers are three drinks in, arguing over the same draft. One catches the plot hole. One catches the bad dialogue. One catches the joke that’s technically funny but kills the pacing. Nobody gets the whole picture alone. The final script is better than any single person at the table could have produced.

That’s also the best description I’ve found for how to actually use AI well – except that’s not how most people use it. Most people are sitting at that bar alone, scribbling “make something good” on a cocktail napkin, sliding it to the bartender, and leaving a one-star review when they don’t get the drink they wanted.

I spend more time than I should on Reddit watching people declare AI models are broken. “GPT deleted my entire codebase.” “Claude refused to do what I asked.” “Gemini hallucinated a lawsuit that doesn’t exist.” And every time, buried about four comments deep, someone asks the question nobody wants to answer: “What was your prompt?”

*crickets*

There’s a phrase people love to say when they want to sound pragmatic about AI: garbage in, garbage out. It’s old. It’s unsexy. It’s also the closest thing we have to a law of physics in this space. The part that keeps surprising me is how many “AI power users” repeat that phrase while continuing to feed models prompts that look like they were typed with one thumb at a red light.

And I’m not talking about casual use. I’m talking about high-stakes work – code review gates, implementation checkpoints, security-sensitive pathways, decisions that either prevent damage… or politely approve it. Every vague prompt that produces bad code becomes tech debt you won’t discover until it bites you in production six months from now. Every contradictory instruction that forces a rerun – and then a re-rerun – burns tokens, burns hours, and burns the subscription you’re paying for whether the output is useful or not.

Over the last two weeks, I took on a full foundation refactor of Artwell, the Story Engine platform I’m building. Fourteen phases, dozens of implementation slices, fourteen days straight – the kind of rebuild where you either get the architecture right or you get to do it again in six months. Because the stakes were high and the surface area was enormous, I decided to run three frontier models as a review panel across the entire process, from the initial planning review through every meaningful checkpoint along the way. Not a vibe check. Not cherry-picked screenshots. A repeated, evidence-backed panel as the work moved from plan to implementation to remediation. Dozens of verdicts. Reruns when blockers were disputed. Retrospective tracking when reality proved who was right and who was merely confident.

What fell out of that process wasn’t just a ranking. It was a sharper view of what most “model evaluations” actually are: prompt evaluations in disguise. When the instructions were coherent, the panel looked brilliant. When the instructions contradicted themselves, even the best models began behaving like perfectly obedient interns following a bad spec – faithfully, catastrophically, and without apology.

One of the cleanest examples was also one of the most banal: a prompt pack that required mandatory reads while simultaneously enforcing contamination rules that prohibited reading them. The model didn’t hallucinate. It didn’t get confused. It did exactly what it was told to do – and then hard-stopped on a procedural blocker because the human who wrote the instructions created an impossible situation.
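A contradiction like that one is mechanical enough to catch before a prompt pack ever reaches a model. What follows is a minimal Python sketch of the idea, not a description of the tooling behind this refactor: the `PromptPack` structure, its field names, and the file paths are all hypothetical. It assumes prohibited reads are expressed as glob patterns and reports every mandatory read that a prohibition also matches.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class PromptPack:
    """Hypothetical shape of a prompt pack: what must and must not be read."""
    mandatory_reads: list[str] = field(default_factory=list)
    prohibited_patterns: list[str] = field(default_factory=list)


def find_read_conflicts(pack: PromptPack) -> list[tuple[str, str]]:
    """Return (path, pattern) pairs where a mandatory read is also prohibited."""
    conflicts = []
    for path in pack.mandatory_reads:
        for pattern in pack.prohibited_patterns:
            if fnmatchcase(path, pattern):
                conflicts.append((path, pattern))
    return conflicts


# Illustrative pack: these paths are made up for the example.
pack = PromptPack(
    mandatory_reads=["docs/plan.md", "reviews/prior_verdict.md"],
    prohibited_patterns=["reviews/*"],  # assumed contamination rule: no prior reviews
)

conflicts = find_read_conflicts(pack)
for path, pattern in conflicts:
    print(f"unsatisfiable: '{path}' is mandatory but blocked by '{pattern}'")
```

A check like this only catches literal path conflicts. Contradictions written in prose still need a human to read the pack end to end before sending it.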

That should feel familiar to anyone who’s ever handed someone two conflicting priorities and then acted surprised when they froze. Tell your direct report “move fast” and “don’t break anything” without defining the boundary, and watch what happens. The only difference with AI is that the failure happens in language first – which makes it easier to dismiss as “model weirdness” instead of what it actually is: governance debt.

The Other Kind of Failure

There’s also a second prompt failure mode that’s almost the opposite of contradiction, and it’s sneakier because it produces beautiful green dashboards.

Prompts that are too strict. Too narrow. Too “paint-by-numbers.”

When the guardrails are a millimeter apart, models will stay perfectly between them, report everything is green, and never once mention the cliff right next to the track. If your prompt only asks, “Does the existing checklist pass,” the model can honestly answer “yes” while completely missing that the checklist itself failed to ask the right questions. That’s not the model being dumb – that’s the model being obedient.

I’ve seen this happen in a particularly irritating way: you can almost watch the model notice something odd during its reasoning – a strange invariant, a suspicious edge path, a “wait, why is that allowed?” moment – and then omit it from the final report because the prompt never explicitly gave permission to include “extra” observations. The model isn’t hiding it. It’s following your instructions like a watchdog you accidentally trained to wag at intruders.

And the missed defect doesn’t vanish. It ships. It hides in your codebase, passing every check you built, until some future Tuesday when it surfaces as a production incident and you’re suddenly paying emergency rates to fix something a better prompt would have caught for free.
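One way to give a model that permission is to build it into both the prompt and the report format. The Python sketch below is an assumption-laden illustration: `ReviewReport`, `OBSERVATION_CLAUSE`, and the rule that any open observation turns the report non-green are all hypothetical choices, not something taken from the review panel itself.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewReport:
    """Hypothetical report shape with an explicit slot for off-checklist findings."""
    checklist_results: dict[str, bool]
    off_checklist_observations: list[str] = field(default_factory=list)

    def is_green(self) -> bool:
        # Assumed rule: green only when every checklist item passes
        # AND the reviewer flagged nothing outside the checklist.
        return all(self.checklist_results.values()) and not self.off_checklist_observations


# Hypothetical clause that grants permission to report "extra" observations.
OBSERVATION_CLAUSE = (
    "After the checklist, list anything that looked wrong or surprising, "
    "even if no checklist item covers it. Write 'none' if nothing did."
)


def with_observation_clause(prompt: str) -> str:
    """Append the permission clause unless the prompt already contains it."""
    if OBSERVATION_CLAUSE in prompt:
        return prompt
    return f"{prompt.rstrip()}\n\n{OBSERVATION_CLAUSE}"


report = ReviewReport(
    checklist_results={"tests pass": True, "schema validated": True},
    off_checklist_observations=["retry path skips the auth check"],
)
print(report.is_green())  # False: the checklist passed, but an observation is open
```

The point of the separate field is that a passing checklist and an empty observation list are two different claims, and the dashboard should not collapse them into one.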

So What’s the Right Balance?

Prompt quality isn’t just “be comprehensive.” It’s matching how much freedom the prompt grants to the job the review has to do.

Some reviews should be surgical. You want a model to prove a specific invariant and treat anything else as noise. That’s the right move when you’re validating a security promise, or locking a contract, or verifying a remediation. In those moments, freedom is the enemy and precision is the product.

Other reviews benefit from breathing room – permission to look outside the planned lane, to ask “what changed during implementation,” to flag weirdness even if it’s not on the checklist, and to challenge assumptions the planning stage might have missed. That’s how you catch the things that slip through, not because anyone was careless, but because real systems evolve in ways planning documents can’t perfectly predict.

And this is where the models come in – because each one failed in ways that illuminate a different kind of prompt problem.

The Review Panel

Across roughly thirty tracked review cycles – planning reviews, checkpoint gates, reruns, the whole messy parade – the overall effectiveness ranking came out as:

1. GPT-5.3-codex-xhigh
2. GPT-5.2-xhigh
3. Claude Opus-4.6

If you stop at the ranking, you’ll miss what matters, because the story is really about failure patterns and role-fit.

GPT-5.3 is the most reliable sentinel – fast, cross-stack realistic, and consistently first to catch the quiet defects that make a system appear to work until it doesn’t. Its weakness is the cleanest illustration of the prompt contradiction problem: give it conflicting instructions and it will hard-stop every time. That isn’t a model defect. It’s what happens when you write an unsatisfiable contract and then blame the interpreter.

GPT-5.2 is the deep defect hunter – the model that obsesses over control flow, schema validation, and security leakage at the level where things look fine from a distance and fail spectacularly up close. It produced some of the highest-value unique catches in the entire dataset, finding blocker-class bugs the other two models cleared without hesitation. I’ll be honest: I spent weeks hoping the data would let me drop this model from the panel just to speed things up. Its response times could run three to five times longer than the others. The data didn’t cooperate. The catches were too important.

Opus-4.6 is the strongest architecture and process synthesizer in the group – the one you want reading the whole packet and telling you whether the system makes sense as a system, not just whether a diff compiles. Where it fell short was first-pass blocker severity. It can be, in a very human way, too reasonable. It gives the benefit of the doubt. That’s a great trait in a colleague. It’s a risky trait in a gatekeeper. In several cycles, Claude said “Ready” while the other models caught real, later-validated bugs. The prompt gave it permission to pass, and it took it.

Every model had moments of clarity and brilliance. Every model had moments of complete failure. And the failures didn’t follow a simple pattern – the one that caught a critical security gap in one cycle missed a durability blocker in another.
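Mixed results like these are easier to reason about when every finding is logged and later marked as confirmed or not. The sketch below shows one possible shape for such a ledger in Python. The `Finding` record, the `scoreboard` function, and the model names are hypothetical, and the entries are placeholder data, not results from this refactor.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    cycle: int
    model: str
    claim: str
    validated: bool  # did later evidence confirm the claim?


def scoreboard(findings: list[Finding]) -> dict[str, tuple[int, int]]:
    """Map each model to (validated findings, unconfirmed findings)."""
    confirmed = Counter(f.model for f in findings if f.validated)
    unconfirmed = Counter(f.model for f in findings if not f.validated)
    models = sorted(set(confirmed) | set(unconfirmed))
    return {m: (confirmed[m], unconfirmed[m]) for m in models}


# Placeholder records only: these are NOT the results from the refactor.
ledger = [
    Finding(1, "model_a", "durability gap on restart", validated=True),
    Finding(1, "model_b", "possible race in worker pool", validated=False),
    Finding(2, "model_b", "secret leaked into error log", validated=True),
    Finding(2, "model_a", "missing index", validated=False),
]

for model, (hits, misses) in scoreboard(ledger).items():
    print(f"{model}: {hits} validated, {misses} unconfirmed")
```

A tally like this says nothing about severity, so a single blocker-class catch can outweigh several minor ones. It is a starting point for the retrospective, not a replacement for it.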

Why the Answer Isn’t “Pick the Best One”

This is why the best approach wasn’t mono-model. It was a panel with roles.

The strongest configuration was GPT-5.3 and GPT-5.2 as the primary technical gate pair – speed and durability coverage on one side, adversarial depth on the other – with Claude as the coherence and verification seat that forces the work to make sense and confirms closure is real. Not just different prompts. Different brains. These are fundamentally different LLMs trained on different data, and part of the value is that they genuinely see different things – the way three writers at that bar will catch three different problems in the same draft because they read with different instincts.

Think of it as separation of concerns applied to judgment. In a writing room, the person who spots plot holes isn’t always the person who fixes pacing – and you don’t force them to be. Detection and validation are different cognitive tasks. You wouldn’t ask the same engineer to both find the bugs and certify the release. You shouldn’t ask the same model to do both, either.
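That separation can be made explicit in how verdicts are combined. Here is a rough Python sketch under an assumed decision rule: the two detection seats can block, and the synthesis seat has to confirm closure before anything is called ready. The seat labels, the `Verdict` type, and the rule itself are illustrative assumptions, not the exact gate logic used on the panel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    seat: str  # "sentinel", "depth", or "synthesis" (hypothetical labels)
    blockers: tuple[str, ...] = ()
    closure_confirmed: bool = False  # only meaningful for the synthesis seat


DETECTION_SEATS = {"sentinel", "depth"}


def panel_decision(verdicts: list[Verdict]) -> str:
    """Combine seat verdicts: detection seats block, the synthesis seat certifies."""
    by_seat = {v.seat: v for v in verdicts}
    missing = (DETECTION_SEATS | {"synthesis"}) - set(by_seat)
    if missing:
        return "incomplete: missing " + ", ".join(sorted(missing))

    # Any blocker from either detection seat stops the gate.
    blockers = [b for seat in sorted(DETECTION_SEATS) for b in by_seat[seat].blockers]
    if blockers:
        return "blocked: " + "; ".join(blockers)
    if not by_seat["synthesis"].closure_confirmed:
        return "blocked: closure not confirmed"
    return "ready"


verdicts = [
    Verdict("sentinel"),
    Verdict("depth", blockers=("unvalidated schema on import path",)),
    Verdict("synthesis", closure_confirmed=True),
]
print(panel_decision(verdicts))  # blocked: unvalidated schema on import path
```

Note that a confident "Ready" from the synthesis seat cannot override a blocker from a detection seat here. That asymmetry is the design choice being illustrated.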

And the prompt design for each seat was different. The sentinel seat got tight, specific instructions: prove this invariant, flag this class of defect, stop on this condition. The depth seat got wider scope: look for what the checklist missed, challenge assumptions, interrogate edge paths. The synthesis seat got a holistic mandate: does this system make sense end-to-end, and is closure real?
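The three seat mandates above can be written down as data rather than retyped per run. This is a minimal Python sketch, assuming a hypothetical `Seat` record and `build_prompt` helper. The mandate strings paraphrase the descriptions above, and the `allow_off_checklist` flags are my guess at how each seat's scope might be encoded, not the actual prompt packs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seat:
    name: str
    mandate: str
    allow_off_checklist: bool  # assumption: whether the seat may report outside its scope


# Hypothetical seat definitions; wording is illustrative, not the real prompts.
SEATS = {
    "sentinel": Seat(
        name="sentinel",
        mandate="Prove this invariant, flag this class of defect, stop on this condition.",
        allow_off_checklist=False,
    ),
    "depth": Seat(
        name="depth",
        mandate="Look for what the checklist missed, challenge assumptions, interrogate edge paths.",
        allow_off_checklist=True,
    ),
    "synthesis": Seat(
        name="synthesis",
        mandate="Does this system make sense end-to-end, and is closure real?",
        allow_off_checklist=True,
    ),
}


def build_prompt(seat: Seat, artifact: str) -> str:
    """Assemble a role-specific prompt from a seat definition and the thing under review."""
    scope = (
        "Report anything suspicious, even outside the checklist."
        if seat.allow_off_checklist
        else "Treat anything outside the stated scope as out of bounds."
    )
    return (
        f"Role: {seat.name}\n"
        f"Mandate: {seat.mandate}\n"
        f"Scope: {scope}\n"
        f"Artifact: {artifact}"
    )


for seat in SEATS.values():
    print(build_prompt(seat, "phase-07 checkpoint diff"))
    print("---")
```

Keeping the seats in one table makes it obvious when two of them have drifted into the same mandate, which is the "same prompt for everyone" failure in a quieter form.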

When I matched the prompt to the role and the role to the model’s strengths, the panel was remarkably effective. When I used the same prompt for everyone, I got noise from the fast reviewer, missed defects from the synthesizer, and procedural gridlock from the deep hunter. Same models. Different outcomes. The variable was the prompt.

The Part People Keep Skipping

And that brings me back to the part people keep skipping because it feels annoyingly “process-y”:

Writing a good prompt is not typing. It’s engineering.

It requires you to know what you actually want, resolve contradictions before you outsource reasoning, and decide whether you’re trying to validate a known checklist or invite the model to challenge the checklist itself. It forces you to define what counts as a blocker and what counts as follow-up. It exposes gaps in your own thinking because the model will walk exactly where you pointed – and if you pointed into fog, it will give you fog-shaped certainty right back.

The next year of AI won’t be won by the teams with the fanciest model access. Everyone has access now. The advantage is shifting to teams who build systems around models – workflows that treat prompting as an engineering artifact, panels that separate detection from validation, and prompt packs that don’t contain landmines or blinders.

In other words, teams who stop treating prompts like a text field and start treating them like a spec.

I started this process thinking I was evaluating models. Two weeks later, staring at thirty cycles of evidence, I realized the models had been evaluating me the whole time. Every false alarm traced back to a contradictory instruction I wrote. Every missed defect traced back to a prompt that wasn’t brave enough to ask the real question. Every time the panel disagreed, it was because I’d given three different brains permission to read the same instructions three different ways.

Garbage in, garbage out. Turns out it’s not just a law for machines.

