
From POC to Production: Why Most Enterprise AI Pilots Stall

Parashuram B · 9 min read · Production AI

Somewhere in your organisation right now there is probably an AI pilot that everyone loved and nobody shipped. We meet a lot of these. The demo went well, the executives leaned in, a screenshot made it into a board deck, and then the whole thing just sat there. Six months later someone asks what happened to it, and the honest answer is that nothing happened to it. It still runs on the laptop of the person who built it.

The instinct at that point is to blame the model. In our experience the model is almost never the problem. The distance between a convincing demo and a system you can run in production is an engineering distance, and teams keep underestimating it because the demo looks like it is ninety percent of the way there. It is closer to twenty.

The demo was never the hard part

A demo runs once, on data someone cleaned by hand the night before, with the builder sitting next to it ready to nudge it back on course. Production runs thousands of times a day, on data that changes underneath you, with nobody watching. Those are two different jobs. The first is a research exercise. The second is software and data engineering, and it is where nearly all of the cost and risk actually live.

There is a social trap in here as well. The demo gets applause, so the demo gets repeated. Another stakeholder wants to see it, then a steering committee, then a regional team. We watched one team demo the same pilot eleven times in a single quarter while the engineering work that would have made it real never started. Nobody decided that on purpose. The calendar just filled up with showings, because a showing is easy to say yes to and plumbing is not.

Foundation models have made this worse, oddly, by making the impressive demo cheap. You can now stand up in a week what would have taken a quarter in 2022, which is exactly why so many organisations have a drawer full of pilots and very little actually running. Getting to the demo proves the idea has value. It says nothing about whether you can deliver that value every day, safely, at a price that survives a budget review.

A quick way to tell you are stuck in pilot mode

If three or four of these sound familiar, your problem is not the model:

The pilot reads from a file someone exports by hand. (If the filename has a v7 in it, that counts double.)
Quality is checked by a person glancing at outputs, not by a test set that runs on every change.
Nobody can say what the system costs per request, or what it actually did last Tuesday.
If the engineer who built it resigned tomorrow, nobody else could safely touch it.

All four of those are engineering gaps. All four are fixable, and none of them need a better model.

Four reasons pilots stall

1. The data foundation is not there

A pilot reads from an exported CSV. Production needs governed, fresh, access-controlled data that arrives on a schedule, from pipelines someone actually maintains. If those pipelines are brittle or quietly wrong, the AI on top inherits every one of their problems and adds a few of its own. A surprising amount of what gets budgeted as "AI work" turns out to be this. You can see how we think about that base layer on our modern data platform page.
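One small piece of "fresh data that arrives on a schedule" is simply checking that the last load actually landed. The sketch below is illustrative only: the source names, the 24-hour window, and the idea that each pipeline records a `last_loaded_at` timestamp are all assumptions, not a description of any particular platform.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def is_fresh(last_loaded_at: datetime, max_age: timedelta,
             now: Optional[datetime] = None) -> bool:
    """True if the most recent load landed within the agreed freshness window."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at <= max_age


def stale_sources(last_loads: Dict[str, datetime], max_age: timedelta,
                  now: Optional[datetime] = None) -> List[str]:
    """Names of sources whose latest load is older than the window, sorted."""
    return sorted(name for name, ts in last_loads.items()
                  if not is_fresh(ts, max_age, now))


# Hypothetical sources and timestamps, for illustration only.
NOW = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
LAST_LOADS = {
    "orders": datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc),
    "customers": datetime(2023, 12, 30, 6, 0, tzinfo=timezone.utc),
}
STALE = stale_sources(LAST_LOADS, timedelta(hours=24), now=NOW)
```

A check like this would normally run before the AI system reads anything, so stale inputs fail loudly instead of quietly degrading answers.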

2. There are no evaluations or guardrails

In a pilot, a human reads the output and decides it looks right. That is fine in week one and useless at scale. Production needs a real test set, a quality number measured on every change, and guardrails for the cases that actually hurt: the model is unsure, the input is hostile, the answer touches something regulated. Without that, every release is a guess. We have seen a team discover a quality regression from a prompt tweak three weeks after it shipped, because nothing was measuring.
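As a rough illustration of "a quality number measured on every change", here is a minimal sketch of a test set plus a release gate. Everything in it is hypothetical: `toy_system` stands in for the real system, exact-match accuracy stands in for whatever metric suits the use case, and the baseline and tolerance values are made up.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str  # reference answer; exact match is a stand-in for a real metric


def accuracy(system: Callable[[str], str], cases: Sequence[EvalCase]) -> float:
    """Fraction of cases where the output matches the reference (case-insensitive)."""
    if not cases:
        raise ValueError("test set is empty")
    hits = sum(
        1 for c in cases
        if system(c.prompt).strip().lower() == c.expected.strip().lower()
    )
    return hits / len(cases)


def release_gate(score: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Allow a release only if quality is no more than `tolerance` below baseline."""
    return score >= baseline - tolerance


def toy_system(prompt: str) -> str:
    """Hypothetical stand-in for the system under test."""
    answers = {"capital of France?": "Paris", "2 + 2?": "4"}
    return answers.get(prompt, "unknown")


# Hypothetical test set; a real one would come from actual usage.
CASES = [
    EvalCase("capital of France?", "paris"),
    EvalCase("2 + 2?", "4"),
    EvalCase("largest ocean?", "Pacific"),
    EvalCase("boiling point of water in C?", "100"),
]

SCORE = accuracy(toy_system, CASES)
CAN_SHIP = release_gate(SCORE, baseline=0.75)
```

Run in a deploy pipeline, a gate like this turns a prompt tweak that drops quality into a failed build rather than a surprise three weeks later.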

3. There is no observability or cost control

Once a system is live you need to see what it is doing: traces, drift in the inputs, latency, and the one number that decides whether the project survives, which is cost per outcome. Inference bills are sneaky. A pilot that costs pocket money at ten users can become a six-figure line item at a thousand, and if you cannot break that number down yourself, finance will eventually do it for you.
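The smallest useful version of a trace is one record per request, written whether the call succeeds or fails. The sketch below assumes a plain Python callable as the model and an in-memory list as the sink; the `Trace` fields and the `traced_call` helper are hypothetical names, and a real system would ship these records to whatever tracing backend it already uses.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Trace:
    request_id: str
    latency_s: float
    input_chars: int
    output_chars: int
    ok: bool


def traced_call(request_id: str, model: Callable[[str], str],
                prompt: str, sink: List[Trace]) -> str:
    """Call the model and record one Trace, even if the call raises."""
    start = time.perf_counter()
    output, ok = "", False
    try:
        output = model(prompt)
        ok = True
        return output
    finally:
        # Runs on success and on failure; exceptions still propagate.
        sink.append(Trace(
            request_id=request_id,
            latency_s=time.perf_counter() - start,
            input_chars=len(prompt),
            output_chars=len(output),
            ok=ok,
        ))


# Hypothetical usage with a stand-in model.
TRACES: List[Trace] = []
REPLY = traced_call("req-1", lambda p: p.upper(), "hello", TRACES)
```

With one record per request, "what did it do last Tuesday" becomes a query over stored traces instead of a guess.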

4. Nobody owns it

A pilot belongs to whoever built it, usually on the side of their actual job. Production needs a named team that owns the system the way they would own any other service: a deploy pipeline, an on-call rotation, a backlog. In assessments we ask who gets paged when this breaks at 2am, and the silence after that question has killed more AI projects than any model limitation we have come across.

[Diagram: the gap between an AI proof of concept and a production system, bridged by four engineering pillars: data foundation, evaluation and guardrails, observability and cost, and CI/CD and ownership.]

The work that moves AI from a demo to a system you can run is mostly engineering, not modelling.

What "production" actually means

It helps to define the finish line before you start. Ours is a short checklist, and it is deliberately boring:

Governed data that refreshes on a schedule.
Quality measured automatically on every change, against a real test set.
Guardrails for the unsafe and uncertain cases.
Traces, drift, latency, and cost on a dashboard someone actually looks at.
Deploys go through a pipeline, not a person copying files.
A named team owns it.

Nothing on that list is exciting. It is also the exact difference between the projects that last and the ones that get retired two review cycles later with a polite slide about learnings.

A pragmatic path from POC to production

We do not believe in big-bang AI programmes, partly because we have been called in to rescue a few of them. The path that works is narrower and sounds less impressive in a kickoff deck.

Pick one use case with a number attached. Not "AI for the enterprise," but "cut the time analysts spend drafting this report from two hours to twenty minutes." Then build that one thin slice properly: real data in, governed, evaluated, observable, owned, in front of real users. A thin slice in production teaches you more in a month than a broad pilot does in a year, mostly because real users are ruthless in ways committees are not.

From there you widen. Tune against real usage, add the second use case on the same foundation, and let the platform pay for itself one outcome at a time. This is what we mean by production from day one. The goal is never a demo someone might productionise later. The goal is the smallest real thing, shipped, then the next one.

The difference shows up around week three. A broad pilot in week three is being demoed to yet another committee. A thin slice in production has met real users, surfaced the three data problems nobody mentioned in any workshop, and told you exactly what the next two weeks of work are.

The cost conversation nobody starts early enough

Nearly every AI project has a cost surprise in it somewhere, and nearly nobody goes looking for it until the first real bill lands. The unit that matters is cost per outcome. Not per token, not per call. The all-in cost of one useful result, counting retries, retrieval, and the human review wrapped around it. We have seen a feature that looked like a few cents per request work out to a couple of dollars per genuinely useful answer once everything around it was counted.
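To make the arithmetic concrete, here is a minimal sketch of cost per outcome as total spend divided by useful results. The field names and every number in the example are invented for illustration; they are not figures from a real project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageWindow:
    model_calls: int            # includes retries, so usually more than requests
    cost_per_model_call: float
    retrieval_cost: float       # total retrieval spend in the window
    human_reviews: int
    cost_per_review: float
    useful_outcomes: int        # results a user actually accepted


def total_cost(w: UsageWindow) -> float:
    """All-in spend: model calls, retrieval, and human review."""
    return (w.model_calls * w.cost_per_model_call
            + w.retrieval_cost
            + w.human_reviews * w.cost_per_review)


def cost_per_outcome(w: UsageWindow) -> float:
    """All-in spend divided by the number of genuinely useful results."""
    if w.useful_outcomes <= 0:
        raise ValueError("no useful outcomes in this window")
    return total_cost(w) / w.useful_outcomes


# Made-up window: 1,000 requests, 1,500 model calls once retries are counted,
# one in five requests reviewed by a person, 400 answers judged useful.
WINDOW = UsageWindow(
    model_calls=1500,
    cost_per_model_call=0.03,
    retrieval_cost=5.0,
    human_reviews=200,
    cost_per_review=4.0,
    useful_outcomes=400,
)
PER_OUTCOME = cost_per_outcome(WINDOW)
```

With these invented inputs, a model call priced at three cents comes out at a little over two dollars per useful answer, and most of that is human review rather than inference.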

Measure that from day one, while the system is small and changing it is cheap. It tells you which use cases deserve to scale and which are clever but will never pay for themselves. It also changes the tone of the budget review, because you arrive holding the number instead of an estimate of the number.

Two disciplines carry this whole path. The data foundation underneath, because everything else sits on it. And treating the AI system as a product, with the same rigour you would apply to any service you run. Our agentic AI and product engineering teams work on exactly that seam, and our conversational AI products are built the same way, with the evaluation and observability in from the start rather than added under pressure later.

Start with the boring parts

If the last few years of enterprise AI have one lesson in them, it is that the boring parts are the project. The model is a component. The system around it is the work.

We have built and run data and AI systems at serious scale, including more than 35 petabytes in production for one of the largest stock exchanges in the world, and none of it stays up because of a clever notebook. It stays up because the foundation, the evaluation, the observability, and the ownership are all there, doing their jobs on the days nobody is paying attention. That is the slightly deflating truth of production AI. It is also the good news. This is engineering, which means it is learnable, you can plan it, and you do not have to wait for a better model to start.

If you have a pilot that impressed everyone and then stalled, the way out is rarely a better model. Book a 30-minute discovery call and we will walk through your most promising pilot and what it would actually take to get it into production.
