Why most ML models never make it to production

A telecom company we worked with had attempted to ship a churn prediction model twice in 18 months. Both attempts ended the same way: the model performed well in evaluation, stakeholders approved it, and then it quietly stalled during productionisation and was never deployed. By the time we got involved, the data science team had lost confidence that they could ever get a model into production.

The models were not the problem. Both were technically sound, with solid evaluation metrics and clear business value. The problem was everything around the models — the infrastructure, the operations, the monitoring, the handoff to the team that would run it.

The notebook-to-production gap is real and underestimated

Data scientists are trained to build models. They are rarely trained to operate them. And in most organisations, the path from a notebook that works to a service that runs reliably in production involves a set of engineering disciplines that are quite different from the skills needed to build the model in the first place:

Feature pipeline engineering — reliably producing the same features at training time and serving time
Model serving infrastructure — an API or batch job that scores new data on a schedule
Monitoring — detecting when the model's predictions start degrading before the business feels it
Retraining triggers — knowing when and how to retrain, and how to validate the new version before replacing the old one
Rollback procedures — how to revert to a previous model version if something goes wrong
Operational documentation — runbooks so the team supporting the system doesn't need to call the data scientist at 11pm

None of these are model quality questions. All of them are engineering and operations questions. And all of them need to be answered before a model can be considered production-ready.
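To make one of these items concrete, here is a minimal sketch of what a rollback procedure can reduce to once every promoted model version is recorded. The `ModelRegistry` class and its method names are hypothetical and held in memory purely for illustration; a real setup would sit on top of whatever model registry the organisation already uses.

```python
class ModelRegistry:
    """In-memory stand-in for a model registry (hypothetical, for illustration)."""

    def __init__(self):
        self._versions = {}   # version label -> model object
        self._history = []    # promoted version labels, newest last

    def register(self, version, model):
        """Store a trained model under a unique version label."""
        if version in self._versions:
            raise ValueError(f"version {version!r} is already registered")
        self._versions[version] = model

    def promote(self, version):
        """Make a registered version the live one."""
        if version not in self._versions:
            raise KeyError(version)
        self._history.append(version)

    def rollback(self):
        """Revert to the previously promoted version and return its label."""
        if len(self._history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def live_version(self):
        return self._history[-1] if self._history else None

    def live_model(self):
        """Return the model object the scoring job should currently use."""
        if not self._history:
            raise RuntimeError("no model has been promoted yet")
        return self._versions[self._history[-1]]
```

Because the scoring job only ever asks for `live_model()`, rolling back is a change of pointer rather than a redeployment, which is the kind of property that makes a rollback quick to rehearse.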

A model that scores 0.91 AUC in evaluation and has no monitoring in production is less useful than a model that scores 0.84 AUC and has a monitoring dashboard, a retraining cadence, and a runbook. Operational readiness matters more than marginal model quality.

The training-serving skew problem

The most technically costly failure mode in ML production is training-serving skew: the features used to train the model are computed differently from the features used to score new data. This happens because data scientists compute features in notebooks using historical data, and the serving pipeline computes the same features in a different environment, potentially with different logic, different data sources, or different handling of edge cases.

The model "worked" in evaluation because evaluation used the same notebook-computed features as training. In production, the serving pipeline computes different feature values, and the model makes worse predictions — sometimes dramatically worse.

The solution is a feature store or, at minimum, a single feature computation definition that is used at both training and serving time. The feature logic should exist once, in version control, and both the training pipeline and the serving pipeline should call the same code.
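As a minimal sketch of the "feature logic exists once" idea, assuming a churn-style dataset: the `CustomerSnapshot` schema, the feature names, and the two pipeline entry points below are all hypothetical and stand in for whatever the real pipelines look like. The point is only that both paths call the same `compute_features` function.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CustomerSnapshot:
    """Raw inputs for one customer as of a given date (hypothetical schema)."""
    signup_date: date
    as_of: date
    monthly_charges: list          # floats, most recent month last
    support_tickets_90d: Optional[int]  # None when the source has no record


def compute_features(snapshot: CustomerSnapshot) -> dict:
    """Single definition of the feature logic, used by training and serving."""
    tenure_days = (snapshot.as_of - snapshot.signup_date).days
    recent = snapshot.monthly_charges[-3:]
    avg_charge_3m = sum(recent) / len(recent) if recent else 0.0
    # Edge-case handling lives here once, so both pipelines treat
    # missing data the same way.
    tickets = snapshot.support_tickets_90d
    if tickets is None:
        tickets = 0
    return {
        "tenure_days": float(tenure_days),
        "avg_charge_3m": avg_charge_3m,
        "support_tickets_90d": float(tickets),
    }


def build_training_rows(snapshots: list) -> list:
    """Training path: historical snapshots processed in bulk."""
    return [compute_features(s) for s in snapshots]


def build_serving_row(snapshot: CustomerSnapshot) -> dict:
    """Serving path: one fresh snapshot at scoring time."""
    return compute_features(snapshot)
```

With this shape, a parity test is simple: feed the same snapshot through both paths and assert that the resulting rows are equal.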

What production-ready actually means

Before we deploy any model with a client, we run it through a production-readiness checklist. Every item on this list needs to be answered before the model goes live:

Feature pipeline — does a scheduled, tested pipeline produce fresh features on the cadence the model requires?
Training-serving parity — is the feature computation logic identical at training and serving time? Can you prove it?
Scoring infrastructure — is there a serving endpoint or batch job that runs on schedule, with retry logic and failure alerting?
Offline evaluation — has the model been evaluated on a held-out dataset using the same features the serving pipeline produces?
Shadow mode — has the model run in parallel with the current process, with outputs compared but not acted on, for at least two weeks?
Monitoring dashboard — is there a dashboard showing prediction distribution, feature drift, and business outcome metrics?
Drift alerts — are there automated alerts for when prediction distribution drifts beyond a defined threshold?
Retraining trigger — is there a defined criterion for when to retrain, and a process for doing so?
Rollback procedure — is there a documented, tested way to revert to the previous model version in under 30 minutes?
Operational runbook — is there a document that tells the on-call engineer what to do when the scoring pipeline fails at 2am?

In the two failed churn model attempts at the telecom company, none of these had been addressed. The data science team had focused entirely on model quality. The production engineering work had been assumed to be someone else's responsibility — but nobody had been assigned to it.
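To illustrate the drift alerts item from the checklist, here is one possible sketch using the population stability index (PSI) to compare a baseline sample of prediction scores against a recent production sample. PSI is our choice for the example, not the only option. The function names, the equal-width binning, and the 0.2 threshold (a commonly quoted rule of thumb) are assumptions to be tuned per model.

```python
import math


def population_stability_index(baseline, current, n_bins=10, eps=1e-6):
    """PSI between two samples of scores in [0, 1], using equal-width bins."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")

    def bin_fractions(scores):
        counts = [0] * n_bins
        for s in scores:
            idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
            counts[idx] += 1
        total = len(scores)
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / total, eps) for c in counts]

    expected = bin_fractions(baseline)
    actual = bin_fractions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


PSI_ALERT_THRESHOLD = 0.2  # assumed value, not taken from the checklist


def drift_alert(baseline, current, threshold=PSI_ALERT_THRESHOLD):
    """Return (psi, should_alert) for a batch of production scores."""
    psi = population_stability_index(baseline, current)
    return psi, psi > threshold
```

In practice the baseline would be the score distribution from the evaluation or shadow-mode period, and `drift_alert` would run after each scoring batch and feed whatever alerting channel the on-call team already watches.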

Starting earlier on the operational questions

The most effective change we made in that engagement was to front-load the operational questions. In the first week, before any model training, we asked: how will this model be scored? On what schedule? By what system? Who will be on call when it breaks? What does a monitoring dashboard need to show for a business stakeholder to feel confident?

These questions forced conversations that would otherwise have happened at the end, when it was too late to change the architecture to support them. We designed the feature pipeline before we designed the model. We defined the monitoring requirements before we chose the evaluation metrics. We wrote the rollback procedure before we wrote the training code.

The churn model shipped on the third attempt, six weeks after we started. It has been running in production for eight months. The data science team now runs the same production-readiness checklist on every new model before they present it to stakeholders.

The organisational fix

The technical solutions (feature stores, model registries, monitoring platforms) are well understood. The harder problem is organisational: who is responsible for the gap between a model notebook and a production service?

In organisations where this works well, the answer is clear. There is a team or function (sometimes called ML engineering, sometimes data engineering, sometimes platform) that owns the production infrastructure and treats productionising models as a first-class engineering discipline. The data science team is responsible for model quality. The ML engineering function is responsible for everything else.

In organisations where ML production consistently fails, the responsibility for productionisation is unclear, assumed, or nobody's formal job. The data scientists are expected to do it but not resourced to do it well. Or the data engineering team is expected to do it but was never consulted during the model design phase.

The fix is structural. Assign explicit ownership. Define the handoff point. Make operational readiness a gate, not an afterthought.
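One way to turn operational readiness into a gate rather than a conversation is to encode the checklist and refuse to promote a model while any item is open. The sketch below is illustrative only: the field names mirror the ten checklist items, and `ReadinessChecklist`, `unmet_items` and `assert_ready_for_production` are hypothetical names rather than part of any particular tool.

```python
from dataclasses import dataclass, fields


@dataclass
class ReadinessChecklist:
    """One boolean per checklist item; everything starts as not done."""
    feature_pipeline: bool = False
    training_serving_parity: bool = False
    scoring_infrastructure: bool = False
    offline_evaluation: bool = False
    shadow_mode: bool = False
    monitoring_dashboard: bool = False
    drift_alerts: bool = False
    retraining_trigger: bool = False
    rollback_procedure: bool = False
    operational_runbook: bool = False


def unmet_items(checklist: ReadinessChecklist) -> list:
    """Names of checklist items that have not been signed off."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]


def assert_ready_for_production(checklist: ReadinessChecklist) -> None:
    """Raise if any item is still open; intended to run as a deployment gate."""
    missing = unmet_items(checklist)
    if missing:
        raise RuntimeError("Not production-ready; unmet items: " + ", ".join(missing))
```

Called from a deployment script or CI job, a gate like this makes the handoff point explicit: the model cannot go live until someone has signed off each item.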
