MLOps Guide: Production ML in 2025
Deploy and manage ML models in production. Real deployment patterns, monitoring strategies, and lessons from building 50+ production ML systems.
You built a machine learning model. It works great in your notebook. Accuracy is 94%. You're ready to ship it.
Then reality hits. How do you deploy this? How do you monitor it? What happens when performance drops? How do you update it without breaking everything?
This is where MLOps comes in. MLOps is the practice of deploying, monitoring, and maintaining machine learning models in production. We've set up MLOps pipelines for dozens of companies. Here's what actually works.
What Is MLOps?
MLOps stands for Machine Learning Operations. It's the set of practices that help you:
- Deploy models to production reliably
- Monitor model performance over time
- Retrain and update models safely
- Track model versions and experiments
- Manage the full ML lifecycle
Think of it as DevOps for machine learning. DevOps helps you ship code. MLOps helps you ship models.
The problem: Building a model is maybe 20% of the work. Getting it to production and keeping it working? That's the other 80%.
The solution: MLOps gives you the tools and processes to handle that 80% systematically.
Why MLOps Matters
Most ML projects fail in production. Not because the model is bad. Because the infrastructure around it breaks.
Common failures:
- Models work in development but fail in production
- Performance degrades over time (model drift)
- Updates break existing systems
- No visibility into what's happening
- Can't reproduce results
What MLOps fixes:
- Reliable deployment pipelines
- Continuous monitoring and alerts
- Version control for models and data
- Automated retraining workflows
- Rollback capabilities
We've seen companies lose months of work because they didn't have MLOps. We've also seen teams ship models in days because they did.
The MLOps Lifecycle
MLOps covers the entire lifecycle of a model, from development to retirement.
1. Development
This is where you build and experiment with models. You're in Jupyter notebooks, trying different algorithms, tuning hyperparameters.
Tools:
- Jupyter notebooks
- Experiment tracking (MLflow, Weights & Biases)
- Version control (Git)
- Local development environments
What to track:
- Model code and configurations
- Training data versions
- Hyperparameters
- Metrics and results
- Environment dependencies
2. Training
Once you have a model that works, you need to train it reliably and reproducibly.
Key practices:
- Automated training pipelines
- Data versioning
- Reproducible environments
- Experiment tracking
- Model versioning
Example workflow:
- New data arrives
- Trigger training pipeline
- Train model with tracked parameters
- Evaluate on test set
- Compare to previous models
- Register if better
3. Deployment
Getting your model into production where it can make real predictions.
Deployment patterns:
Batch inference:
- Run predictions on schedule
- Process large datasets
- Relaxed latency requirements
- Example: Daily customer churn predictions
Real-time inference:
- Predictions on demand
- Low latency required
- API endpoints
- Example: Fraud detection on transactions
Edge deployment:
- Model runs on device
- No network required
- Example: Mobile app recommendations
What you need:
- Model serving infrastructure
- API endpoints
- Load balancing
- Health checks
- Rollback capabilities
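A health check can be as simple as an endpoint reporting whether the model is loaded. A minimal sketch with FastAPI; the route name and readiness logic are just conventions:

from fastapi import FastAPI, Response

app = FastAPI()
model = None  # in a real service, loaded at startup

@app.get("/health")
def health(response: Response):
    # Load balancers poll this; return 503 until the model is ready
    if model is None:
        response.status_code = 503
        return {"status": "unavailable"}
    return {"status": "ok"}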
4. Monitoring
Once deployed, you need to watch what's happening. Models degrade over time.
What to monitor:
Model performance:
- Prediction accuracy
- Latency and throughput
- Error rates
- Resource usage
Data quality:
- Input data distribution
- Missing values
- Outliers
- Schema changes
Model drift:
- Concept drift (the relationship between inputs and the target changes)
- Data drift (the input distribution changes)
- Performance degradation
Infrastructure:
- CPU, memory, disk usage
- API response times
- Error rates
- Request volumes
Example alert: Your fraud detection model's accuracy drops from 94% to 87% over two weeks. You get an alert. You investigate. Turns out the transaction patterns changed. Time to retrain.
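Data drift is often the earliest warning sign. Here's a minimal sketch of a drift check using a two-sample Kolmogorov-Smirnov test from scipy, assuming you keep a reference sample of each feature from training time:

import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    # A small p-value means the live feature distribution likely
    # differs from the training-time reference
    statistic, p_value = ks_2samp(reference, live)
    return p_value < alpha

# Example: flag drift if this week's transaction amounts no longer
# look like the amounts the model was trained on
# drifted = detect_drift(train_amounts, recent_amounts)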
5. Retraining
Models need updates. New data arrives. Patterns change. Performance degrades.
When to retrain:
- Scheduled (daily, weekly, monthly)
- Performance drops below threshold
- New data available
- Significant data drift detected
Retraining workflow:
- Trigger retraining (manual or automatic)
- Train new model version
- Evaluate on holdout set
- Compare to current production model
- Deploy if better
- Rollback if worse
Automation: Set up pipelines that retrain automatically when conditions are met. Saves time and keeps models fresh.
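A minimal sketch of such a trigger check; the accuracy floor and staleness window here are illustrative:

from datetime import datetime, timedelta

def should_retrain(current_accuracy: float, drift_detected: bool,
                   last_trained: datetime,
                   accuracy_floor: float = 0.90,
                   max_age: timedelta = timedelta(days=7)) -> bool:
    # Retrain if accuracy dropped, drift was flagged, or the model is stale
    return (
        current_accuracy < accuracy_floor
        or drift_detected
        or datetime.utcnow() - last_trained > max_age
    )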
6. Retirement
Eventually, models become obsolete. They need to be retired.
When to retire:
- Replaced by better model
- Business requirements changed
- No longer needed
- Too expensive to maintain
Retirement process:
- Stop serving predictions
- Archive model artifacts
- Document retirement reason
- Update monitoring (remove alerts)
- Clean up infrastructure
MLOps Tools and Platforms
You have options, from open-source tools to managed platforms.
Experiment Tracking
MLflow:
- Open-source
- Tracks experiments, models, artifacts
- Model registry
- Deployment tools
- Works with any framework
Weights & Biases (W&B):
- Cloud-based
- Experiment tracking
- Model versioning
- Team collaboration
- Free tier available
Neptune:
- Experiment tracking
- Model registry
- Team collaboration
- Integrates with popular frameworks
Model Serving
TensorFlow Serving:
- Serves TensorFlow models
- High performance
- REST and gRPC APIs
- Version management
TorchServe:
- Serves PyTorch models
- REST APIs
- Model versioning
- Multi-model serving
KServe (formerly KFServing):
- Kubernetes-native
- Supports multiple frameworks
- Auto-scaling
- Canary deployments
Seldon Core:
- Kubernetes-based
- A/B testing
- Multi-armed bandits
- Advanced routing
Managed Platforms
AWS SageMaker:
- End-to-end ML platform
- Training, deployment, monitoring
- Fully managed
- Integrates with AWS services
Google Vertex AI:
- Unified ML platform
- AutoML capabilities
- Model serving
- Monitoring and explainability
Azure Machine Learning:
- Complete ML lifecycle
- MLOps pipelines
- Model registry
- Deployment options
Databricks:
- Unified analytics platform
- MLflow integration
- Model serving
- Feature store
Open-Source MLOps Stacks
Kubeflow:
- Kubernetes-native
- End-to-end pipelines
- Model serving
- Experiment tracking
- Steeper learning curve
MLflow:
- Experiment tracking
- Model registry
- Model serving
- Works with any infrastructure
Prefect / Airflow:
- Workflow orchestration
- Pipeline management
- Scheduling
- Not ML-specific but widely used
Building Your MLOps Pipeline
Here's how we typically set up MLOps for companies.
Step 1: Version Control
Start with version control for everything.
What to version:
- Model code
- Training scripts
- Configuration files
- Data schemas
- Environment files
Tools:
- Git for code
- DVC (Data Version Control) for data
- MLflow for model artifacts
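DVC pins a dataset to a Git revision so you can read back the exact bytes a model was trained on. A small sketch using DVC's Python API (the path and tag are hypothetical):

import dvc.api

# Read the version of the training data tagged v1.0 in Git,
# even if data/raw/train.csv has changed since
with dvc.api.open("data/raw/train.csv", rev="v1.0") as f:
    raw_csv = f.read()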
Example structure:
project/
  models/
    train.py
    predict.py
  data/
    raw/
    processed/
  config/
    training.yaml
    serving.yaml
  notebooks/
  tests/
Step 2: Experiment Tracking
Track all your experiments. You'll thank yourself later.
What to track:
- Hyperparameters
- Metrics (accuracy, F1, etc.)
- Training data version
- Model artifacts
- Environment info
Setup:
import mlflow

mlflow.set_experiment("fraud_detection")

params = {"learning_rate": 0.01, "epochs": 100}

with mlflow.start_run():
    # Log parameters
    mlflow.log_params(params)

    # Train model (train_model is your own training function)
    model = train_model(params)

    # Log metrics
    mlflow.log_metric("accuracy", 0.94)
    mlflow.log_metric("f1_score", 0.91)

    # Log the trained model as an artifact
    mlflow.sklearn.log_model(model, "model")
Step 3: Automated Training
Set up pipelines that train models automatically.
Pipeline steps:
- Load and validate data
- Preprocess data
- Train model
- Evaluate model
- Register if better
Example with Prefect:
from prefect import flow, task
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

ACCURACY_THRESHOLD = 0.90  # register only if the new model clears this bar

@task
def load_data():
    # Load training data (synthetic here; swap in your real data source)
    X, y = make_classification(n_samples=1_000, random_state=42)
    return train_test_split(X, y, test_size=0.2, random_state=42)

@task
def train_model(X_train, y_train):
    # Train model
    return RandomForestClassifier(random_state=42).fit(X_train, y_train)

@task
def evaluate_model(model, X_test, y_test):
    # Evaluate on the held-out test set
    return {"accuracy": accuracy_score(y_test, model.predict(X_test))}

@task
def register_model(model):
    # Register the model, e.g. in the MLflow model registry
    ...

@flow
def training_pipeline():
    X_train, X_test, y_train, y_test = load_data()
    model = train_model(X_train, y_train)
    metrics = evaluate_model(model, X_test, y_test)
    if metrics["accuracy"] > ACCURACY_THRESHOLD:
        register_model(model)
Step 4: Model Deployment
Deploy models to serve predictions.
For batch inference:
- Scheduled jobs
- Process data in batches
- Write results to database
For real-time inference:
- API endpoints
- Load balancing
- Auto-scaling
- Health checks
Example API with FastAPI:
from fastapi import FastAPI
import mlflow
import pandas as pd

app = FastAPI()

# Load version 1 of the registered model from the MLflow model registry
model = mlflow.sklearn.load_model("models:/fraud_detection/1")

@app.post("/predict")
def predict(transaction: dict):
    # Convert the JSON payload into the feature frame the model expects
    features = pd.DataFrame([transaction])
    # predict_proba returns [P(not fraud), P(fraud)] per row
    probability = model.predict_proba(features)[0][1]
    return {"fraud_probability": float(probability)}
Step 5: Monitoring
Set up monitoring for everything.
What to monitor:
- Prediction latency
- Error rates
- Model performance (if you have labels)
- Data drift
- Resource usage
Example monitoring:
import time
from prometheus_client import Counter, Histogram

# app, model, and pd are defined as in the serving example above;
# Prometheus metrics need a name and a help string
prediction_latency = Histogram(
    "prediction_latency_seconds", "Time spent serving a prediction"
)
prediction_errors = Counter(
    "prediction_errors_total", "Number of failed prediction requests"
)

@app.post("/predict")
def predict(transaction: dict):
    start = time.time()
    try:
        features = pd.DataFrame([transaction])
        probability = model.predict_proba(features)[0][1]
        prediction_latency.observe(time.time() - start)
        return {"fraud_probability": float(probability)}
    except Exception:
        prediction_errors.inc()
        raise
Step 6: Automated Retraining
Set up pipelines that retrain models automatically.
Triggers:
- Scheduled (daily, weekly)
- Performance threshold
- Data drift detected
- New data available
Workflow:
- Check if retraining needed
- Train new model
- Evaluate
- Compare to production
- Deploy if better
- Rollback if worse
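Here's a sketch of the compare-and-promote step using the MLflow model registry. The model name and metric are illustrative, and the stage-based workflow shown is the classic one (newer MLflow releases favor aliases over stages):

import mlflow
from mlflow.tracking import MlflowClient

def promote_if_better(candidate_run_id: str, model_name: str = "fraud_detection"):
    client = MlflowClient()

    # Metric logged when the candidate run was evaluated
    candidate_acc = client.get_run(candidate_run_id).data.metrics["accuracy"]

    # Accuracy of the model currently serving in Production, if any
    prod_versions = client.get_latest_versions(model_name, stages=["Production"])
    prod_acc = (
        client.get_run(prod_versions[0].run_id).data.metrics["accuracy"]
        if prod_versions else 0.0
    )

    # Register and promote only if the candidate wins
    if candidate_acc > prod_acc:
        version = mlflow.register_model(f"runs:/{candidate_run_id}/model", model_name)
        client.transition_model_version_stage(model_name, version.version, stage="Production")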
Common MLOps Patterns
Different use cases need different patterns.
Pattern 1: Simple Batch Pipeline
Use case: Daily predictions on historical data
Setup:
- Scheduled training job
- Batch inference job
- Results to database
- Basic monitoring
Tools:
- Cron or scheduler
- Simple scripts
- Database for results
Example: Daily customer churn predictions. Train weekly. Predict daily. Store results in database.
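A minimal batch-inference sketch along those lines; the tables, columns, and connection string are placeholders:

import mlflow
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@host/db")  # placeholder DSN
model = mlflow.sklearn.load_model("models:/churn/Production")

# Score current customers and append the results for downstream use
customers = pd.read_sql("SELECT * FROM customer_features", engine)
customers["churn_probability"] = model.predict_proba(
    customers.drop(columns=["customer_id"])
)[:, 1]
customers[["customer_id", "churn_probability"]].to_sql(
    "churn_predictions", engine, if_exists="append", index=False
)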
Pattern 2: Real-Time API
Use case: Low-latency predictions on demand
Setup:
- Model serving API
- Load balancer
- Auto-scaling
- Real-time monitoring
Tools:
- FastAPI or Flask
- Kubernetes or cloud functions
- Monitoring tools
Example: Fraud detection API. Predictions in <100ms. Handles 1000 requests/second.
Pattern 3: A/B Testing
Use case: Testing new models against current
Setup:
- Two model versions
- Traffic splitting
- Performance comparison
- Gradual rollout
Tools:
- Seldon Core
- KServe
- Custom routing
Example: New recommendation model. 10% traffic to new model. Compare metrics. Roll out if better.
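If you don't have a service mesh doing the routing, deterministic hashing on a stable ID is a simple way to split traffic, since each user consistently hits the same variant. A sketch matching the 10% split above:

import hashlib

def pick_variant(user_id: str, new_model_share: float = 0.10) -> str:
    # Hash the user ID into [0, 1); stable across requests, so a given
    # user always sees the same model for the duration of the test
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return "candidate" if bucket < new_model_share else "production"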
Pattern 4: Continuous Training
Use case: Models that retrain automatically
Setup:
- Automated retraining pipeline
- Performance monitoring
- Auto-deployment
- Rollback on failure
Tools:
- MLflow
- Kubeflow
- Managed platforms
Example: Fraud detection model. Retrains weekly. Auto-deploys if better. Alerts on degradation.
Best Practices
What we've learned from 50+ implementations.
1. Start Simple
Don't over-engineer. Start with the basics:
- Version control
- Basic monitoring
- Simple deployment
- Manual retraining
Add complexity as you need it.
2. Monitor Everything
You can't fix what you can't see. Monitor:
- Model performance
- Data quality
- Infrastructure
- Business metrics
3. Automate Gradually
Start manual. Automate what you do repeatedly. Don't automate everything at once.
4. Version Everything
Code, data, models, configs. Version it all. You'll need to reproduce results.
5. Test Before Deploying
Test models like you test code:
- Unit tests for preprocessing
- Integration tests for pipelines
- Performance tests for serving
- A/B tests in production
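For example, a unit test for a preprocessing function might look like this; the preprocess module and its contract are illustrative:

import pandas as pd
from preprocessing import preprocess  # hypothetical shared module

def test_preprocess_fills_missing_values():
    raw = pd.DataFrame({"amount": [10.0, None, 30.0]})
    features = preprocess(raw)
    # The model must never see NaNs at serving time
    assert not features.isna().any().any()

def test_preprocess_is_deterministic():
    raw = pd.DataFrame({"amount": [10.0, 20.0]})
    pd.testing.assert_frame_equal(preprocess(raw), preprocess(raw))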
6. Plan for Rollback
Things break. Have a way to roll back quickly. Keep previous model versions ready.
7. Document Decisions
Why did you choose this model? What were the trade-offs? Document it. Future you will thank you.
8. Start with Batch
Real-time is harder. Start with batch inference. Move to real-time when you need it.
Common Pitfalls
Things that go wrong and how to avoid them.
Pitfall 1: Training-Serving Skew
Problem: Model works in training but fails in production.
Cause: Different data preprocessing, missing features, environment differences.
Solution:
- Use same preprocessing code
- Log inputs and outputs
- Test with production-like data
- Monitor data distributions
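The most reliable fix is bundling preprocessing and model into a single artifact so training and serving can't diverge. A sketch with an sklearn Pipeline:

from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Preprocessing travels with the model: log and serve this one object,
# and the same transforms run at training and inference time
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])

# pipeline.fit(X_train, y_train)
# mlflow.sklearn.log_model(pipeline, "model")  # serve the whole pipeline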
Pitfall 2: No Monitoring
Problem: Model performance degrades and you don't know.
Solution:
- Set up monitoring from day one
- Alert on performance drops
- Track data distributions
- Monitor business metrics
Pitfall 3: Manual Everything
Problem: Retraining takes days. Deployments are risky.
Solution:
- Automate training pipelines
- Automate deployments
- Use CI/CD for models
- Test before deploying
Pitfall 4: No Version Control
Problem: Can't reproduce results. Don't know which model is running.
Solution:
- Version code, data, models
- Use model registries
- Tag everything
- Document versions
Pitfall 5: Ignoring Data Quality
Problem: Model fails because input data is bad.
Solution:
- Validate inputs
- Monitor data quality
- Handle missing values
- Check for drift
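At the API boundary, schema validation catches bad inputs before they reach the model. A sketch with Pydantic; the field names and bounds are illustrative:

from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()

class Transaction(BaseModel):
    amount: float = Field(gt=0)  # must be positive
    merchant_id: str
    hour_of_day: int = Field(ge=0, le=23)

@app.post("/predict")
def predict(transaction: Transaction):
    # FastAPI rejects invalid payloads with a 422 before
    # the model ever sees them
    features = transaction.model_dump()  # Pydantic v2; use .dict() on v1
    ...  # build the feature frame and call the model as before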
Real-World Examples
Here's how we've set up MLOps for different companies.
Example 1: E-commerce Recommendations
Requirements:
- Real-time product recommendations
- Update daily with new products
- Handle 10M+ requests/day
Setup:
- Batch training pipeline (daily)
- Real-time serving API
- A/B testing framework
- Performance monitoring
Tools:
- MLflow for tracking
- FastAPI for serving
- Kubernetes for orchestration
- Prometheus for monitoring
Result: Recommendations update daily. API serves 10M+ requests with 50ms average latency. Click-through rate improved 23%.
Example 2: Fraud Detection
Requirements:
- Real-time fraud detection
- Retrain weekly
- High accuracy needed
- Low false positives
Setup:
- Weekly retraining pipeline
- Real-time inference API
- Performance monitoring
- Alert on accuracy drops
Tools:
- MLflow for model registry
- Seldon for serving
- Custom monitoring dashboard
- Automated retraining
Result: Model retrains weekly with zero downtime. Accuracy maintained at 94-96%. False positive rate under 2%.
Example 3: Customer Churn Prediction
Requirements:
- Daily batch predictions
- Monthly retraining
- Integration with CRM
Setup:
- Scheduled training (monthly)
- Batch inference (daily)
- Results to database
- CRM integration
Tools:
- Airflow for scheduling
- Simple Python scripts
- Database for results
- Basic monitoring
Result: Daily predictions for 50K+ customers. Monthly retraining improved accuracy by 8%. Sales team response time cut in half.
Getting Started
Ready to set up MLOps? Here's where to start.
Week 1: Basics
- Set up version control (Git)
- Start tracking experiments (MLflow)
- Document your current process
- Identify what to monitor
Week 2: Deployment
- Deploy model to staging
- Set up basic monitoring
- Test with production-like data
- Plan rollback strategy
Week 3: Automation
- Automate training pipeline
- Set up scheduled retraining
- Automate deployments
- Add more monitoring
Week 4: Optimization
- Review what you've built
- Identify bottlenecks
- Add missing pieces
- Document everything
Tools to Consider
If you're just starting:
- MLflow (experiment tracking)
- FastAPI (serving)
- Basic monitoring (logs, metrics)
If you're scaling:
- Kubeflow or managed platform
- Advanced monitoring (drift detection)
- Feature stores
- Automated pipelines
If you're enterprise:
- Managed platform (SageMaker, Vertex AI)
- Full MLOps platform
- Enterprise features (RBAC, SSO)
- Dedicated team
The Bottom Line
MLOps isn't optional. If you're putting models in production, you need MLOps.
Start simple:
- Version control
- Basic monitoring
- Simple deployment
- Manual retraining
Add complexity as needed:
- Automated pipelines
- Advanced monitoring
- A/B testing
- Feature stores
What matters:
- Reliability
- Visibility
- Reproducibility
- Speed of iteration
Teams with MLOps ship models 10x faster. They catch issues before users do. They iterate weekly instead of quarterly.
Start today. Set up version control and basic monitoring. Add automation next week. Your future self will thank you.