MLOps for Startups: What You Actually Need (And What You Don't)
The MLOps Over-Engineering Trap
Every ML engineer who has worked at a large tech company arrives at a startup with a mental model of what "proper" ML infrastructure looks like: feature stores, model registries, automated retraining pipelines, A/B testing frameworks, data versioning systems, model monitoring dashboards, shadow deployment infrastructure, and a Kubernetes cluster to orchestrate all of it.
At Google, this infrastructure makes sense. At a 15-person startup with 500 daily active users and one ML model in production, it's a six-month engineering project that will consume your entire team before you've found product-market fit.
The MLOps over-engineering trap is real, and it's expensive. We've seen startups spend $200,000+ in engineering time building infrastructure for problems they don't have yet, while the actual product stagnates.
Here is the honest guide to what you need, what you can defer, and when to invest in each layer.
Stage 1: Pre-PMF (0–1 Models in Production)
If you're before product-market fit, your MLOps infrastructure should fit on a napkin. Seriously.
What you need:
- A way to version your prompts (a simple text file in Git is fine)
- Basic request/response logging to a database or log aggregator
- A spreadsheet or simple script to run your golden dataset (a small, hand-checked set of inputs with known-good outputs) and measure quality
- Error alerting (a Slack webhook that fires when your error rate exceeds a threshold)
What you don't need:
- A feature store (you don't have features, you have prompts)
- A model registry (you're using third-party models via API)
- Automated retraining pipelines (nothing to retrain yet)
- Kubernetes (a single VM or serverless function handles your load)
- MLflow or any experiment tracking tool (a structured log file is sufficient)
The total engineering time for this stack: 2–3 days. The cost: nearly zero (logging to PostgreSQL, plus a Slack webhook or similar for alerting).
The most important thing at this stage: logging. Log every LLM input and output, with a timestamp, user identifier, latency, and model version. This data is the foundation for every decision you'll make later. Teams that don't log from day one spend weeks trying to reconstruct what actually happened in production.
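As a rough illustration of the logging described above, here is a minimal sketch. It assumes an in-memory SQLite database as a stand-in for PostgreSQL; the table name `llm_calls`, its columns, and the helpers `log_llm_call` and `fake_model` are hypothetical names chosen for this example, not part of any library.

```python
import sqlite3
import time
from datetime import datetime, timezone

# Hypothetical schema; in-memory SQLite stands in for a real PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE llm_calls (
        id INTEGER PRIMARY KEY,
        ts TEXT NOT NULL,
        user_id TEXT NOT NULL,
        model_version TEXT NOT NULL,
        latency_ms REAL NOT NULL,
        prompt TEXT NOT NULL,
        response TEXT NOT NULL
    )
    """
)


def log_llm_call(conn, user_id, model_version, prompt, call_model):
    """Call the model, time the call, and store input, output, and metadata."""
    start = time.perf_counter()
    response = call_model(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    conn.execute(
        "INSERT INTO llm_calls (ts, user_id, model_version, latency_ms, prompt, response) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), user_id, model_version,
         latency_ms, prompt, response),
    )
    conn.commit()
    return response


def fake_model(prompt):
    """Stub standing in for the real API client."""
    return prompt.upper()


reply = log_llm_call(conn, "user-42", "model-v1", "hello", fake_model)
```

Wrapping the model call in one function like this means every call site gets logged by default, rather than relying on each feature to remember.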
Stage 2: Early Growth (2–5 Models, Post-PMF)
Once you have product-market fit, real users, and multiple AI features, the picture changes. You now have problems worth solving with infrastructure.
What you need:
Prompt version management — more than Git. You need to track which prompt version is serving which traffic, be able to roll back instantly, and run A/B tests between prompt versions with quality measurement on each. A simple internal tool (100–200 lines of Python) or a light tool like PromptLayer is sufficient. This is not the same as a full ML experiment platform.
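One way such an internal tool could start is sketched below. The class `PromptRegistry` and its methods are hypothetical, and state is kept in memory for brevity; a real version would presumably persist versions and activation history in a database table.

```python
from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    """Tracks prompt versions and which one is live, with one-step rollback."""
    versions: dict = field(default_factory=dict)  # version id -> prompt text
    history: list = field(default_factory=list)   # activation order, newest last

    def register(self, version, text):
        if version in self.versions:
            raise ValueError(f"version {version!r} already registered")
        self.versions[version] = text

    def activate(self, version):
        if version not in self.versions:
            raise KeyError(version)
        self.history.append(version)

    def active(self):
        """Return the id of the version currently serving traffic."""
        return self.history[-1]

    def rollback(self):
        """Return to the previously active version."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.history.pop()
        return self.active()


registry = PromptRegistry()
registry.register("v1", "Summarise the text.")
registry.register("v2", "Summarise the text in three bullet points.")
registry.activate("v1")
registry.activate("v2")
previous = registry.rollback()  # v2 misbehaves, so go back to v1
```

Keeping the activation history separate from the stored versions is what makes rollback instant: no prompt text is rewritten, only the pointer to the live version moves.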
Structured observability — beyond basic logging. You need to be able to query: "Show me all calls where latency exceeded 5 seconds in the last 24 hours, grouped by model version and user segment." Tools like Langfuse or a custom Grafana dashboard over your log database work well. Total build time: 3–5 days. Ongoing cost: $50–200/month.
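The example query above might look roughly like this if your logs live in a SQL table. This sketch assumes SQLite, a hypothetical `llm_calls` table with `user_segment` and `latency_ms` columns, and a fixed clock so the result is reproducible; the sample rows are made up for illustration.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE llm_calls (ts TEXT, model_version TEXT, user_segment TEXT, latency_ms REAL)"
)

now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)  # fixed clock for the example
rows = [
    (now - timedelta(hours=1), "v1", "free", 6200.0),
    (now - timedelta(hours=2), "v1", "free", 7100.0),
    (now - timedelta(hours=3), "v2", "paid", 5400.0),
    (now - timedelta(hours=4), "v2", "paid", 900.0),    # fast call, excluded
    (now - timedelta(hours=30), "v1", "free", 9000.0),  # older than 24 hours, excluded
]
conn.executemany(
    "INSERT INTO llm_calls VALUES (?, ?, ?, ?)",
    [(ts.isoformat(), model, segment, latency) for ts, model, segment, latency in rows],
)

# Calls slower than 5 seconds since the cutoff, grouped by model version and segment.
SLOW_CALLS = """
    SELECT model_version, user_segment, COUNT(*) AS slow_calls, MAX(latency_ms) AS worst_ms
    FROM llm_calls
    WHERE latency_ms > 5000 AND ts >= ?
    GROUP BY model_version, user_segment
    ORDER BY slow_calls DESC
"""
since = (now - timedelta(hours=24)).isoformat()
slow = conn.execute(SLOW_CALLS, (since,)).fetchall()
```

Comparing timestamps as text works here only because every row uses the same UTC ISO 8601 format; with PostgreSQL you would use a proper timestamp column instead.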
Automated eval runs — your golden dataset should now run automatically on a schedule (nightly or after every deployment), with results stored in a database and alerts if quality drops below thresholds. This takes 1–2 days to build if you already have a golden dataset.
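A minimal sketch of such an eval run is below. It assumes each golden case carries its own pass/fail check; the function `run_golden_eval`, the case format, and `stub_generate` are hypothetical, and the scheduling and result storage are left to whatever cron or CI system you already run.

```python
def run_golden_eval(golden_set, generate, threshold=0.9):
    """Score `generate` against the golden dataset and flag a quality drop."""
    passed = sum(1 for case in golden_set if case["check"](generate(case["input"])))
    score = passed / len(golden_set)
    return {
        "score": score,
        "passed": passed,
        "total": len(golden_set),
        "alert": score < threshold,  # True means quality fell below the threshold
    }


# Toy golden dataset: each case pairs an input with a check on the output.
golden_set = [
    {"input": "2+2", "check": lambda out: "4" in out},
    {"input": "capital of France", "check": lambda out: "paris" in out.lower()},
    {"input": "opposite of hot", "check": lambda out: "cold" in out.lower()},
    {"input": "3*3", "check": lambda out: "9" in out},
]


def stub_generate(prompt):
    """Stub standing in for the real prompt-plus-model pipeline."""
    answers = {"2+2": "4", "capital of France": "Paris", "opposite of hot": "warm", "3*3": "9"}
    return answers[prompt]


result = run_golden_eval(golden_set, stub_generate, threshold=0.9)
```

Here the stub gets one of four cases wrong, so the score is 0.75 and the alert flag is set; in a nightly job that flag is what would trigger the notification.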
A/B testing framework — simple feature flagging (LaunchDarkly, or a homemade version) that routes a percentage of traffic to a new prompt/model version and measures quality differential. This can be as simple as a random assignment stored in your users table.
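As an alternative to storing a random assignment in the users table, the sketch below derives the assignment from a hash of the user ID, which needs no extra storage and gives the same answer on every request. This is a swapped-in technique, not the article's stored-assignment approach, and the function name `assign_variant` is hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, treatment_percent):
    """Deterministically map a user to 'treatment' or 'control'."""
    # Including the experiment name keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in the range 0..99
    return "treatment" if bucket < treatment_percent else "control"
```

A call such as `assign_variant("user-42", "prompt-v2-test", 10)` sends roughly 10% of users to the new prompt version. Log the returned variant alongside each LLM call so quality can be compared per variant afterwards.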
What you still don't need:
- A feature store (still using prompts, not features)
- Custom model training infrastructure
- Complex orchestration systems
- Dedicated MLOps tooling vendors ($2,000–10,000/month)
The gap between Stage 1 and Stage 2 infrastructure is about 2–3 weeks of engineering time. Spread over a quarter, this is very manageable.
Stage 3: Scale (5+ Models, High Volume)
At meaningful scale — say, 1M+ LLM calls per month, multiple models in production, and an ML team of 3+ — the more sophisticated tooling starts paying for itself.
What to invest in:
Model monitoring — drift detection, anomaly alerts, quality degradation detection. Real-time dashboards that flag when output quality for a specific user segment or input category degrades. Arize, WhyLabs, or a custom solution. At this stage, the cost of a production quality failure exceeds the cost of the monitoring tool.
Experiment tracking — MLflow or Weights & Biases for tracking evaluation runs, prompt experiments, and fine-tuning jobs systematically. The value emerges when you have 50+ experiments and need to understand what actually caused improvements.
Cost attribution — at 1M+ calls/month, you need per-feature, per-customer, per-model-tier cost breakdown. This isn't a vendor tool — it's instrumentation in your application layer that tags every LLM call with a cost-centre identifier.
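The tagging idea can be sketched as follows. The prices are hypothetical placeholders, not any provider's real rates, and the single blended per-token price is a simplification (providers often charge differently for input and output tokens). The names `call_cost` and `cost_breakdown` are illustrative.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens; substitute your provider's real rates.
PRICE_PER_1K_TOKENS = {"small-model": 0.5, "large-model": 5.0}


def call_cost(model, input_tokens, output_tokens):
    """Cost of one call under a single blended per-token price (a simplification)."""
    return (input_tokens + output_tokens) / 1000 * PRICE_PER_1K_TOKENS[model]


def cost_breakdown(calls, key):
    """Total cost grouped by any tag attached to each call, e.g. 'feature' or 'customer'."""
    totals = defaultdict(float)
    for call in calls:
        totals[call[key]] += call_cost(call["model"], call["input_tokens"], call["output_tokens"])
    return dict(totals)


# Each logged call carries its cost-centre tags.
calls = [
    {"feature": "summarise", "customer": "acme", "model": "small-model",
     "input_tokens": 1500, "output_tokens": 500},
    {"feature": "summarise", "customer": "globex", "model": "small-model",
     "input_tokens": 3000, "output_tokens": 1000},
    {"feature": "chat", "customer": "acme", "model": "large-model",
     "input_tokens": 800, "output_tokens": 200},
]
by_feature = cost_breakdown(calls, "feature")
by_customer = cost_breakdown(calls, "customer")
```

Because the tags live on each call record, the same data answers per-feature, per-customer, and per-model questions without extra instrumentation.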
Data versioning — if you're accumulating production feedback data and using it for fine-tuning, you need to version your datasets the same way you version code. DVC (Data Version Control) is the standard tool here.
When to self-host models — at approximately 10M tokens per day, the economics of self-hosting open-source models (Llama 3, Mistral, etc.) on cloud GPUs typically become competitive with API costs. Before that threshold, API costs are lower when you factor in engineering time, reliability engineering, and infrastructure management.
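The break-even reasoning can be written as simple arithmetic. In the sketch below, every number is a placeholder chosen only so the result lands on the figure quoted above; none of them is a real price, and your own threshold depends entirely on the rates and staffing costs you plug in. The function name is hypothetical.

```python
def breakeven_tokens_per_day(api_price_per_million, gpu_cost_per_day, ops_cost_per_day):
    """Daily token volume at which self-hosting costs the same as the API.

    Self-hosting is modelled as a fixed daily cost (GPUs plus engineering and
    operations time); the API is modelled as a purely per-token cost.
    """
    fixed_cost_per_day = gpu_cost_per_day + ops_cost_per_day
    return fixed_cost_per_day / api_price_per_million * 1_000_000


# Placeholder inputs, not real prices.
threshold = breakeven_tokens_per_day(
    api_price_per_million=10.0,  # USD per 1M tokens via API (hypothetical)
    gpu_cost_per_day=60.0,       # USD per day for cloud GPUs (hypothetical)
    ops_cost_per_day=40.0,       # USD per day of engineering time (hypothetical)
)
```

Below the computed volume the fixed daily cost of self-hosting exceeds the API bill; above it, each additional token favours self-hosting. The model ignores GPU capacity limits, which would add further fixed cost at higher volumes.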
The Observability Essentials (Non-Negotiable at Every Stage)
Regardless of where you are in the stages above, these observability elements are non-negotiable from day one:
Input/output logging — every LLM call, timestamped, with user context. The retention period should be at least 90 days.
Latency percentiles — p50, p95, p99 by model and feature. Not just averages — averages hide the tail latency that ruins user experience.
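To make the point about averages concrete, here is a small sketch using the nearest-rank percentile method on made-up latency samples. The `percentile` helper is written for this example; in production you would more likely rely on your database or dashboard tool to compute percentiles.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up sample: most calls are fast, a few are very slow.
latencies_ms = [120] * 90 + [900] * 8 + [4000, 9000]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

In this sample the mean is 310 ms and the median is 120 ms, while p99 is 4,000 ms: the average gives no hint that one call in a hundred takes several seconds.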
Error rate by category — API errors, validation failures, timeout rates. Tracked over time so you can see trends, not just point-in-time snapshots.
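A sketch of per-category error rates with a threshold alert follows. The event format and the helper names are assumptions for this example. The alert is built as a simple JSON payload with a `text` field, the shape Slack incoming webhooks accept; actually POSTing it to your webhook URL is left out.

```python
from collections import Counter


def error_rates(events):
    """Fraction of all calls falling in each error category (None means success)."""
    counts = Counter(e["error"] for e in events if e["error"] is not None)
    return {category: n / len(events) for category, n in counts.items()}


def build_alert(rates, threshold):
    """Return a webhook message payload if any category exceeds the threshold, else None."""
    breaches = {c: r for c, r in rates.items() if r > threshold}
    if not breaches:
        return None
    lines = [f"{c}: {r:.1%}" for c, r in sorted(breaches.items())]
    return {"text": "LLM error rate above threshold\n" + "\n".join(lines)}


# Made-up window of 100 calls.
events = (
    [{"error": "timeout"}] * 6
    + [{"error": "api_error"}] * 1
    + [{"error": "validation"}] * 2
    + [{"error": None}] * 91
)
rates = error_rates(events)
alert = build_alert(rates, threshold=0.05)
```

Running this over successive time windows and storing `rates` each time gives the trend view, while `build_alert` covers the point-in-time check.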
Cost tracking — total spend by model, by feature, by time period. If you don't measure it, you can't manage it.
User feedback signal — even a simple thumbs up/down on AI outputs is enormously valuable. Capture it from day one. Even 5% of users providing explicit feedback generates hundreds of labelled examples per month on a moderately active product.
These five observability elements can be implemented in a day using any standard logging infrastructure you already have. There is no excuse for not having them.
The Common Over-Engineering Mistakes
Building a feature store for LLM applications. Feature stores are designed for classical ML that requires tabular features. For LLM applications, your "features" are your prompts and retrieved documents. A feature store adds complexity with no benefit.
Automated retraining pipelines before you have training data. Retraining infrastructure is only valuable if you have a labelled dataset large enough to train on, a quality metric that improves with more data, and the engineering capacity to validate and deploy new model versions safely. Most startups don't have all three until they're well past 100,000 users.
Shadow deployment infrastructure. Running a new model version in shadow mode (processing a copy of live traffic and logging its outputs for comparison, without ever serving them to users) is useful at scale for high-stakes models. For most startups, a staged rollout (5%→25%→50%→100%) with quality monitoring provides equivalent safety at a fraction of the complexity.
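The staged rollout can be driven by a very small piece of logic, sketched below. The stage percentages come from the text; the quality scores, the `tolerance` parameter, and the choice to drop straight to 0% on a regression are assumptions made for this example.

```python
ROLLOUT_STAGES = [5, 25, 50, 100]  # percent of traffic, as in the text


def next_rollout_percent(current, candidate_quality, baseline_quality, tolerance=0.02):
    """Advance one stage if quality holds; return 0 (full rollback) if it regresses."""
    if candidate_quality < baseline_quality - tolerance:
        return 0
    later = [stage for stage in ROLLOUT_STAGES if stage > current]
    return later[0] if later else current
```

For example, `next_rollout_percent(5, 0.91, 0.90)` moves the new version from 5% to 25% of traffic, while `next_rollout_percent(25, 0.85, 0.90)` returns 0 because quality fell more than the tolerance below the baseline.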
Kubernetes before you need it. The operational overhead of Kubernetes for a team of fewer than 10 engineers is substantial. Serverless (Lambda, Cloud Run) or a single well-provisioned VM handles the load of most early-stage AI products. Migrate to Kubernetes when autoscaling and deployment automation become genuine bottlenecks, not before.
The Practical Path
Here is the recommended progression:
Month 1: Log everything. Build your golden dataset. Wire up basic error alerting. Total engineering time: 3–5 days.
Month 3: Add structured observability, automated eval runs, and simple A/B testing. Total additional engineering time: 2 weeks.
Month 6–9: Based on actual bottlenecks you've observed, selectively add the remaining Stage 2 tooling (such as prompt version management) that addresses your specific pain points. Do not add tooling speculatively.
Year 2+: With real scale, real data, and a real ML team, the Stage 3 investments pay for themselves and the build decisions become obvious.
The startup MLOps landscape is full of vendors trying to sell you Stage 3 tooling when you're at Stage 1. Almost all of them have compelling demos. Almost none of them are the right investment for a pre-PMF company.
Your competitive advantage is not your MLOps infrastructure. It's the quality of your model outputs, your feedback loop with users, and your product intuition. Build the infrastructure that serves those goals — and nothing more until you need it.