From Idea to Live LLM Product in 8 Weeks: Our Proven Framework
Why 8 Weeks?
Not because it's a magic number. Because we've run this process enough times to know that 8 weeks is the right constraint for a first production-grade LLM feature. Too short and you skip the evaluation infrastructure that makes AI products trustworthy. Too long and you're optimising against user data you don't have yet.
This isn't a sprint where you throw a prototype over the wall. It's a disciplined process that produces something real users can rely on — with the measurement infrastructure to improve it post-launch.
Here's the full framework, week by week.
Weeks 1–2: Discovery Sprint
What you're doing
The goal of the discovery sprint is to produce three artefacts before writing a single line of production code:
1. The task definition document. A precise description of what the LLM is being asked to do, what a "good" output looks like, what a "bad" output looks like, and what the failure modes are. This sounds obvious. Most teams skip it. Teams that skip it spend weeks refining prompts without knowing what they're aiming for.
2. The golden dataset. A collection of 200–400 real examples of inputs, paired with expected outputs or output criteria. These come from existing data (support tickets, customer emails, historical examples), user research, or domain expert curation. This is the most important investment of the entire project. Your golden dataset is your source of truth for everything that follows.
3. The success metric. One primary metric that defines whether the product is working. Not a list — one number. Task success rate, measured against your golden dataset, evaluated by human reviewers or an LLM-as-judge configured for your specific criteria.
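To make the dataset and the metric concrete, here is a minimal sketch of what a golden-dataset record and a task success rate could look like. The field names and the `run_task` / `is_success` callables are illustrative assumptions, not a prescribed schema; `is_success` stands in for whatever human review or LLM-as-judge check you settle on.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class GoldenExample:
    """One golden-dataset record (field names are illustrative)."""
    example_id: str
    input_text: str
    expected_output: str  # or a description of the output criteria


def task_success_rate(
    examples: Iterable[GoldenExample],
    run_task: Callable[[str], str],
    is_success: Callable[[GoldenExample, str], bool],
) -> float:
    """Return the fraction of examples whose output passes the success check."""
    examples = list(examples)
    if not examples:
        raise ValueError("golden dataset is empty")
    passed = sum(is_success(ex, run_task(ex.input_text)) for ex in examples)
    return passed / len(examples)
```

Keeping the success check as a separate callable means the same metric works whether the judgement comes from a string comparison, a reviewer's label, or a judge model.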
What you're not doing
You are not writing production code. You are not integrating with any external systems. You are not designing the UI. If you find yourself doing these things in weeks 1–2, stop.
The most expensive mistakes in AI product development happen when teams skip the discovery phase and build to the wrong specification.
Weeks 3–4: Architecture and First Prompts
Architecture decisions
With the task definition in hand, you make key architectural decisions:
- RAG or fine-tuning? Most first-iteration products use RAG. Fine-tuning is a week 8+ decision based on measured quality gaps.
- Which model tier? Start with a model you expect to be adequate, not the most powerful one. That keeps costs down during development, and you can upgrade once you see what's failing.
- Structured output or free text? If the output feeds downstream systems, design for structured JSON from the start. Retrofitting structured output is painful.
- Single call or pipeline? Complex tasks benefit from decomposition into sub-prompts (extract, then reason, then format) rather than one mega-prompt.
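As a rough illustration of the last two decisions, the sketch below chains three sub-prompts (extract, then reason, then format) and validates the final step as structured JSON. The `call_llm` callable, the prompt wording, and the `action` / `reason` keys are all hypothetical; swap in your own client and schema.

```python
import json
from typing import Callable

# Hypothetical client signature: prompt text in, completion text out.
LLMCall = Callable[[str], str]

REQUIRED_KEYS = {"action", "reason"}  # illustrative schema


def run_pipeline(ticket: str, call_llm: LLMCall) -> dict:
    """Extract, then reason, then format as validated JSON."""
    facts = call_llm(f"List the key facts in this ticket:\n{ticket}")
    decision = call_llm(f"Given these facts, recommend an action:\n{facts}")
    raw = call_llm(
        "Return only JSON with keys 'action' and 'reason' for:\n" + decision
    )
    # json.JSONDecodeError is a ValueError subclass, so callers can
    # catch ValueError for both malformed and incomplete output.
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - parsed.keys()
    if missing:
        raise ValueError(f"missing required keys: {sorted(missing)}")
    return parsed
```

Passing the client in as an argument keeps each step testable with a stub, which pays off once the eval harness exists.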
Prompt development
Build your first working prompt against 20 examples from your golden dataset. Evaluate manually. Identify failure patterns. Iterate.
The rule of thumb: 10 iterations of prompt refinement before you start building infrastructure. Prompts that work on 80%+ of your golden dataset examples are ready to move forward. Below that, you're building on a shaky foundation.
Document your prompt versions with a simple version number and a note on what changed and why. You'll thank yourself in week 6 when you're debugging a regression.
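Version tracking doesn't need tooling. One possible shape, assuming an in-memory, append-only log (the `PromptLog` helper is hypothetical; a plain file in version control would do the same job):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    version: int
    template: str
    change_note: str  # what changed and why


class PromptLog:
    """Append-only history of prompt versions (illustrative helper)."""

    def __init__(self) -> None:
        self._versions: list[PromptVersion] = []

    def add(self, template: str, change_note: str) -> PromptVersion:
        """Record a new version; numbers start at 1 and only increase."""
        entry = PromptVersion(len(self._versions) + 1, template, change_note)
        self._versions.append(entry)
        return entry

    def get(self, version: int) -> PromptVersion:
        if not 1 <= version <= len(self._versions):
            raise KeyError(f"unknown prompt version {version}")
        return self._versions[version - 1]

    def latest(self) -> PromptVersion:
        if not self._versions:
            raise LookupError("no prompt versions recorded")
        return self._versions[-1]
```

When a regression shows up, `get()` lets you re-run any earlier version against the same examples and compare.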
Week 5: Evaluation Harness
This week is not glamorous. It's also the most important week in the project.
An evaluation harness is an automated system that runs your LLM pipeline against your golden dataset and produces a quality score. It enables:
- Measuring whether prompt changes improved or degraded quality
- Catching regressions before they reach production
- Tracking quality over time as your model or data changes
A minimum viable eval harness:
golden_dataset.json ← 200+ examples with expected outputs
eval_runner.py ← runs pipeline against each example
scorer.py ← computes quality metrics (exact match, semantic similarity, LLM-as-judge)
results/ ← stored results for each eval run
For most tasks, the scorer has three components:
- Deterministic checks (does the output contain required fields? Is it within length limits?)
- Semantic similarity against expected outputs using cosine similarity on embeddings
- LLM-as-judge for subjective quality dimensions (helpfulness, tone, completeness)
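A scorer with these three components might be sketched as follows. The `embed` and `judge` callables are placeholders for your embedding model and judge prompt, and the required marker and length limit are made-up examples of deterministic checks.

```python
import math
from typing import Callable, Sequence

# Hypothetical signatures: plug in a real embedding model and judge prompt.
Embed = Callable[[str], Sequence[float]]
Judge = Callable[[str, str], float]  # (output, expected) -> score in 0.0-1.0


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score_output(
    output: str,
    expected: str,
    embed: Embed,
    judge: Judge,
    required_markers: Sequence[str] = ("Summary:",),  # illustrative check
    max_chars: int = 2000,                            # illustrative limit
) -> dict:
    """Return the three component scores for one example."""
    deterministic_ok = len(output) <= max_chars and all(
        marker in output for marker in required_markers
    )
    return {
        "deterministic_ok": deterministic_ok,
        "similarity": cosine_similarity(embed(output), embed(expected)),
        "judge": judge(output, expected),
    }
```

The sketch returns the components separately on purpose: how you weight or threshold them into a single pass/fail depends on the success metric you defined in discovery.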
Run your eval harness after every meaningful prompt or architecture change. This takes 10–30 minutes depending on dataset size and is the best quality control investment you'll make.
Target: 85%+ on your primary metric before moving to week 6. If you're below 85%, spend more time on prompt engineering and architecture rather than moving forward.
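The runner and the 85% gate can be sketched in a few lines. This assumes each golden-dataset record is a dict with `id`, `input`, and `expected` keys, and that `pipeline` and `passes` are your own callables; none of these names come from a specific library.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable


def evaluate(
    examples: list,
    pipeline: Callable[[str], str],
    passes: Callable[[str, str], bool],
    threshold: float = 0.85,
) -> dict:
    """Run the pipeline on every example and apply the quality gate."""
    if not examples:
        raise ValueError("golden dataset is empty")
    failures = []
    for ex in examples:
        output = pipeline(ex["input"])
        if not passes(output, ex["expected"]):
            failures.append({"id": ex["id"], "output": output})
    score = (len(examples) - len(failures)) / len(examples)
    return {
        "n": len(examples),
        "score": score,
        "passed_gate": score >= threshold,
        "failures": failures,
    }


def save_run(summary: dict, results_dir: str = "results") -> Path:
    """Write one timestamped JSON file per eval run."""
    out_dir = Path(results_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    path = out_dir / f"run_{stamp}.json"
    path.write_text(json.dumps(summary, indent=2), encoding="utf-8")
    return path
```

Storing the failing IDs alongside the score is what makes regressions debuggable: diffing two runs tells you which examples flipped, not just that the number moved.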
Week 6: Integration and MVP Build
With a working pipeline and eval harness in place, you integrate with the real application:
- Connect to actual data sources (databases, document stores, APIs)
- Build the UI/UX surface (often simpler than you think — a good AI feature usually has a minimal interface)
- Implement fallbacks: what happens when the LLM call fails? When latency exceeds your budget? When the output fails validation?
- Add structured logging: every LLM call should log the input, output, latency, model used, and a unique trace ID
Do not skip the fallbacks. Do not skip the logging. These are not optional — they're what separates a production system from a prototype.
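One way to combine the fallbacks and the logging is a thin wrapper around every LLM call. This is a sketch under assumptions: `call_llm` and `validate` are your own callables, and the latency check here only decides whether to serve a slow answer. It does not cancel the call, so a hard timeout still belongs on the client itself.

```python
import json
import logging
import time
import uuid
from typing import Callable

logger = logging.getLogger("llm_calls")


def call_with_fallback(
    prompt: str,
    call_llm: Callable[[str], str],   # hypothetical client call
    validate: Callable[[str], bool],  # hypothetical output validator
    fallback_text: str,
    model: str,
    latency_budget_s: float = 5.0,
) -> str:
    """Return the LLM output, or fallback_text on error, slowness, or bad output."""
    trace_id = str(uuid.uuid4())
    start = time.perf_counter()
    output, status = fallback_text, "ok"
    try:
        candidate = call_llm(prompt)
        if time.perf_counter() - start > latency_budget_s:
            status = "latency_exceeded"
        elif not validate(candidate):
            status = "validation_failed"
        else:
            output = candidate
    except Exception as exc:  # any client failure falls back
        status = f"call_failed:{type(exc).__name__}"
    # One structured log line per call, keyed by trace ID.
    logger.info(json.dumps({
        "trace_id": trace_id,
        "model": model,
        "status": status,
        "latency_s": round(time.perf_counter() - start, 3),
        "input": prompt,
        "output": output,
    }))
    return output
```

Logging the status next to the served output lets you count how often users saw the fallback instead of the model.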
Re-run your eval harness against the integrated system. Production integrations introduce bugs. Catch them now.
Week 7: Internal Testing and Quality Review
Deploy to a staging environment accessible to internal users and collect real usage data. Time-efficient teams keep internal testing running into week 8, concurrently with the start of the gradual rollout.
Internal testing protocol:
- Select 10–20 internal users who represent your target persona
- Brief them on the feature, but let them use it organically
- Instrument every interaction to capture usage patterns
- After 3–4 days, hold a structured review session
In this review, you're looking for:
- Failure cases your golden dataset didn't cover (add them to the dataset immediately)
- Latency complaints (real users have much less patience than developers)
- Confusion about what the feature does (a product/UX problem, not an AI problem)
- The "wow moments" — specific scenarios where the AI delivered disproportionate value (these become your launch story)
Make targeted improvements based on this data. This is not a major rebuild — it's calibration. If you're finding fundamental problems in week 7, you need to revisit the foundation.
Week 8: Gradual Rollout
Gradual rollout is standard in software engineering and absolutely essential in LLM products. Launch to 5–10% of users first.
Why gradual rollout matters even more for AI: Real user inputs are always more diverse than your golden dataset. You will see failure modes you didn't anticipate. If you launch to 100% of users and there's a systematic failure, you've damaged trust with your entire user base. A 10% rollout limits blast radius.
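If you don't already have a feature-flag system, percentage rollout can be approximated with a stable hash of the user ID. The sketch below is one common approach, not a prescribed one; the `salt` value is an arbitrary example.

```python
import hashlib


def in_rollout(user_id: str, percent: float, salt: str = "llm-feature-v1") -> bool:
    """Deterministically place a user inside or outside a percentage rollout.

    The same user always lands in the same bucket, so anyone included
    at 10% stays included when the rollout expands to 50% or 100%.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100
```

Changing the salt reshuffles the buckets, which is useful if you want a different cohort for a later feature.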
Monitor closely for the first 72 hours:
- Error rate (unexpected outputs, API failures, validation failures)
- Quality signal (correction rate, output ratings if you've added a feedback mechanism)
- Latency percentiles (p50, p95, p99 — not just average)
- User engagement (are users actually using it, or ignoring it?)
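For the latency item, percentiles can be computed from logged latencies with the standard library alone. A minimal sketch, assuming you have the latencies in milliseconds as a list:

```python
import statistics
from typing import Sequence


def latency_percentiles(latencies_ms: Sequence[float]) -> dict:
    """Return p50, p95, and p99 using linear interpolation between samples."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

The tail percentiles are where slow LLM calls show up; a healthy average can hide a p99 that is several times worse.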
Expand rollout to 50% after 72 hours if metrics are acceptable. Move to full rollout after 1 week if the 50% cohort looks good.
Post-Launch: The Week-9-and-Beyond System
The 8-week framework gets you to production. What keeps the product improving:
- Weekly eval harness run against your growing golden dataset
- Fortnightly human review of 50 randomly sampled production outputs
- Monthly quality review meeting with product + engineering + a domain expert
- Model upgrade cadence: re-evaluate model choice every quarter as new models release
The golden dataset is a living artefact. Every interesting production failure case gets added to it. After 6 months, your golden dataset can grow to 800+ examples that represent your actual user distribution. That is a genuine asset that makes every future improvement faster and more reliable.
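Both habits, the random review sample and the growing dataset, are simple to automate. The sketch below assumes production outputs and golden examples are lists of dicts with an `input` key; the helper names are hypothetical.

```python
import random
from typing import Optional


def sample_for_review(outputs: list, k: int = 50, seed: Optional[int] = None) -> list:
    """Pick up to k production outputs uniformly at random, without replacement."""
    rng = random.Random(seed)  # pass a seed to make the sample reproducible
    return rng.sample(outputs, min(k, len(outputs)))


def add_failures(golden: list, failures: list) -> list:
    """Return a new golden dataset with unseen failure cases appended."""
    seen = {ex["input"] for ex in golden}
    merged = list(golden)
    for case in failures:
        if case["input"] not in seen:  # skip inputs already covered
            merged.append(case)
            seen.add(case["input"])
    return merged
```

Deduplicating on the input keeps one noisy failure from being over-represented as the dataset grows.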
The Framework in Summary
| Week | Focus | Key Output |
|---|---|---|
| 1–2 | Discovery | Task definition, golden dataset, success metric |
| 3–4 | Architecture + Prompts | Working pipeline at 80%+ on golden dataset |
| 5 | Eval Harness | Automated quality measurement |
| 6 | Integration + MVP | Production-integrated system with fallbacks and logging |
| 7 | Internal Testing | Calibrated system, expanded golden dataset |
| 8 | Gradual Rollout | Live product, 5%→50%→100% with monitoring |
Eight weeks is achievable. It requires discipline — specifically, the discipline to not skip the discovery phase, to build the eval harness before the UI, and to resist the urge to launch to 100% of users before you've seen real traffic at scale.
The teams that follow this framework don't just ship faster. They ship products that continue to improve after launch because they built the measurement infrastructure from the start.