From Idea to Live LLM Product in 8 Weeks: Our Proven Framework

5th March 2026

By Neosharks

Why 8 Weeks?

Not because it's a magic number. Because we've run this process enough times to know that 8 weeks is the right constraint for a first production-grade LLM feature. Too short and you skip the evaluation infrastructure that makes AI products trustworthy. Too long and you're optimising against user data you don't have yet.

This isn't a sprint where you throw a prototype over the wall. It's a disciplined process that produces something real users can rely on — with the measurement infrastructure to improve it post-launch.

Here's the full framework, week by week.

Weeks 1–2: Discovery Sprint

What you're doing

The goal of the discovery sprint is to produce three artefacts before writing a single line of production code:

1. The task definition document. A precise description of what the LLM is being asked to do, what a "good" output looks like, what a "bad" output looks like, and what the failure modes are. This sounds obvious. Most teams skip it. Teams that skip it spend weeks refining prompts without knowing what they're aiming for.

2. The golden dataset. A collection of 200–400 real examples of inputs, paired with expected outputs or output criteria. These come from existing data (support tickets, customer emails, historical examples), user research, or domain expert curation. This is the most important investment of the entire project. Your golden dataset is your source of truth for everything that follows.

3. The success metric. One primary metric that defines whether the product is working. Not a list — one number. Task success rate, measured against your golden dataset, evaluated by human reviewers or an LLM-as-judge configured for your specific criteria.
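
As a rough illustration of how the golden dataset and the single success metric fit together, here is a minimal sketch. The record fields, the example tickets, and the verdicts are all made up for illustration; in practice the pass/fail verdicts would come from human reviewers or an LLM-as-judge.

```python
from dataclasses import dataclass


@dataclass
class GoldenExample:
    """One golden-dataset record (field names are illustrative assumptions)."""
    example_id: str
    input_text: str
    expected_output: str


def task_success_rate(verdicts: dict[str, bool]) -> float:
    """Fraction of golden examples judged successful.

    `verdicts` maps example_id -> pass/fail, as decided by a human
    reviewer or an LLM-as-judge.
    """
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(verdicts.values()) / len(verdicts)


# Toy data, invented for this sketch.
examples = [
    GoldenExample("t-001", "Where is my refund?", "Refund status with timeline"),
    GoldenExample("t-002", "Cancel my plan", "Cancellation steps"),
    GoldenExample("t-003", "App crashes on login", "Troubleshooting steps"),
    GoldenExample("t-004", "Change billing email", "Account settings steps"),
]
verdicts = {"t-001": True, "t-002": True, "t-003": False, "t-004": True}

rate = task_success_rate(verdicts)
print(f"task success rate: {rate:.0%}")  # prints "task success rate: 75%"
```

The point of the sketch is the shape: one record type, one function, one number.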

What you're not doing

You are not writing production code. You are not integrating with any external systems. You are not designing the UI. If you find yourself doing these things in week 1, stop.

The most expensive mistakes in AI product development happen when teams skip the discovery phase and build to the wrong specification.

Weeks 3–4: Architecture and First Prompts

Architecture decisions

With the task definition in hand, you make key architectural decisions:

RAG or fine-tuning? Most first-iteration products use RAG. Fine-tuning is a week 8+ decision based on measured quality gaps.
Which model tier? Start with a model you expect to be adequate, not the most powerful one available. You can upgrade once you see what's failing, and the cheaper tier saves cost during development.
Structured output or free text? If the output feeds downstream systems, design for structured JSON from the start. Retrofitting structured output is painful.
Single call or pipeline? Complex tasks benefit from decomposition into sub-prompts (extract, then reason, then format) rather than one mega-prompt.
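
To make the last two decisions concrete, here is a minimal sketch of a decomposed pipeline (extract, then reason, then format) that ends in validated JSON. Everything named here is an assumption for illustration: `LLMCall` stands in for whatever model client you use, the prompts are placeholders, and `fake_llm` returns canned text so the sketch runs without any API.

```python
import json
from typing import Callable

# Hypothetical stand-in for a real model client: takes a prompt, returns text.
LLMCall = Callable[[str], str]

REQUIRED_FIELDS = {"category", "summary"}  # illustrative output schema


def run_pipeline(ticket: str, llm: LLMCall) -> dict:
    """Extract -> reason -> format, each as its own small prompt."""
    facts = llm(f"EXTRACT the key facts from this ticket:\n{ticket}")
    category = llm(f"REASON about which category fits these facts:\n{facts}")
    raw = llm(
        "FORMAT as JSON with keys category and summary.\n"
        f"category: {category}\nfacts: {facts}"
    )
    result = json.loads(raw)  # raises ValueError on malformed JSON
    missing = REQUIRED_FIELDS - set(result)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return result


def fake_llm(prompt: str) -> str:
    """Canned responses so the sketch runs offline."""
    if prompt.startswith("EXTRACT"):
        return "customer was charged twice"
    if prompt.startswith("REASON"):
        return "billing"
    return json.dumps(
        {"category": "billing", "summary": "customer was charged twice"}
    )


output = run_pipeline("I was charged twice this month.", fake_llm)
print(output)
```

Because each step is a separate call, a failure can be traced to the step that caused it, and the final validation keeps malformed output away from downstream systems.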

Prompt development

Build your first working prompt against 20 examples from your golden dataset. Evaluate manually. Identify failure patterns. Iterate.

The rule of thumb: expect around 10 iterations of prompt refinement before you start building infrastructure. A prompt that succeeds on 80%+ of your golden dataset examples is ready to move forward. Below that, you're building on a shaky foundation.

Document your prompt versions with a simple version number and a note on what changed and why. You'll thank yourself in week 6 when you're debugging a regression.
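
Prompt versioning does not need tooling to start with. A minimal sketch, assuming an in-memory log (the `PromptLog` helper, the templates, and the change notes are all hypothetical; a JSON file or a git-tracked folder would serve the same purpose):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    version: int
    template: str
    change_note: str  # what changed and why


class PromptLog:
    """Append-only history of prompt versions (hypothetical helper)."""

    def __init__(self) -> None:
        self._versions: list[PromptVersion] = []

    def add(self, template: str, change_note: str) -> PromptVersion:
        entry = PromptVersion(len(self._versions) + 1, template, change_note)
        self._versions.append(entry)
        return entry

    def latest(self) -> PromptVersion:
        return self._versions[-1]

    def get(self, version: int) -> PromptVersion:
        if not 1 <= version <= len(self._versions):
            raise KeyError(version)
        return self._versions[version - 1]


# Illustrative entries only.
log = PromptLog()
log.add("Summarise the ticket: {ticket}", "Initial version.")
log.add(
    "Summarise the ticket in two sentences: {ticket}",
    "Outputs ran too long on several examples; added a length limit.",
)
print(log.latest().version, "-", log.latest().change_note)
```

When a regression shows up later, `log.get(n)` lets you rerun an older version against the same examples and compare.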

Week 5: Evaluation Harness

This week is not glamorous. It's also the most important week in the project.

An evaluation harness is an automated system that runs your LLM pipeline against your golden dataset and produces a quality score. It enables:

Measuring whether prompt changes improved or degraded quality
Catching regressions before they reach production
Tracking quality over time as your model or data changes

A minimum viable eval harness:

golden_dataset.json  ← 200+ examples with expected outputs
eval_runner.py       ← runs pipeline against each example
scorer.py            ← computes quality metrics (exact match, 
                       semantic similarity, LLM-as-judge)
results/             ← stored results for each eval run

For most tasks, the scorer has three components:

Deterministic checks (does the output contain required fields? Is it within length limits?)
Semantic similarity against expected outputs using cosine similarity on embeddings
LLM-as-judge for subjective quality dimensions (helpfulness, tone, completeness)
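
A minimal sketch of how these three components might combine in a scorer. Several things here are assumptions rather than part of the framework: the thresholds are placeholders, the rule that all three components must pass is one possible policy, the embeddings are passed in as plain vectors (computed by whatever embedding model you use), and `judge` is a stand-in for a real LLM-as-judge call that returns a score between 0 and 1.

```python
import math
from typing import Callable, Sequence


def deterministic_checks(output: dict, required: set[str], max_chars: int) -> bool:
    """Required fields are present and total text length is within the limit."""
    has_fields = required <= set(output)
    within_limit = sum(len(str(v)) for v in output.values()) <= max_chars
    return has_fields and within_limit


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score_example(
    output: dict,
    output_vec: Sequence[float],
    expected_vec: Sequence[float],
    judge: Callable[[dict], float],
    required: set[str],
    max_chars: int = 500,         # placeholder threshold
    min_similarity: float = 0.8,  # placeholder threshold
    min_judge: float = 0.7,       # placeholder threshold
) -> bool:
    """Pass only if all three components pass (an assumed policy)."""
    if not deterministic_checks(output, required, max_chars):
        return False  # cheap check first; skip the costlier ones
    if cosine_similarity(output_vec, expected_vec) < min_similarity:
        return False
    return judge(output) >= min_judge


judge_stub = lambda output: 0.9  # stands in for a real LLM-as-judge call

good = score_example(
    {"summary": "Refund issued"}, [1.0, 0.0], [1.0, 0.0], judge_stub, {"summary"}
)
missing_field = score_example(
    {"text": "Refund issued"}, [1.0, 0.0], [1.0, 0.0], judge_stub, {"summary"}
)
off_topic = score_example(
    {"summary": "Weather is nice"}, [0.0, 1.0], [1.0, 0.0], judge_stub, {"summary"}
)
print(good, missing_field, off_topic)  # prints "True False False"
```

Ordering the checks from cheapest to most expensive means the judge is only called on outputs that already pass the basics.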

Run your eval harness after every meaningful prompt or architecture change. This takes 10–30 minutes depending on dataset size and is the best quality control investment you'll make.

Target: 85%+ on your primary metric before moving to week 6. If you're below 85%, spend more time on prompt engineering and architecture rather than moving forward.
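
The runner and the gate can be sketched in a few lines. This is an illustrative outline, not a finished `eval_runner.py`: the dataset fields, the toy pipeline (`str.upper`), the exact-match check, and the baseline score are all invented so the example runs on its own.

```python
from typing import Callable, Optional

TARGET = 0.85  # the week 5 gate described above


def run_eval(
    dataset: list[dict],
    pipeline: Callable[[str], str],
    passes: Callable[[str, str], bool],
) -> dict:
    """Run the pipeline over every golden example and compute the pass rate."""
    if not dataset:
        raise ValueError("golden dataset is empty")
    results = []
    for example in dataset:
        try:
            passed = passes(pipeline(example["input"]), example["expected"])
        except Exception:
            passed = False  # a crashing pipeline counts as a failed example
        results.append({"id": example["id"], "passed": passed})
    score = sum(r["passed"] for r in results) / len(results)
    return {"score": score, "results": results}


def gate(score: float, baseline: Optional[float] = None) -> list[str]:
    """Return the reasons not to move forward (empty list means go)."""
    problems = []
    if score < TARGET:
        problems.append(f"score {score:.0%} is below the {TARGET:.0%} target")
    if baseline is not None and score < baseline:
        problems.append(f"regression: {score:.0%} is below the previous {baseline:.0%}")
    return problems


# Toy dataset and pipeline, invented for this sketch.
dataset = [
    {"id": "a", "input": "refund", "expected": "REFUND"},
    {"id": "b", "input": "cancel", "expected": "CANCEL"},
    {"id": "c", "input": "login", "expected": "LOGIN"},
    {"id": "d", "input": "billing", "expected": "INVOICE"},
]
run = run_eval(dataset, pipeline=str.upper, passes=lambda out, exp: out == exp)
problems = gate(run["score"], baseline=0.80)
for problem in problems:
    print(problem)
```

Here three of four examples pass, so the run scores 75% and the gate reports both a missed target and a regression against the assumed 80% baseline.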

Week 6: Integration and MVP Build

With a working pipeline and eval harness in place, you integrate with the real application:

Connect to actual data sources (databases, document stores, APIs)
Build the UI/UX surface (often simpler than you think — a good AI feature usually has a minimal interface)
Implement fallbacks: what happens when the LLM call fails? When latency exceeds your budget? When the output fails validation?
Add structured logging: every LLM call should log the input, output, latency, model used, and a unique trace ID

Do not skip the fallbacks. Do not skip the logging. These are not optional — they're what separates a production system from a prototype.
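
Fallbacks and logging can live in one thin wrapper around the model call. The sketch below is one possible shape, with assumptions throughout: `llm` and `validate` are hypothetical callables you supply, the fallback copy and the 5-second budget are placeholders, and the latency check only measures after the call returns (a real client should also enforce a timeout on the request itself).

```python
import json
import logging
import time
import uuid
from typing import Callable

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("llm_calls")

FALLBACK_TEXT = "This feature is temporarily unavailable."  # placeholder copy


def call_with_fallback(
    prompt: str,
    llm: Callable[[str], str],
    validate: Callable[[str], bool],
    model: str,
    latency_budget_s: float = 5.0,
) -> dict:
    """Call the model, log the call, and fall back on any failure."""
    trace_id = str(uuid.uuid4())
    start = time.monotonic()
    output = None
    outcome = "ok"
    try:
        output = llm(prompt)
        if not validate(output):
            outcome = "invalid_output"
    except Exception as exc:
        outcome = f"call_failed:{type(exc).__name__}"
    latency_s = time.monotonic() - start
    if outcome == "ok" and latency_s > latency_budget_s:
        # Measured after the fact; it does not interrupt a slow call.
        outcome = "over_latency_budget"
    logger.info(json.dumps({
        "trace_id": trace_id,
        "model": model,
        "input": prompt,
        "output": output,
        "latency_ms": round(latency_s * 1000, 1),
        "outcome": outcome,
    }))
    if outcome != "ok":
        return {"status": "fallback", "text": FALLBACK_TEXT, "trace_id": trace_id}
    return {"status": "ok", "text": output, "trace_id": trace_id}


def broken_llm(prompt: str) -> str:
    raise TimeoutError("simulated provider timeout")


not_empty = lambda text: len(text) > 0
ok = call_with_fallback("Say hi", lambda p: "hello", not_empty, "demo-model")
failed = call_with_fallback("Say hi", broken_llm, not_empty, "demo-model")
invalid = call_with_fallback("Say hi", lambda p: "", not_empty, "demo-model")
print(ok["status"], failed["status"], invalid["status"])  # ok fallback fallback
```

Every path, including the failures, writes one log line with the same fields, so a trace ID from a user report can be looked up directly.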

Re-run your eval harness against the integrated system. Production integrations introduce bugs. Catch them now.

Week 7: Internal Testing and Quality Review

Deploy to a staging environment accessible to internal users and collect real usage data. Teams short on time can let internal testing continue into week 8, overlapping with the start of the rollout.

Internal testing protocol:

Select 10–20 internal users who represent your target persona
Brief them on the feature, but let them use it organically
Instrument every interaction to capture usage patterns
After 3–4 days, hold a structured review session

In this review, you're looking for:

Failure cases your golden dataset didn't cover (add them to the dataset immediately)
Latency complaints (real users have much less patience than developers)
Confusion about what the feature does (a product/UX problem, not an AI problem)
The "wow moments" — specific scenarios where the AI delivered disproportionate value (these become your launch story)

Make targeted improvements based on this data. This is not a major rebuild — it's calibration. If you're finding fundamental problems in week 7, you need to revisit the foundation.

Week 8: Gradual Rollout

Gradual rollout is standard in software engineering and absolutely essential in LLM products. Launch to 5–10% of users first.

Why gradual rollout matters even more for AI: real user inputs are always more diverse than your golden dataset. You will see failure modes you didn't anticipate. If you launch to 100% of users and there's a systematic failure, you've damaged trust with your entire user base. A 5–10% rollout limits the blast radius.

Monitor closely for the first 72 hours:

Error rate (unexpected outputs, API failures, validation failures)
Quality signal (correction rate, output ratings if you've added a feedback mechanism)
Latency percentiles (p50, p95, p99 — not just average)
User engagement (are users actually using it, or ignoring it?)
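
To see why percentiles matter more than the average, here is a small sketch using Python's standard `statistics` module. The latency sample is synthetic and chosen to be skewed: most calls are fast and a few are very slow.

```python
import statistics


def latency_summary(latencies_ms: list[float]) -> dict:
    """Mean plus p50/p95/p99 from a list of per-call latencies."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    return {
        "mean": statistics.fmean(latencies_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Synthetic sample: 95 fast calls and 5 very slow ones.
latencies = [200.0] * 95 + [4000.0] * 5
summary = latency_summary(latencies)
print(summary)
```

In this sample the mean sits well above the median and far below the tail, so it describes neither the typical user nor the unlucky one. The p95 and p99 values are what surface the slow calls.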

Expand rollout to 50% after 72 hours if metrics are acceptable. Move to full rollout after 1 week if the 50% cohort looks good.
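
One common way to implement staged exposure is deterministic hash bucketing, sketched below. This is an assumed technique, not something the framework prescribes: the feature name, user IDs, and stage list are illustrative, and many teams would use an existing feature-flag service instead.

```python
import hashlib

ROLLOUT_STAGES = [5, 50, 100]  # percent of users at each stage


def bucket(user_id: str, feature: str) -> int:
    """Stable bucket in [0, 100) derived from a hash of feature and user."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, percent: int) -> bool:
    """A user is in the rollout if their bucket falls below the percentage."""
    return bucket(user_id, feature) < percent


users = [f"user-{i}" for i in range(1000)]  # made-up IDs
for percent in ROLLOUT_STAGES:
    enabled = sum(is_enabled(u, "ai-summary", percent) for u in users)
    print(f"{percent}% stage: {enabled} of {len(users)} users enabled")
```

Because the bucket depends only on the user and the feature name, a user who sees the feature at the first stage keeps it at every later stage, and raising the percentage only ever adds users.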

Post-Launch: The Week-9-and-Beyond System

The 8-week framework gets you to production. What keeps the product improving:

Weekly eval harness run against your growing golden dataset
Fortnightly human review of 50 randomly sampled production outputs
Monthly quality review meeting with product + engineering + a domain expert
Model upgrade cadence: re-evaluate model choice every quarter as new models release

The golden dataset is a living artefact. Every interesting production failure case gets added to it. After 6 months, your golden dataset is 800+ examples that represent your actual user distribution — a genuine asset that makes every future improvement faster and more reliable.
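
The review-and-grow loop can be sketched in two small functions. The log format, the `corrected_output` field, and the deduplication rule (exact match on input) are assumptions made for this example.

```python
import random
from typing import Optional


def sample_for_review(
    logged_outputs: list[dict], k: int = 50, seed: Optional[int] = None
) -> list[dict]:
    """Uniform random sample of production outputs for human review."""
    rng = random.Random(seed)
    return rng.sample(logged_outputs, min(k, len(logged_outputs)))


def add_failures(golden: list[dict], failures: list[dict]) -> int:
    """Append reviewed failure cases to the golden dataset, skipping duplicates."""
    seen = {example["input"] for example in golden}
    added = 0
    for failure in failures:
        if failure["input"] not in seen:
            golden.append(
                {"input": failure["input"], "expected": failure["corrected_output"]}
            )
            seen.add(failure["input"])
            added += 1
    return added


# Made-up production logs and review findings.
logs = [{"trace_id": f"t-{i}", "input": f"question {i}"} for i in range(200)]
review_batch = sample_for_review(logs, k=50, seed=7)

golden = [{"input": "question 1", "expected": "answer 1"}]
failures = [
    {"input": "question 1", "corrected_output": "answer 1"},    # already covered
    {"input": "question 42", "corrected_output": "answer 42"},  # new case
]
added = add_failures(golden, failures)
print(len(review_batch), "sampled;", added, "new golden example added")
```

Sampling uniformly, rather than reviewing only flagged outputs, is what keeps the growing dataset close to the real user distribution.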

The Framework in Summary

Week	Focus	Key Output
1–2	Discovery	Task definition, golden dataset, success metric
3–4	Architecture + Prompts	Working prompts at 80%+ on golden dataset examples
5	Eval Harness	Automated quality measurement
6	Integration + MVP	Production-integrated system with fallbacks and logging
7	Internal Testing	Calibrated system, expanded golden dataset
8	Gradual Rollout	Live product, 5%→50%→100% with monitoring

Eight weeks is achievable. It requires discipline — specifically, the discipline to not skip the discovery phase, to build the eval harness before the UI, and to resist the urge to launch to 100% of users before you've seen real traffic at scale.

The teams that follow this framework don't just ship faster. They ship products that continue to improve after launch because they built the measurement infrastructure from the start.