How We Cut AI Inference Costs by 68% Without Sacrificing Quality
The Bill That Changes the Conversation
AI demos are cheap. AI at scale is not.
When a client came to us with a $47,000 monthly OpenAI invoice for a customer support automation system handling 400,000 queries per month, the business case for their AI product was evaporating. The product worked. Users loved it. But at $0.117 per query, the unit economics didn't close.
Over eight weeks of focused optimisation, we brought that figure to $14,800 per month — a 68% reduction — without any meaningful drop in response quality as measured by their existing human evaluation pipeline. Here's exactly what we did, and why each technique worked.
Technique 1: Model Cascading (Saves 35–50% Alone)
Model cascading is the single highest-leverage technique available, and the most underused.
The core insight: not every query needs a frontier model. In our client's support system, analysing query complexity across a random sample of 10,000 queries revealed:
- 42% were simple, FAQ-style questions — product hours, return policies, shipping times. These needed accurate retrieval and clean formatting, not complex reasoning.
- 35% were moderate complexity — multi-step issues requiring context-tracking but not deep inference
- 23% were genuinely complex — edge cases, escalations, multi-product issues, emotional situations
Before optimisation: every query went to GPT-4o. After optimisation:
- Simple queries: GPT-4o-mini (7x cheaper, 95% quality parity on these cases)
- Moderate queries: GPT-4o-mini with enhanced retrieval
- Complex queries: GPT-4o (unchanged quality)
The classifier that routes queries is itself a lightweight GPT-4o-mini call (~150 input tokens) costing fractions of a cent. The net cost reduction from cascading alone was 41%.
Implementation note: Build your cascade classifier on a labelled sample of real queries, not synthetic data. Synthetic data underrepresents the weird edge cases that are expensive to misclassify. A 500-example human-labelled dataset is sufficient for 90%+ routing accuracy.
Technique 2: Semantic Caching (Saves 15–25%)
LLM APIs charge per token, every time, even for questions you've answered a hundred times before. Semantic caching stores previous responses and retrieves them when a new query is semantically similar enough to a cached one.
Unlike exact-match caching, semantic caching handles paraphrases. "What are your store hours?" and "When are you open?" and "Is the shop open on Sunday?" all map to the same cached response.
In our client's system, semantic caching eliminated 22% of API calls entirely. The implementation used:
- A FAISS index of query embeddings (cost: fractions of a cent per embedding via a small embedding model)
- A similarity threshold of 0.92 cosine similarity
- A TTL of 24 hours for time-sensitive responses, 7 days for stable information
Watch out for: setting the similarity threshold too low (serving wrong cached answers) or caching personalised responses (where the same query should yield different results per user). Cache only impersonal, factual, or policy-driven responses.
Technique 3: Prompt Compression (Saves 10–20%)
Prompts are tokens. Tokens are money. Most prompts contain substantial redundancy.
Before optimisation, the system prompt for our client's chatbot was 2,847 tokens — a sprawling document written by committee over six months, full of repetition, examples, and edge cases that rarely applied. After aggressive editing:
- Removed duplicate instructions: -400 tokens
- Converted verbose guidance to concise rules: -600 tokens
- Replaced long few-shot examples with compact format specifications: -500 tokens
Resulting prompt: 1,347 tokens. 53% reduction in system prompt tokens. On a high-volume system, system prompt tokens dominate costs because they repeat on every call.
Beyond manual editing, automated compression tools like LLMLingua can compress prompts by 2–4x with minimal quality loss for many use cases. We've had good results with LLMLingua-2 on RAG context compression, where the retrieved chunks are often verbose and contain irrelevant surrounding text.
For RAG systems specifically: don't retrieve full documents, retrieve targeted passages. Most of our clients retrieve top-5 chunks of 512 tokens each, injecting 2,560 tokens of context per query. Switching to targeted passage extraction (top-3 chunks of 200 tokens, reranked by relevance) cut context tokens by 53% while improving answer accuracy by 8% — less noise, more signal.
Technique 4: Intelligent Batching (Saves 5–15%)
Most LLM APIs support batch processing at significantly lower rates. OpenAI's Batch API charges 50% less than synchronous calls. If your use case tolerates latency (internal data processing, nightly report generation, async content moderation), batching is free money.
For our client, 18% of queries were submitted during off-peak hours by enterprise users running overnight data processing jobs. Moving these to batch processing saved an additional 9% on overall monthly costs with zero product impact.
Even for synchronous workloads, micro-batching within your application tier can reduce per-token overhead when using self-hosted models (vLLM, TGI) by improving GPU utilisation. We've seen throughput improvements of 2–3x on self-hosted Llama 3 70B deployments by tuning batch sizes to match GPU memory constraints.
Technique 5: Right-Sizing to the Task
Not all GPT-4o calls are equal. Some calls were using GPT-4o to perform tasks it dramatically overkills:
- JSON extraction from structured text: regex or a fine-tuned 1B model handles this at 1/100th the cost
- Content moderation: dedicated moderation APIs (OpenAI Moderation, AWS Comprehend) are faster and cheaper than using a general LLM
- Simple classification (positive/negative sentiment, category tagging): fine-tuned small models outperform prompted large models here anyway
Audit your LLM calls. For each one, ask: "What's the minimum model capability that can solve this task at acceptable quality?" Many "LLM calls" don't actually need an LLM — they need a precise tool and the team reached for the general hammer.
Technique 6: Quantisation for Self-Hosted Models
If you're running open-source models on your own infrastructure, quantisation is essential. Moving from FP16 to INT4 quantisation on Llama 3 70B reduces VRAM requirements by roughly 4x, meaning you can run on a single A100 80GB instead of two, cutting hosting costs in half.
Quality impact at INT4: typically less than 2% degradation on benchmarks, and often imperceptible in production. For INT8, quality impact is negligible (less than 0.5% on most benchmarks) with 50% memory savings.
AWQ (Activation-Aware Weight Quantisation) consistently outperforms naive GPTQ quantisation at the same bit-width. For production deployments, it's our default.
Putting It Together: The Cost Audit Framework
Before implementing any of these techniques, run a cost audit:
- Break down costs by call type — which prompts/endpoints are most expensive? (Often 3–4 call types drive 80% of cost)
- Sample and classify queries — what's the actual complexity distribution of real traffic?
- Measure quality baseline — you need a before/after comparison or you're just guessing
- Implement one change at a time — otherwise you can't attribute cost changes to specific interventions
The 68% reduction we achieved wasn't from one silver bullet. It was 41% from cascading + 22% from caching + 14% from prompt compression + 9% from batching, compounding. Each technique required measurement to confirm it was actually working and not degrading quality.
Cost optimisation without quality measurement is just cost-cutting. Cost optimisation with rigorous evaluation is engineering.