LLM Cost Optimization in 2026: Routing, Caching, and Batching
Cost is a product constraint. A practical playbook for reducing LLM spend by 47-80% without degrading UX: route smart, cache strategically, and batch tool work.
TL;DR
- Cost is a product constraint that affects pricing, margin, and scale
- Prompt caching can reduce API costs by 45-80% and improve time-to-first-token by 13-31%
- Semantic caching + budget-aware routing achieves 47% spend reduction in production
- Route easy tasks to cheap paths; reserve heavy models for hard tasks
- Cache anything deterministic and frequently repeated
- Batch tool calls and retrieval queries to reduce overhead
- Beyond tokens: account for data storage, retrieval, and infrastructure costs
Why Cost Optimization Matters
LLM costs affect more than your cloud bill:
| Impact | Consequence |
|---|---|
| Product pricing | High costs = higher prices = smaller market |
| Margin | Thin margins limit growth investment |
| Scale | 10x users = 10x cost without optimization |
| Feature decisions | Expensive features don’t ship |
| Competitive position | Cheaper competitors win on price |
The Real Cost Structure
Beyond token costs, production LLM systems have hidden expenses:
| Cost Category | Examples |
|---|---|
| Model inference | API calls, per-token pricing |
| Retrieval | Vector DB queries, embedding generation |
| Storage | Conversation logs, embeddings, caches |
| Compute | Preprocessing, postprocessing, orchestration |
| Infrastructure | Load balancing, monitoring, failover |
Holistic cost optimization addresses all of these, not just token spend.
The Three Pillars of Cost Optimization
Pillar 1: Routing
Direct each request to the most cost-effective path that meets quality requirements.
Pillar 2: Caching
Store and reuse results for repeated or similar requests.
Pillar 3: Batching
Combine multiple operations to reduce overhead and improve efficiency.
The combination of all three achieves the best results: a 47-80% cost reduction in production systems.
Routing: Cheap by Default, Strong by Exception
The goal isn’t “always use the best model” — it’s best outcome per dollar.
The Routing Decision
For each request, determine:
| Question | Routing Implication |
|---|---|
| Is this request simple? | Use cheaper, faster model |
| Does it need tools? | Route to tool-capable model |
| Does it need long context? | Route to large-context model |
| Is quality critical? | Route to best model |
| Is latency critical? | Route to fastest model |
Routing Architecture
Incoming Request
↓
┌────────────────────────────┐
│ Intent Classifier │
│ (cheap model or rules) │
└────────────────────────────┘
↓
┌────────────────────────────┐
│ Complexity Scorer │
│ (simple/medium/complex) │
└────────────────────────────┘
↓
┌────────────────────────────────────────────┐
│ Model Router │
├──────────┬──────────────┬──────────────────┤
│ Simple │ Medium │ Complex │
│ GPT-4o- │ GPT-4o │ Claude Opus │
│ mini │ │ GPT-4o+ │
└──────────┴──────────────┴──────────────────┘
↓
Response with quality check
↓
Fallback to higher tier if needed
Routing Tiers
| Tier | Model Class | Use Cases | Cost |
|---|---|---|---|
| Tier 1 | Small/fast (GPT-4o-mini) | Simple Q&A, classification, formatting | $$ |
| Tier 2 | Standard (GPT-4o, Claude 3.5) | General tasks, moderate complexity | $$$ |
| Tier 3 | Premium (GPT-4o+, Claude Opus) | Complex reasoning, critical tasks | $$$$ |
Routing Rules Example
def route_request(request: Request) -> str:
    # Simple classification tasks
    if request.task_type == "classify":
        return "gpt-4o-mini"
    # Long context needs specific model
    if request.token_count > 32000:
        return "gpt-4o-128k"
    # Complex reasoning
    if request.complexity_score > 0.8:
        return "claude-opus"
    # Tool-heavy workflows
    if request.requires_tools:
        return "gpt-4o"  # Good tool performance
    # Default to cost-effective
    return "gpt-4o-mini"
Quality Fallback
Cheap models sometimes fail. Build in a fallback path:
Request → Tier 1 model
↓
Quality check
↓
Pass? → Return response
↓
Fail? → Retry with Tier 2
↓
Still fail? → Tier 3
Track fallback rates to optimize routing rules.
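The tiered fallback above can be sketched as a loop that escalates on a failed quality check. This is a minimal sketch: `call_model` and `passes_quality` are placeholders for your inference client and quality evaluator, and the tier names are illustrative.

```python
# Hypothetical tier order; substitute your own model identifiers.
TIERS = ["gpt-4o-mini", "gpt-4o", "claude-opus"]

def answer_with_fallback(prompt, call_model, passes_quality):
    """Try each tier in order; escalate when the quality check fails.

    call_model(model, prompt) -> response text
    passes_quality(response) -> bool
    """
    last_response = None
    for model in TIERS:
        last_response = call_model(model, prompt)
        if passes_quality(last_response):
            return model, last_response
    # Every tier failed the check; return the highest-tier answer anyway.
    return TIERS[-1], last_response
```

Logging which tier each request resolved at gives you the fallback-rate metric directly.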
Caching: The Most Underused Lever
Caching is often the highest-ROI optimization. Research shows:
| Benefit | Impact |
|---|---|
| Cost reduction | 45-80% with strategic caching |
| Latency improvement | 13-31% faster time-to-first-token |
| Consistency | Same inputs = same outputs |
What to Cache
| Cache Target | When to Cache | Cache Duration |
|---|---|---|
| Embeddings | Always for repeated documents | Long (until content changes) |
| Retrieval results | For stable documents | Medium (hours to days) |
| Tool outputs | When output doesn’t change quickly | Short to medium |
| Final answers | For identical requests | Short (minutes to hours) |
| System prompts | Static prompts | Long |
Prompt Caching
Major providers (OpenAI, Anthropic, Google) offer prompt caching that significantly reduces costs for repeated system prompts.
Best practices for prompt caching:
| Practice | Why |
|---|---|
| Place dynamic content at end | Maximizes cached prefix |
| Avoid dynamic function calling | Invalidates cache |
| Exclude dynamic tool results | Keep static portions cacheable |
| Consistent system prompts | Same prompt = cache hit |
Warning: Naive caching can paradoxically increase latency if cache blocks are positioned poorly. Test your caching strategy carefully.
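One way to apply the "dynamic content at end" rule is to assemble messages so the static system prompt always forms a stable prefix. This is a provider-agnostic sketch; the exact message shape and caching semantics of your provider's API may differ.

```python
# Static instructions go first so the cached prefix never changes.
STATIC_SYSTEM = (
    "You are a support assistant. Answer from the policy document "
    "you have been given, and say so when the answer is not covered."
)

def build_messages(user_query: str, tool_results: str = "") -> list[dict]:
    """Assemble messages with all dynamic material after the static prefix."""
    messages = [{"role": "system", "content": STATIC_SYSTEM}]  # cacheable prefix
    if tool_results:
        # Dynamic tool output is appended after the prefix, so it cannot
        # invalidate the cached portion.
        messages.append({"role": "user", "content": f"Tool results:\n{tool_results}"})
    messages.append({"role": "user", "content": user_query})
    return messages
```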
Semantic Caching
Beyond exact-match caching, semantic caching identifies similar queries:
Query: "What's the refund policy?"
↓
Embed query
↓
Search cache for similar embeddings
↓
If similarity > threshold:
Return cached response
↓
Else:
Generate new response
Store in cache
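A minimal in-memory version of this flow, assuming an `embed` function you supply and cosine similarity as the distance measure:

```python
import math

class SemanticCache:
    """Sketch of a semantic cache: nearest-neighbor lookup over embeddings."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # fn(str) -> list[float]
        self.threshold = threshold  # similarity cutoff for a cache hit
        self.entries = []           # list of (embedding, response)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get(self, query):
        q = self.embed(query)
        best = max(self.entries, key=lambda e: self._cosine(q, e[0]), default=None)
        if best and self._cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None  # miss

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

A production version would use a vector index instead of a linear scan, but the hit/miss logic is the same.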
Semantic caching trade-offs:
| Benefit | Risk |
|---|---|
| Higher cache hit rate | May return slightly wrong answers |
| Works across paraphrases | Similarity threshold is tricky |
| Reduces redundant computation | Requires embedding overhead |
Cache Invalidation
The hard part. Strategies:
| Strategy | Use Case |
|---|---|
| TTL (time-to-live) | Content that changes on schedule |
| Event-based | Invalidate when source data changes |
| Version-based | Tie cache to content version |
| Lazy invalidation | Check freshness on read |
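TTL and version-based invalidation compose naturally when the cache key includes the source content's version: a version bump makes old entries unreachable, and the TTL bounds staleness in between. A sketch:

```python
import hashlib
import time

class VersionedCache:
    """Sketch combining version-based keys with a TTL."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, stored_at)

    @staticmethod
    def _key(request: str, content_version: str) -> str:
        # The version is part of the key, so stale entries simply stop matching.
        return hashlib.sha256(f"{content_version}:{request}".encode()).hexdigest()

    def get(self, request, content_version, now=None):
        now = time.time() if now is None else now
        entry = self.store.get(self._key(request, content_version))
        if entry and now - entry[1] < self.ttl:
            return entry[0]
        return None  # miss: absent, expired, or version changed

    def put(self, request, content_version, value, now=None):
        now = time.time() if now is None else now
        self.store[self._key(request, content_version)] = (value, now)
```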
Batching: Reduce Overhead, Reduce Latency
Batching combines multiple operations to amortize overhead.
What to Batch
| Operation | Batching Approach |
|---|---|
| Retrieval queries | Combine multiple queries in one vector search |
| Embeddings | Batch document embedding requests |
| Tool calls | Group independent tool calls |
| Analytics writes | Buffer and batch writes |
| Postprocessing | Process multiple responses together |
Batching Architecture
Multiple Requests
↓
┌────────────────────────────┐
│ Request Queue │
│ (collect for N ms) │
└────────────────────────────┘
↓
┌────────────────────────────┐
│ Batch Processor │
│ (single API call) │
└────────────────────────────┘
↓
Distribute responses
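A simplified synchronous version of the queue-and-flush pattern above; a production batcher would flush from a background thread or event loop rather than inside `submit`:

```python
import time

class MicroBatcher:
    """Sketch: collect items until a size or time limit, then process as one call."""

    def __init__(self, process_batch, max_size=8, max_wait_s=0.05, clock=time.monotonic):
        self.process_batch = process_batch  # fn(list) -> list of results
        self.max_size = max_size
        self.max_wait_s = max_wait_s
        self.clock = clock
        self.pending = []
        self.first_arrival = None

    def submit(self, item):
        if not self.pending:
            self.first_arrival = self.clock()
        self.pending.append(item)
        if (len(self.pending) >= self.max_size
                or self.clock() - self.first_arrival >= self.max_wait_s):
            return self.flush()
        return None  # still collecting

    def flush(self):
        batch, self.pending = self.pending, []
        return self.process_batch(batch)
```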
Tool Call Batching
Instead of sequential tool calls:
❌ Slow:
call tool_a() → wait → result
call tool_b() → wait → result
call tool_c() → wait → result
Total: 3x latency
Batch independent calls:
✅ Fast:
call [tool_a, tool_b, tool_c] in parallel
wait → [result_a, result_b, result_c]
Total: 1x latency
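For independent tool calls, a thread pool gives the parallel behavior sketched above, so total latency is roughly the slowest single call rather than the sum:

```python
from concurrent.futures import ThreadPoolExecutor

def call_tools_parallel(calls):
    """Run independent tool calls concurrently.

    calls: list of (fn, args) pairs; results come back in submission order.
    """
    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        futures = [pool.submit(fn, *args) for fn, args in calls]
        return [f.result() for f in futures]
```

This only applies when the calls have no data dependencies; calls whose inputs depend on another call's output must still run sequentially.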
Embedding Batching
❌ Expensive:
embed(doc_1) → wait
embed(doc_2) → wait
embed(doc_3) → wait
3 API calls, 3x overhead
✅ Efficient:
embed([doc_1, doc_2, doc_3]) → wait
1 API call, shared overhead
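Chunking documents into fixed-size batch requests is only a few lines; here `embed_batch` stands in for a provider's batch embedding endpoint and `batch_size` should respect its per-request limit:

```python
def embed_all(docs, embed_batch, batch_size=64):
    """Embed documents with one API call per chunk instead of one per document."""
    vectors = []
    for i in range(0, len(docs), batch_size):
        vectors.extend(embed_batch(docs[i:i + batch_size]))
    return vectors
```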
Budget-Aware Routing
Combine routing with cost awareness:
Cost Tracking
Track spend in real-time:
class CostTracker:
    def track_request(self, model: str, tokens: int):
        cost = self.calculate_cost(model, tokens)
        self.daily_spend += cost
        self.check_budget_alerts()
Budget Enforcement
| Strategy | Implementation |
|---|---|
| Soft limits | Alert when approaching budget |
| Hard limits | Downgrade models when over budget |
| Rate limiting | Slow down requests at budget threshold |
| Feature gating | Disable expensive features when over budget |
Dynamic Routing Based on Budget
def route_with_budget(request: Request) -> str:
    remaining_budget = get_remaining_daily_budget()
    if remaining_budget < threshold_critical:
        # Emergency mode: cheapest only
        return "gpt-4o-mini"
    if remaining_budget < threshold_warning:
        # Conservative mode: avoid premium
        if request.complexity < 0.9:
            return "gpt-4o-mini"
    # Normal routing
    return route_by_complexity(request)
Measuring Cost Optimization
Key Metrics
| Metric | Definition | Target |
|---|---|---|
| Cost per request | Total cost / requests | Minimize |
| Cost per task | Total cost / completed tasks | Track trend |
| Cache hit rate | Cache hits / total requests | Maximize (≥40%) |
| Routing accuracy | Right model chosen / total | Maximize (≥90%) |
| Fallback rate | Fallbacks / total requests | Minimize (<10%) |
| Quality at cost | Quality score / cost | Maximize |
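Most of these metrics fall out of a simple per-request log. A sketch, assuming each log record carries hypothetical `cost`, `completed`, `cache_hit`, and `fell_back` fields:

```python
def cost_metrics(log):
    """Derive the key cost metrics from a list of per-request records."""
    n = len(log)
    total_cost = sum(r["cost"] for r in log)
    return {
        "cost_per_request": total_cost / n,
        # Guard against division by zero when no tasks completed.
        "cost_per_task": total_cost / max(sum(r["completed"] for r in log), 1),
        "cache_hit_rate": sum(r["cache_hit"] for r in log) / n,
        "fallback_rate": sum(r["fell_back"] for r in log) / n,
    }
```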
Cost Attribution
Know where spend goes:
Total Monthly Spend: $10,000
├── Model inference: $6,500 (65%)
│ ├── GPT-4o: $4,000
│ ├── GPT-4o-mini: $1,500
│ └── Claude Opus: $1,000
├── Embeddings: $1,500 (15%)
├── Vector DB: $1,000 (10%)
├── Storage: $500 (5%)
└── Other: $500 (5%)
Implementation Playbook
Phase 1: Baseline (Week 1)
- Instrument all LLM calls with cost tracking
- Measure current cost per request
- Identify highest-cost operations
- Establish quality baselines
Phase 2: Quick Wins (Weeks 2-3)
- Enable provider prompt caching
- Add exact-match response caching
- Batch embedding requests
- Review and optimize system prompts
Phase 3: Routing (Weeks 4-5)
- Build complexity classifier
- Implement tiered model routing
- Add quality fallback logic
- Monitor routing accuracy
Phase 4: Advanced Caching (Weeks 6-7)
- Implement semantic caching
- Add retrieval result caching
- Build cache invalidation logic
- Tune cache TTLs
Phase 5: Optimization (Ongoing)
- Weekly cost reviews
- A/B test routing rules
- Tune cache thresholds
- Model cost/quality trade-offs
Anti-Patterns to Avoid
Anti-Pattern 1: Over-Caching
| Problem | Why It Hurts |
|---|---|
| Caching dynamic content | Stale answers |
| Too-long TTLs | Outdated responses |
| Caching without invalidation | Data inconsistency |
Anti-Pattern 2: Wrong Routing Signals
| Bad Signal | Why It Fails |
|---|---|
| Message length | Long ≠ complex |
| User tier | Premium users may have simple requests |
| Time of day | No correlation with complexity |
Anti-Pattern 3: Ignoring Quality
| Problem | Consequence |
|---|---|
| Optimizing cost only | Quality drops, users churn |
| No quality monitoring | Silent degradation |
| No fallback path | Cheap model failures unrecovered |
FAQ
Does caching hurt personalization?
Only if you cache the wrong layer. Cache deterministic sub-results (embeddings, retrieval, tool outputs), not user-specific decisions. The final response generation should be fresh.
How do I know if my routing is working?
Track three metrics:
- Fallback rate (should be <10%)
- Quality score per tier (should meet thresholds)
- Cost per task (should trend down)
What’s the right cache hit rate target?
Depends on your use case:
- FAQ-heavy applications: 40-60%
- Dynamic conversations: 10-30%
- Workflow automation: 30-50%
Should I build or buy routing/caching?
| Build | Buy |
|---|---|
| Custom routing logic | Standard caching layers |
| Quality evaluation | Provider prompt caching |
| Domain-specific needs | Generic infrastructure |
How do I balance cost and quality?
- Define minimum quality thresholds
- Optimize cost within those constraints
- Monitor quality continuously
- Alert on quality degradation
- Have fallback paths to higher-quality models
What about fine-tuned models for cost?
Fine-tuned smaller models can reduce costs for high-volume, narrow use cases. Prerequisites:
- Sufficient training data
- Well-defined task
- High volume to amortize training cost