LLM Cost Optimization in 2026: Routing, Caching, and Batching
Cost is a product constraint. A practical playbook for reducing LLM spend by 47-80% without degrading UX: route smart, cache strategically, and batch tool work.
TL;DR
- Cost is a product constraint that affects pricing, margin, and scale
- Prompt caching can reduce API costs by 45-80% and improve time-to-first-token by 13-31%
- Semantic caching + budget-aware routing achieves 47% spend reduction in production
- Route easy tasks to cheap paths; reserve heavy models for hard tasks
- Cache anything deterministic and frequently repeated
- Batch tool calls and retrieval queries to reduce overhead
- Beyond tokens: account for data storage, retrieval, and infrastructure costs
Why Cost Optimization Matters
LLM costs affect more than your cloud bill:
| Impact | Consequence |
|---|---|
| Product pricing | High costs = higher prices = smaller market |
| Margin | Thin margins limit growth investment |
| Scale | 10x users = 10x cost without optimization |
| Feature decisions | Expensive features don’t ship |
| Competitive position | Cheaper competitors win on price |
The Real Cost Structure
Beyond token costs, production LLM systems have hidden expenses:
| Cost Category | Examples |
|---|---|
| Model inference | API calls, per-token pricing |
| Retrieval | Vector DB queries, embedding generation |
| Storage | Conversation logs, embeddings, caches |
| Compute | Preprocessing, postprocessing, orchestration |
| Infrastructure | Load balancing, monitoring, failover |
Holistic cost optimization addresses all of these, not just token spend.
The Three Pillars of Cost Optimization
Pillar 1: Routing
Direct each request to the most cost-effective path that meets quality requirements.
Pillar 2: Caching
Store and reuse results for repeated or similar requests.
Pillar 3: Batching
Combine multiple operations to reduce overhead and improve efficiency.
The combination of all three achieves the best results: a 47-80% cost reduction in production systems.
Routing: Cheap by Default, Strong by Exception
The goal isn’t “always use the best model” — it’s best outcome per dollar.
The Routing Decision
For each request, determine:
| Question | Routing Implication |
|---|---|
| Is this request simple? | Use cheaper, faster model |
| Does it need tools? | Route to tool-capable model |
| Does it need long context? | Route to large-context model |
| Is quality critical? | Route to best model |
| Is latency critical? | Route to fastest model |
Routing Architecture
Incoming Request
↓
┌────────────────────────────┐
│ Intent Classifier │
│ (cheap model or rules) │
└────────────────────────────┘
↓
┌────────────────────────────┐
│ Complexity Scorer │
│ (simple/medium/complex) │
└────────────────────────────┘
↓
┌────────────────────────────────────────────┐
│ Model Router │
├──────────┬──────────────┬──────────────────┤
│ Simple │ Medium │ Complex │
│ GPT-4o- │ GPT-4o │ Claude Opus │
│ mini │ │ GPT-4o+ │
└──────────┴──────────────┴──────────────────┘
↓
Response with quality check
↓
Fallback to higher tier if needed
Routing Tiers
| Tier | Model Class | Use Cases | Cost |
|---|---|---|---|
| Tier 1 | Small/fast (GPT-4o-mini) | Simple Q&A, classification, formatting | $$ |
| Tier 2 | Standard (GPT-4o, Claude 3.5) | General tasks, moderate complexity | $$$ |
| Tier 3 | Premium (GPT-4o+, Claude Opus) | Complex reasoning, critical tasks | $$$$ |
Routing Rules Example
def route_request(request: Request) -> str:
    # Simple classification tasks
    if request.task_type == "classify":
        return "gpt-4o-mini"
    # Long context needs specific model
    if request.token_count > 32000:
        return "gpt-4o-128k"
    # Complex reasoning
    if request.complexity_score > 0.8:
        return "claude-opus"
    # Tool-heavy workflows
    if request.requires_tools:
        return "gpt-4o"  # Good tool performance
    # Default to cost-effective
    return "gpt-4o-mini"
Quality Fallback
Cheap models sometimes fail. Build in a fallback path:
Request → Tier 1 model
↓
Quality check
↓
Pass? → Return response
↓
Fail? → Retry with Tier 2
↓
Still fail? → Tier 3
Track fallback rates to optimize routing rules.
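The tiered fallback above can be sketched as a loop that escalates on a failed quality check. This is a minimal sketch: `call_model` and `passes_quality` are placeholders for your inference client and quality evaluator, and the tier names are illustrative.

```python
# Hypothetical tier order; substitute your own model identifiers.
TIERS = ["gpt-4o-mini", "gpt-4o", "claude-opus"]

def answer_with_fallback(prompt, call_model, passes_quality):
    """Try each tier in order; escalate when the quality check fails.

    call_model(model, prompt) -> response text
    passes_quality(response) -> bool
    """
    last_response = None
    for model in TIERS:
        last_response = call_model(model, prompt)
        if passes_quality(last_response):
            return model, last_response
    # Every tier failed the check; return the highest-tier answer anyway.
    return TIERS[-1], last_response
```

Logging which tier each request resolved at gives you the fallback-rate metric directly.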
Caching: The Most Underused Lever
Caching is often the highest-ROI optimization. Research shows:
| Benefit | Impact |
|---|---|
| Cost reduction | 45-80% with strategic caching |
| Latency improvement | 13-31% faster time-to-first-token |
| Consistency | Same inputs = same outputs |
What to Cache
| Cache Target | When to Cache | Cache Duration |
|---|---|---|
| Embeddings | Always for repeated documents | Long (until content changes) |
| Retrieval results | For stable documents | Medium (hours to days) |
| Tool outputs | When output doesn’t change quickly | Short to medium |
| Final answers | For identical requests | Short (minutes to hours) |
| System prompts | Static prompts | Long |
Prompt Caching
Major providers (OpenAI, Anthropic, Google) offer prompt caching that significantly reduces costs for repeated system prompts.
Best practices for prompt caching:
| Practice | Why |
|---|---|
| Place dynamic content at end | Maximizes cached prefix |
| Avoid dynamic function calling | Invalidates cache |
| Exclude dynamic tool results | Keep static portions cacheable |
| Consistent system prompts | Same prompt = cache hit |
Warning: Naive caching can paradoxically increase latency if cache blocks are positioned poorly. Test your caching strategy carefully.
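One way to apply the "dynamic content at end" rule is to assemble messages so the static system prompt always forms a stable prefix. This is a provider-agnostic sketch; the exact message shape and caching semantics of your provider's API may differ.

```python
# Static instructions go first so the cached prefix never changes.
STATIC_SYSTEM = (
    "You are a support assistant. Answer from the policy document "
    "you have been given, and say so when the answer is not covered."
)

def build_messages(user_query: str, tool_results: str = "") -> list[dict]:
    """Assemble messages with all dynamic material after the static prefix."""
    messages = [{"role": "system", "content": STATIC_SYSTEM}]  # cacheable prefix
    if tool_results:
        # Dynamic tool output is appended after the prefix, so it cannot
        # invalidate the cached portion.
        messages.append({"role": "user", "content": f"Tool results:\n{tool_results}"})
    messages.append({"role": "user", "content": user_query})
    return messages
```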
Semantic Caching
Beyond exact-match caching, semantic caching identifies similar queries:
Query: "What's the refund policy?"
↓
Embed query
↓
Search cache for similar embeddings
↓
If similarity > threshold:
Return cached response
↓
Else:
Generate new response
Store in cache
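A minimal in-memory version of this flow, assuming an `embed` function you supply and cosine similarity as the distance measure:

```python
import math

class SemanticCache:
    """Sketch of a semantic cache: nearest-neighbor lookup over embeddings."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # fn(str) -> list[float]
        self.threshold = threshold  # similarity cutoff for a cache hit
        self.entries = []           # list of (embedding, response)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get(self, query):
        q = self.embed(query)
        best = max(self.entries, key=lambda e: self._cosine(q, e[0]), default=None)
        if best and self._cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None  # miss

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

A production version would use a vector index instead of a linear scan, but the hit/miss logic is the same.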
Semantic caching trade-offs:
| Benefit | Risk |
|---|---|
| Higher cache hit rate | May return slightly wrong answers |
| Works across paraphrases | Similarity threshold is tricky |
| Reduces redundant computation | Requires embedding overhead |
Cache Invalidation
The hard part. Strategies:
| Strategy | Use Case |
|---|---|
| TTL (time-to-live) | Content that changes on schedule |
| Event-based | Invalidate when source data changes |
| Version-based | Tie cache to content version |
| Lazy invalidation | Check freshness on read |
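TTL and version-based invalidation compose naturally when the cache key includes the source content's version: a version bump makes old entries unreachable, and the TTL bounds staleness in between. A sketch:

```python
import hashlib
import time

class VersionedCache:
    """Sketch combining version-based keys with a TTL."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, stored_at)

    @staticmethod
    def _key(request: str, content_version: str) -> str:
        # The version is part of the key, so stale entries simply stop matching.
        return hashlib.sha256(f"{content_version}:{request}".encode()).hexdigest()

    def get(self, request, content_version, now=None):
        now = time.time() if now is None else now
        entry = self.store.get(self._key(request, content_version))
        if entry and now - entry[1] < self.ttl:
            return entry[0]
        return None  # miss: absent, expired, or version changed

    def put(self, request, content_version, value, now=None):
        now = time.time() if now is None else now
        self.store[self._key(request, content_version)] = (value, now)
```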
Batching: Reduce Overhead, Reduce Latency
Batching combines multiple operations to amortize overhead.
What to Batch
| Operation | Batching Approach |
|---|---|
| Retrieval queries | Combine multiple queries in one vector search |
| Embeddings | Batch document embedding requests |
| Tool calls | Group independent tool calls |
| Analytics writes | Buffer and batch writes |
| Postprocessing | Process multiple responses together |
Batching Architecture
Multiple Requests
↓
┌────────────────────────────┐
│ Request Queue │
│ (collect for N ms) │
└────────────────────────────┘
↓
┌────────────────────────────┐
│ Batch Processor │
│ (single API call) │
└────────────────────────────┘
↓
Distribute responses
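A simplified synchronous version of the queue-and-flush pattern above; a production batcher would flush from a background thread or event loop rather than inside `submit`:

```python
import time

class MicroBatcher:
    """Sketch: collect items until a size or time limit, then process as one call."""

    def __init__(self, process_batch, max_size=8, max_wait_s=0.05, clock=time.monotonic):
        self.process_batch = process_batch  # fn(list) -> list of results
        self.max_size = max_size
        self.max_wait_s = max_wait_s
        self.clock = clock
        self.pending = []
        self.first_arrival = None

    def submit(self, item):
        if not self.pending:
            self.first_arrival = self.clock()
        self.pending.append(item)
        if (len(self.pending) >= self.max_size
                or self.clock() - self.first_arrival >= self.max_wait_s):
            return self.flush()
        return None  # still collecting

    def flush(self):
        batch, self.pending = self.pending, []
        return self.process_batch(batch)
```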
Tool Call Batching
Instead of sequential tool calls:
❌ Slow:
call tool_a() → wait → result
call tool_b() → wait → result
call tool_c() → wait → result
Total: 3x latency
Batch independent calls:
✅ Fast:
call [tool_a, tool_b, tool_c] in parallel
wait → [result_a, result_b, result_c]
Total: 1x latency
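For independent tool calls, a thread pool gives the parallel behavior sketched above, so total latency is roughly the slowest single call rather than the sum:

```python
from concurrent.futures import ThreadPoolExecutor

def call_tools_parallel(calls):
    """Run independent tool calls concurrently.

    calls: list of (fn, args) pairs; results come back in submission order.
    """
    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        futures = [pool.submit(fn, *args) for fn, args in calls]
        return [f.result() for f in futures]
```

This only applies when the calls have no data dependencies; calls whose inputs depend on another call's output must still run sequentially.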
Embedding Batching
❌ Expensive:
embed(doc_1) → wait
embed(doc_2) → wait
embed(doc_3) → wait
3 API calls, 3x overhead
✅ Efficient:
embed([doc_1, doc_2, doc_3]) → wait
1 API call, shared overhead
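Chunking documents into fixed-size batch requests is only a few lines; here `embed_batch` stands in for a provider's batch embedding endpoint and `batch_size` should respect its per-request limit:

```python
def embed_all(docs, embed_batch, batch_size=64):
    """Embed documents with one API call per chunk instead of one per document."""
    vectors = []
    for i in range(0, len(docs), batch_size):
        vectors.extend(embed_batch(docs[i:i + batch_size]))
    return vectors
```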
Budget-Aware Routing
Combine routing with cost awareness:
Cost Tracking
Track spend in real-time:
class CostTracker:
    def track_request(self, model: str, tokens: int):
        cost = self.calculate_cost(model, tokens)
        self.daily_spend += cost
        self.check_budget_alerts()
Budget Enforcement
| Strategy | Implementation |
|---|---|
| Soft limits | Alert when approaching budget |
| Hard limits | Downgrade models when over budget |
| Rate limiting | Slow down requests at budget threshold |
| Feature gating | Disable expensive features when over budget |
Dynamic Routing Based on Budget
def route_with_budget(request: Request) -> str:
    remaining_budget = get_remaining_daily_budget()
    if remaining_budget < threshold_critical:
        # Emergency mode: cheapest only
        return "gpt-4o-mini"
    if remaining_budget < threshold_warning:
        # Conservative mode: avoid premium
        if request.complexity < 0.9:
            return "gpt-4o-mini"
    # Normal routing
    return route_by_complexity(request)
Measuring Cost Optimization
Key Metrics
| Metric | Definition | Target |
|---|---|---|
| Cost per request | Total cost / requests | Minimize |
| Cost per task | Total cost / completed tasks | Track trend |
| Cache hit rate | Cache hits / total requests | Maximize (≥40%) |
| Routing accuracy | Right model chosen / total | Maximize (≥90%) |
| Fallback rate | Fallbacks / total requests | Minimize (<10%) |
| Quality at cost | Quality score / cost | Maximize |
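Most of these metrics fall out of a simple per-request log. A sketch, assuming each log record carries hypothetical `cost`, `completed`, `cache_hit`, and `fell_back` fields:

```python
def cost_metrics(log):
    """Derive the key cost metrics from a list of per-request records."""
    n = len(log)
    total_cost = sum(r["cost"] for r in log)
    return {
        "cost_per_request": total_cost / n,
        # Guard against division by zero when no tasks completed.
        "cost_per_task": total_cost / max(sum(r["completed"] for r in log), 1),
        "cache_hit_rate": sum(r["cache_hit"] for r in log) / n,
        "fallback_rate": sum(r["fell_back"] for r in log) / n,
    }
```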
Cost Attribution
Know where spend goes:
Total Monthly Spend: $10,000
├── Model inference: $6,500 (65%)
│ ├── GPT-4o: $4,000
│ ├── GPT-4o-mini: $1,500
│ └── Claude Opus: $1,000
├── Embeddings: $1,500 (15%)
├── Vector DB: $1,000 (10%)
├── Storage: $500 (5%)
└── Other: $500 (5%)
Implementation Playbook
Phase 1: Baseline (Week 1)
- Instrument all LLM calls with cost tracking
- Measure current cost per request
- Identify highest-cost operations
- Establish quality baselines
Phase 2: Quick Wins (Weeks 2-3)
- Enable provider prompt caching
- Add exact-match response caching
- Batch embedding requests
- Review and optimize system prompts
Phase 3: Routing (Weeks 4-5)
- Build complexity classifier
- Implement tiered model routing
- Add quality fallback logic
- Monitor routing accuracy
Phase 4: Advanced Caching (Weeks 6-7)
- Implement semantic caching
- Add retrieval result caching
- Build cache invalidation logic
- Tune cache TTLs
Phase 5: Optimization (Ongoing)
- Weekly cost reviews
- A/B test routing rules
- Tune cache thresholds
- Model cost/quality trade-offs
Anti-Patterns to Avoid
Anti-Pattern 1: Over-Caching
| Problem | Why It Hurts |
|---|---|
| Caching dynamic content | Stale answers |
| Too-long TTLs | Outdated responses |
| Caching without invalidation | Data inconsistency |
Anti-Pattern 2: Wrong Routing Signals
| Bad Signal | Why It Fails |
|---|---|
| Message length | Long ≠ complex |
| User tier | Premium users may have simple requests |
| Time of day | No correlation with complexity |
Anti-Pattern 3: Ignoring Quality
| Problem | Consequence |
|---|---|
| Optimizing cost only | Quality drops, users churn |
| No quality monitoring | Silent degradation |
| No fallback path | Cheap model failures unrecovered |
FAQ
Does caching hurt personalization?
Only if you cache the wrong layer. Cache deterministic sub-results (embeddings, retrieval, tool outputs), not user-specific decisions. The final response generation should be fresh.
How do I know if my routing is working?
Track three metrics:
- Fallback rate (should be <10%)
- Quality score per tier (should meet thresholds)
- Cost per task (should trend down)
What’s the right cache hit rate target?
Depends on your use case:
- FAQ-heavy applications: 40-60%
- Dynamic conversations: 10-30%
- Workflow automation: 30-50%
Should I build or buy routing/caching?
| Build | Buy |
|---|---|
| Custom routing logic | Standard caching layers |
| Quality evaluation | Provider prompt caching |
| Domain-specific needs | Generic infrastructure |
How do I balance cost and quality?
- Define minimum quality thresholds
- Optimize cost within those constraints
- Monitor quality continuously
- Alert on quality degradation
- Have fallback paths to higher-quality models
What about fine-tuned models for cost?
Fine-tuned smaller models can reduce costs for high-volume, narrow use cases. Prerequisites:
- Sufficient training data
- Well-defined task
- High volume to amortize training cost