AI Research · #cost #latency #routing

LLM Cost Optimization in 2026: Routing, Caching, and Batching

Cost is a product constraint. A practical playbook for cutting LLM spend by 47-80% without degrading UX: route smart, cache strategically, and batch tool work.

15 min · January 26, 2026 · Updated January 27, 2026

TL;DR

  • Cost is a product constraint that affects pricing, margin, and scale
  • Prompt caching can reduce API costs by 45-80% and improve time-to-first-token by 13-31%
  • Semantic caching + budget-aware routing achieves 47% spend reduction in production
  • Route easy tasks to cheap paths; reserve heavy models for hard tasks
  • Cache anything deterministic and frequently repeated
  • Batch tool calls and retrieval queries to reduce overhead
  • Beyond tokens: account for data storage, retrieval, and infrastructure costs

Why Cost Optimization Matters

LLM costs affect more than your cloud bill:

| Impact | Consequence |
| --- | --- |
| Product pricing | High costs = higher prices = smaller market |
| Margin | Thin margins limit growth investment |
| Scale | 10x users = 10x cost without optimization |
| Feature decisions | Expensive features don't ship |
| Competitive position | Cheaper competitors win on price |

The Real Cost Structure

Beyond token costs, production LLM systems have hidden expenses:

| Cost Category | Examples |
| --- | --- |
| Model inference | API calls, per-token pricing |
| Retrieval | Vector DB queries, embedding generation |
| Storage | Conversation logs, embeddings, caches |
| Compute | Preprocessing, postprocessing, orchestration |
| Infrastructure | Load balancing, monitoring, failover |

Holistic cost optimization addresses all of these, not just token spend.


The Three Pillars of Cost Optimization

Pillar 1: Routing

Direct each request to the most cost-effective path that meets quality requirements.

Pillar 2: Caching

Store and reuse results for repeated or similar requests.

Pillar 3: Batching

Combine multiple operations to reduce overhead and improve efficiency.

The combination of all three achieves the best results — 47-80% cost reduction in production systems.


Routing: Cheap by Default, Strong by Exception

The goal isn’t “always use the best model” — it’s best outcome per dollar.

The Routing Decision

For each request, determine:

| Question | Routing Implication |
| --- | --- |
| Is this request simple? | Use cheaper, faster model |
| Does it need tools? | Route to tool-capable model |
| Does it need long context? | Route to large-context model |
| Is quality critical? | Route to best model |
| Is latency critical? | Route to fastest model |

Routing Architecture

Incoming Request
        │
        ▼
┌─────────────────────────────┐
│      Intent Classifier      │
│   (cheap model or rules)    │
└─────────────────────────────┘
        │
        ▼
┌─────────────────────────────┐
│      Complexity Scorer      │
│  (simple/medium/complex)    │
└─────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────┐
│                Model Router                 │
├───────────────┬──────────────┬──────────────┤
│    Simple     │    Medium    │   Complex    │
│  GPT-4o-mini  │    GPT-4o    │ Claude Opus, │
│               │              │    GPT-4o+   │
└───────────────┴──────────────┴──────────────┘
        │
        ▼
Response with quality check
        │
        ▼
Fallback to higher tier if needed

Routing Tiers

| Tier | Model Class | Use Cases | Cost |
| --- | --- | --- | --- |
| Tier 1 | Small/fast (GPT-4o-mini) | Simple Q&A, classification, formatting | $$ |
| Tier 2 | Standard (GPT-4o, Claude 3.5) | General tasks, moderate complexity | $$$ |
| Tier 3 | Premium (GPT-4o+, Claude Opus) | Complex reasoning, critical tasks | $$$$ |

Routing Rules Example

```python
def route_request(request: Request) -> str:
    # Simple classification tasks
    if request.task_type == "classify":
        return "gpt-4o-mini"

    # Long context needs a large-context model
    if request.token_count > 32000:
        return "gpt-4o-128k"

    # Complex reasoning
    if request.complexity_score > 0.8:
        return "claude-opus"

    # Tool-heavy workflows
    if request.requires_tools:
        return "gpt-4o"  # good tool performance

    # Default to the cost-effective tier
    return "gpt-4o-mini"
```

Quality Fallback

Cheap models sometimes fail. Build in fallback:

Request → Tier 1 model → quality check
    ├─ pass       → return response
    ├─ fail       → retry with Tier 2
    └─ still fail → escalate to Tier 3

Track fallback rates to optimize routing rules.
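A minimal sketch of this escalation loop, assuming you supply a `call_model` function and a `passes_quality` check (both hypothetical names, not a specific SDK):

```python
TIERS = ["gpt-4o-mini", "gpt-4o", "claude-opus"]

def answer_with_fallback(prompt, call_model, passes_quality):
    """Try the cheapest tier first; escalate while the quality check fails."""
    for tier, model in enumerate(TIERS):
        response = call_model(model, prompt)
        # The last tier returns unconditionally: there is nowhere left to escalate.
        if passes_quality(response) or tier == len(TIERS) - 1:
            return model, response
```

Logging which `model` ultimately answered gives you the fallback rate for free.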


Caching: The Most Underused Lever

Caching is often the highest-ROI optimization. Research shows:

| Benefit | Impact |
| --- | --- |
| Cost reduction | 45-80% with strategic caching |
| Latency improvement | 13-31% faster time-to-first-token |
| Consistency | Same inputs = same outputs |

What to Cache

| Cache Target | When to Cache | Cache Duration |
| --- | --- | --- |
| Embeddings | Always for repeated documents | Long (until content changes) |
| Retrieval results | For stable documents | Medium (hours to days) |
| Tool outputs | When output doesn't change quickly | Short to medium |
| Final answers | For identical requests | Short (minutes to hours) |
| System prompts | Static prompts | Long |

Prompt Caching

Major providers (OpenAI, Anthropic, Google) offer prompt caching that significantly reduces costs for repeated system prompts.

Best practices for prompt caching:

| Practice | Why |
| --- | --- |
| Place dynamic content at end | Maximizes cached prefix |
| Avoid dynamic function calling | Invalidates cache |
| Exclude dynamic tool results | Keep static portions cacheable |
| Consistent system prompts | Same prompt = cache hit |

Warning: Naive caching can paradoxically increase latency if cache blocks are positioned poorly. Test your caching strategy carefully.
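To illustrate the "dynamic content last" rule, here is a hypothetical `build_messages` helper that keeps the cacheable prefix byte-identical across requests; providers generally match cached prefixes on exact content, so any variation in the early parts defeats the cache:

```python
def build_messages(static_system: str, stable_docs: list[str], user_msg: str) -> list[dict]:
    """Order prompt parts so the stable prefix is byte-identical across requests:
    static system prompt and stable reference docs first, dynamic input last."""
    return [
        {"role": "system", "content": static_system + "\n\n" + "\n".join(stable_docs)},
        {"role": "user", "content": user_msg},  # dynamic content goes at the end
    ]
```

Two requests that differ only in the user message now share an identical system block, which is exactly what prefix caching needs.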

Semantic Caching

Beyond exact-match caching, semantic caching identifies similar queries:

Query: "What's the refund policy?"

Embed query

Search cache for similar embeddings

If similarity > threshold:
   Return cached response

Else:
   Generate new response
   Store in cache

Semantic caching trade-offs:

| Benefit | Risk |
| --- | --- |
| Higher cache hit rate | May return slightly wrong answers |
| Works across paraphrases | Similarity threshold is tricky |
| Reduces redundant computation | Requires embedding overhead |
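A toy semantic cache following the flow above, assuming a caller-supplied `embed` function; the linear scan is for illustration only, as production systems would use a vector index:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class SemanticCache:
    def __init__(self, embed, threshold=0.9):
        self.embed, self.threshold = embed, threshold
        self.entries = []  # list of (embedding, cached response)

    def get(self, query):
        q = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        # Return the cached answer only above the similarity threshold
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

The `threshold` is the tricky knob from the table above: too low returns wrong answers, too high loses hits across paraphrases.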

Cache Invalidation

The hard part. Strategies:

| Strategy | Use Case |
| --- | --- |
| TTL (time-to-live) | Content that changes on schedule |
| Event-based | Invalidate when source data changes |
| Version-based | Tie cache to content version |
| Lazy invalidation | Check freshness on read |
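A small sketch combining three of these strategies: TTL, version-based, and lazy invalidation on read (the class and parameter names are illustrative):

```python
import time

class TTLCache:
    """Exact-match cache with TTL plus version-based, lazy invalidation."""

    def __init__(self, ttl_seconds: float):
        self.ttl, self.store = ttl_seconds, {}

    def put(self, key, value, version=0):
        self.store[key] = (value, version, time.monotonic())

    def get(self, key, version=0):
        hit = self.store.get(key)
        if hit is None:
            return None
        value, cached_version, ts = hit
        # Lazy invalidation: check freshness only when the entry is read
        if cached_version != version or time.monotonic() - ts > self.ttl:
            del self.store[key]
            return None
        return value
```

Bumping the `version` you pass to `get` invalidates stale entries without any background cleanup job.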

Batching: Reduce Overhead, Reduce Latency

Batching combines multiple operations to amortize overhead.

What to Batch

| Operation | Batching Approach |
| --- | --- |
| Retrieval queries | Combine multiple queries in one vector search |
| Embeddings | Batch document embedding requests |
| Tool calls | Group independent tool calls |
| Analytics writes | Buffer and batch writes |
| Postprocessing | Process multiple responses together |

Batching Architecture

Multiple Requests
        │
        ▼
┌─────────────────────────────┐
│        Request Queue        │
│      (collect for N ms)     │
└─────────────────────────────┘
        │
        ▼
┌─────────────────────────────┐
│       Batch Processor       │
│      (single API call)      │
└─────────────────────────────┘
        │
        ▼
Distribute responses
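A minimal size-triggered micro-batcher along these lines; a production version would also flush on a timer after N ms, which is omitted here for brevity:

```python
class MicroBatcher:
    """Collect items until the batch is full, then process them in one call."""

    def __init__(self, process_batch, max_size=8):
        self.process_batch, self.max_size = process_batch, max_size
        self.pending = []

    def submit(self, item):
        self.pending.append(item)
        if len(self.pending) >= self.max_size:
            return self.flush()  # returns results when the batch fires

    def flush(self):
        if not self.pending:
            return []
        batch, self.pending = self.pending, []
        return self.process_batch(batch)  # single backend call for the batch
```

`process_batch` is whatever single call your backend supports for a list of inputs, e.g. a batch embedding or batch inference endpoint.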

Tool Call Batching

Instead of sequential tool calls:

❌ Slow:
  call tool_a() → wait → result
  call tool_b() → wait → result
  call tool_c() → wait → result
  Total: 3x latency

Batch independent calls:

✅ Fast:
  call [tool_a, tool_b, tool_c] in parallel
  wait → [result_a, result_b, result_c]
  Total: 1x latency
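With Python's asyncio, fanning out independent tool calls is a short sketch; `tool_calls` here is a list of (coroutine function, argument) pairs:

```python
import asyncio

async def call_tools_parallel(tool_calls):
    """Run independent async tool calls concurrently; total latency is
    roughly the slowest single call, not the sum of all calls."""
    return await asyncio.gather(*(tool(arg) for tool, arg in tool_calls))
```

`asyncio.gather` preserves input order, so results line up with the calls that produced them. This only works for calls with no data dependencies between them; dependent calls must still run sequentially.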

Embedding Batching

❌ Expensive:
  embed(doc_1) → wait
  embed(doc_2) → wait
  embed(doc_3) → wait
  3 API calls, 3x overhead

✅ Efficient:
  embed([doc_1, doc_2, doc_3]) → wait
  1 API call, shared overhead
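A sketch of chunked embedding, where `embed_batch` stands in for a provider's batch embedding endpoint (most accept a list of texts per request, with a per-request limit):

```python
def embed_corpus(texts, embed_batch, batch_size=64):
    """Embed a corpus in fixed-size batches rather than one call per text."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        # One API call per chunk instead of one per document
        vectors.extend(embed_batch(texts[i:i + batch_size]))
    return vectors
```

For 1,000 documents with `batch_size=64`, this is 16 API calls instead of 1,000.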

Budget-Aware Routing

Combine routing with cost awareness:

Cost Tracking

Track spend in real-time:

```python
class CostTracker:
    PRICES = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50}  # $/1M tokens, illustrative

    def __init__(self, daily_budget: float):
        self.daily_budget, self.daily_spend = daily_budget, 0.0

    def track_request(self, model: str, tokens: int):
        cost = self.PRICES[model] * tokens / 1_000_000
        self.daily_spend += cost
        if self.daily_spend >= 0.8 * self.daily_budget:
            print("warning: 80% of daily budget consumed")
```

Budget Enforcement

| Strategy | Implementation |
| --- | --- |
| Soft limits | Alert when approaching budget |
| Hard limits | Downgrade models when over budget |
| Rate limiting | Slow down requests at budget threshold |
| Feature gating | Disable expensive features when over budget |

Dynamic Routing Based on Budget

```python
def route_with_budget(request: Request) -> str:
    remaining_budget = get_remaining_daily_budget()

    if remaining_budget < threshold_critical:
        # Emergency mode: cheapest only
        return "gpt-4o-mini"

    if remaining_budget < threshold_warning:
        # Conservative mode: avoid premium unless clearly needed
        if request.complexity < 0.9:
            return "gpt-4o-mini"

    # Normal routing
    return route_by_complexity(request)
```

Measuring Cost Optimization

Key Metrics

| Metric | Definition | Target |
| --- | --- | --- |
| Cost per request | Total cost / requests | Minimize |
| Cost per task | Total cost / completed tasks | Track trend |
| Cache hit rate | Cache hits / total requests | Maximize (≥40%) |
| Routing accuracy | Right model chosen / total | Maximize (≥90%) |
| Fallback rate | Fallbacks / total requests | Minimize (<10%) |
| Quality at cost | Quality score / cost | Maximize |
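A small helper that computes several of these metrics from per-request log events; the event schema here is illustrative:

```python
def cost_metrics(events):
    """Aggregate per-request log events into headline cost metrics.
    Each event is a dict with 'cost', 'cache_hit', and 'fallback' keys."""
    n = len(events)
    return {
        "cost_per_request": sum(e["cost"] for e in events) / n,
        "cache_hit_rate": sum(e["cache_hit"] for e in events) / n,
        "fallback_rate": sum(e["fallback"] for e in events) / n,
    }
```

Run this over a day's logs and alert when cache hit rate drops or fallback rate climbs week over week.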

Cost Attribution

Know where spend goes:

Total Monthly Spend: $10,000
├── Model inference: $6,500 (65%)
│   ├── GPT-4o: $4,000
│   ├── GPT-4o-mini: $1,500
│   └── Claude Opus: $1,000
├── Embeddings: $1,500 (15%)
├── Vector DB: $1,000 (10%)
├── Storage: $500 (5%)
└── Other: $500 (5%)

Implementation Playbook

Phase 1: Baseline (Week 1)

  • Instrument all LLM calls with cost tracking
  • Measure current cost per request
  • Identify highest-cost operations
  • Establish quality baselines

Phase 2: Quick Wins (Week 2-3)

  • Enable provider prompt caching
  • Add exact-match response caching
  • Batch embedding requests
  • Review and optimize system prompts

Phase 3: Routing (Week 4-5)

  • Build complexity classifier
  • Implement tiered model routing
  • Add quality fallback logic
  • Monitor routing accuracy

Phase 4: Advanced Caching (Week 6-7)

  • Implement semantic caching
  • Add retrieval result caching
  • Build cache invalidation logic
  • Tune cache TTLs

Phase 5: Optimization (Ongoing)

  • Weekly cost reviews
  • A/B test routing rules
  • Tune cache thresholds
  • Model cost/quality trade-offs

Anti-Patterns to Avoid

Anti-Pattern 1: Over-Caching

| Problem | Why It Hurts |
| --- | --- |
| Caching dynamic content | Stale answers |
| Too-long TTLs | Outdated responses |
| Caching without invalidation | Data inconsistency |

Anti-Pattern 2: Wrong Routing Signals

| Bad Signal | Why It Fails |
| --- | --- |
| Message length | Long ≠ complex |
| User tier | Premium users may have simple requests |
| Time of day | No correlation with complexity |

Anti-Pattern 3: Ignoring Quality

| Problem | Consequence |
| --- | --- |
| Optimizing cost only | Quality drops, users churn |
| No quality monitoring | Silent degradation |
| No fallback path | Cheap model failures unrecovered |

FAQ

Does caching hurt personalization?

Only if you cache the wrong layer. Cache deterministic sub-results (embeddings, retrieval, tool outputs), not user-specific decisions. The final response generation should be fresh.

How do I know if my routing is working?

Track three metrics:

  • Fallback rate (should be <10%)
  • Quality score per tier (should meet thresholds)
  • Cost per task (should trend down)

What’s the right cache hit rate target?

Depends on your use case:

  • FAQ-heavy applications: 40-60%
  • Dynamic conversations: 10-30%
  • Workflow automation: 30-50%

Should I build or buy routing/caching?

| Build | Buy |
| --- | --- |
| Custom routing logic | Standard caching layers |
| Quality evaluation | Provider prompt caching |
| Domain-specific needs | Generic infrastructure |

How do I balance cost and quality?

  1. Define minimum quality thresholds
  2. Optimize cost within those constraints
  3. Monitor quality continuously
  4. Alert on quality degradation
  5. Have fallback paths to higher-quality models

What about fine-tuned models for cost?

Fine-tuned smaller models can reduce costs for high-volume, narrow use cases. Prerequisites:

  • Sufficient training data
  • Well-defined task
  • High volume to amortize training cost

