
Agent Observability in 2026: Traces, Costs, and Failure Modes

If you can't see why an agent failed, you can't fix it. A practical observability model for tool-using agents: traces, spans, replay, and the metrics that matter.

14 min · January 4, 2026 · Updated January 27, 2026

TL;DR

  • You need traces for: model calls, tool calls, retrieval, and UI actions
  • Store enough detail to replay failures — not just logs, but reproducible state
  • Key metrics: success rate per workflow, escalation rate, p95 latency, cost per successful completion
  • OpenTelemetry is emerging as the standard for unified agent observability
  • LangSmith and similar platforms provide LLM-specific visualization and debugging
  • Chat transcripts are not enough — you need structured traces to diagnose tool errors

Why Agent Observability Is Different

Traditional application monitoring doesn’t capture what makes agents unique:

| Traditional App | Agent System |
| --- | --- |
| Request → Response | Request → Planning → Tools → Response |
| Deterministic | Non-deterministic |
| Clear error states | "It answered, but wrongly" |
| Stack traces | Decision traces |
| Code bugs | Prompt/context/tool bugs |

The Debugging Challenge

When an agent fails, you need to answer:

  • What was the user’s actual intent?
  • What context did the agent have?
  • Which tools were called with what parameters?
  • What did each tool return?
  • How did the agent interpret those results?
  • Why did it produce this specific output?

Without observability, you’re guessing.


What an Agent Trace Should Include

The Minimum Trace Model

| Layer | What to Capture |
| --- | --- |
| Request | User input, session context, user state |
| Intent | Classified intent, confidence, alternatives |
| Planning | Steps planned, reasoning (if available) |
| Tools | Each call: name, inputs, outputs, timing |
| Retrieval | Query, docs retrieved, relevance scores |
| Generation | Prompt assembled, tokens used, response |
| Validation | Checks run, results, failures |
| Output | Final response, schema validation |

Trace Structure

Trace: agent-request-12345
├── Span: intent-classification (15ms)
│   ├── input: "Refund my order 12345"
│   ├── intent: "refund_request"
│   └── confidence: 0.94
├── Span: context-retrieval (120ms)
│   ├── query: "refund policy order 12345"
│   ├── docs_retrieved: 3
│   └── relevance_scores: [0.89, 0.72, 0.68]
├── Span: tool-call-get_order (85ms)
│   ├── tool: "get_order_status"
│   ├── params: {"order_id": "12345"}
│   └── result: {"status": "delivered", ...}
├── Span: tool-call-initiate_refund (150ms)
│   ├── tool: "initiate_refund"
│   ├── params: {"order_id": "12345", "amount": 99.99}
│   └── result: {"success": true, "refund_id": "ref_789"}
├── Span: response-generation (200ms)
│   ├── tokens_in: 1250
│   ├── tokens_out: 180
│   └── model: "gpt-4o"
└── Span: validation (5ms)
    ├── schema_valid: true
    └── policy_compliant: true
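A trace like this can be captured with a very small data model. The sketch below is illustrative plain Python; the `Trace` and `Span` classes and their field names are my own, not from any particular SDK:

```python
import time
from dataclasses import dataclass, field

@dataclass
class Span:
    """One unit of work in a trace: a model call, tool call, or retrieval."""
    name: str
    attributes: dict = field(default_factory=dict)
    start: float = field(default_factory=time.monotonic)
    duration_ms: float = 0.0

    def end(self):
        self.duration_ms = (time.monotonic() - self.start) * 1000

@dataclass
class Trace:
    """A full agent request: a trace id plus an ordered list of spans."""
    trace_id: str
    spans: list = field(default_factory=list)

    def span(self, name, **attributes):
        s = Span(name=name, attributes=attributes)
        self.spans.append(s)
        return s

# Recording two steps of the refund example above:
trace = Trace(trace_id="agent-request-12345")
s = trace.span("intent-classification",
               input="Refund my order 12345",
               intent="refund_request", confidence=0.94)
s.end()
s = trace.span("tool-call-get_order",
               tool="get_order_status", params={"order_id": "12345"})
s.end()
assert [sp.name for sp in trace.spans] == [
    "intent-classification", "tool-call-get_order"]
```

In practice you would export these spans to a trace store rather than hold them in memory, but the shape of the data is the same.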

What to Store vs. Redact

| Store | Redact |
| --- | --- |
| Intent and classification | PII in user messages |
| Tool names and timing | Sensitive tool parameters |
| Success/failure status | API keys, secrets |
| Token counts | Full conversation (store a reference) |
| Error messages | Credit card numbers |
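Redaction can run at span-write time. A minimal sketch, assuming simple regex patterns and a hand-maintained list of sensitive parameter names (both illustrative; production redaction should use a vetted PII library and tool-specific allowlists):

```python
import re

# Illustrative patterns, not exhaustive.
PATTERNS = {
    "card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

# Hypothetical sensitive parameter names for this sketch.
SENSITIVE_KEYS = {"api_key", "secret", "password", "card_number"}

def redact_text(text: str) -> str:
    """Mask PII patterns in free text before it reaches the trace store."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text

def redact_params(params: dict) -> dict:
    """Keep tool parameter names (useful for debugging), drop sensitive values."""
    return {k: "[REDACTED]" if k in SENSITIVE_KEYS else v
            for k, v in params.items()}

print(redact_text("Contact me at jane@example.com"))
# e.g. "Contact me at [REDACTED_EMAIL]"
```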

The Metrics That Actually Matter

Don’t drown in data. Focus on what drives decisions.

Primary Metrics

| Metric | Definition | Why It Matters |
| --- | --- | --- |
| Success rate (per workflow) | Completed / Attempted | Core reliability |
| Escalation rate | Escalated / Total | Agent capability |
| P95 latency (per workflow) | 95th percentile duration | User experience |
| Cost per successful completion | Total cost / Successes | Unit economics |

Secondary Metrics

| Metric | Definition | Purpose |
| --- | --- | --- |
| Tool call success rate | Successful calls / Total | Tool reliability |
| Retrieval precision | Relevant docs / Retrieved | RAG quality |
| Schema validation rate | Valid / Total | Output quality |
| Policy violation rate | Violations / Total | Safety |
| Retry rate | Retries / Attempts | Error handling |

Metric Hierarchy

North Star: Successful task completion rate

Supporting: Escalation rate, User satisfaction

Diagnostic: Tool success, Latency, Cost

Granular: Token usage, Cache hit rate, Model distribution

Cost Per Outcome

Track not just token cost, but total cost per successful outcome:

Cost per success = (
    Model costs +
    Tool API costs +
    Retrieval costs +
    Infrastructure costs
) / Successful completions

This reveals whether your optimizations actually improve unit economics.
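As a sketch, the formula is a one-liner; the cost buckets and the example figures below are made up for illustration:

```python
def cost_per_success(model_cost, tool_cost, retrieval_cost,
                     infra_cost, successes):
    """Unit economics: total spend divided by successful completions."""
    if successes == 0:
        return float("inf")  # no successes: every dollar is waste
    return (model_cost + tool_cost + retrieval_cost + infra_cost) / successes

# Example: $420 model, $80 tools, $25 retrieval, $75 infra, 1200 successes
print(round(cost_per_success(420, 80, 25, 75, 1200), 2))  # 0.5
```

The denominator matters: dividing by *attempts* instead of *successes* hides the cost of retries and failed runs.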


Replay: The Debugging Superpower

When something breaks, you want to reproduce it exactly.

What Enables Replay

| Component | Requirement |
| --- | --- |
| User input | Exact message, context |
| System state | User info, permissions, prior conversation |
| Tool outputs | Exact responses (cached) |
| Retrieval results | Exact documents (snapshot) |
| Model version | Which model at what time |
| Configuration | Prompt versions, parameters |

Replay Architecture

Production Request
  → Capture full context
  → Store in trace store

On failure investigation:
  Replay engine loads:
    - Original inputs
    - Cached tool outputs
    - Document snapshots
  → Run agent with original context
  → Compare: same output or different?
  → Test fixes safely

Replay vs. Re-run

| Re-run | Replay |
| --- | --- |
| Call live tools again | Use cached tool outputs |
| May get different results | Exactly reproducible |
| Side effects possible | Side-effect free |
| Tests current state | Tests historical state |
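A replay router can enforce the right-hand column by serving recorded outputs and failing loudly when the agent diverges from the original run. This is an illustrative sketch; `ReplayToolRouter` and the call shapes are hypothetical, not from any specific framework:

```python
class ReplayToolRouter:
    """Serves recorded tool outputs instead of calling live tools,
    so a replayed run is deterministic and side-effect free."""

    def __init__(self, recorded_calls):
        # recorded_calls: list of (tool_name, params, output) from the trace
        self._calls = list(recorded_calls)

    def call(self, tool_name, params):
        for i, (name, p, output) in enumerate(self._calls):
            if name == tool_name and p == params:
                self._calls.pop(i)  # consume, so repeated calls match in order
                return output
        raise KeyError(
            f"Replay divergence: agent called {tool_name}({params}) "
            "but no matching recorded call exists")

router = ReplayToolRouter([
    ("get_order_status", {"order_id": "12345"}, {"status": "delivered"}),
])
print(router.call("get_order_status", {"order_id": "12345"}))
# {'status': 'delivered'}
```

The divergence error is the useful part: if a candidate fix makes the agent call a different tool or different parameters, replay tells you exactly where the runs separated.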

OpenTelemetry for Agents

OpenTelemetry (OTel) is emerging as the standard for unified agent observability.

Why OpenTelemetry

| Benefit | Description |
| --- | --- |
| Unified standard | One format across all components |
| Distributed tracing | Follow requests across services |
| Interoperability | Works with Datadog, Grafana, Jaeger, etc. |
| Context propagation | Link related spans automatically |
| Rich semantic conventions | Standard meanings for common operations |

OpenTelemetry + LangSmith

LangSmith now supports end-to-end OpenTelemetry:

  1. LangChain instrumentation generates detailed traces
  2. LangSmith SDK converts to OpenTelemetry format
  3. Platform ingests and visualizes with LLM-specific features

Semantic Conventions for AI

Emerging standards for AI-specific spans:

| Span Type | Attributes |
| --- | --- |
| llm.completion | model, tokens_in, tokens_out, temperature |
| tool.call | tool_name, parameters, result_status |
| retrieval.query | query, doc_count, relevance |
| agent.step | step_type, confidence, decision |
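Since these conventions are still settling, one low-risk approach is to centralize attribute construction so key names can change in one place when the standard firms up. The helper names and key strings below are illustrative, not normative:

```python
# Attribute builders mirroring the (still-emerging) AI span conventions
# from the table above; the exact key names here are assumptions.

def llm_completion_attrs(model, tokens_in, tokens_out, temperature):
    return {
        "llm.model": model,
        "llm.tokens_in": tokens_in,
        "llm.tokens_out": tokens_out,
        "llm.temperature": temperature,
    }

def tool_call_attrs(tool_name, parameters, result_status):
    return {
        "tool.name": tool_name,
        "tool.parameters": parameters,
        "tool.result_status": result_status,
    }

attrs = llm_completion_attrs("gpt-4o", 1250, 180, 0.2)
print(attrs["llm.tokens_in"])  # 1250
```

When a convention is renamed upstream, only the builder changes; every instrumented call site stays the same.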

Observability Tools Landscape

LangSmith

Best for: LangChain/LangGraph ecosystems, dedicated LLM debugging

| Feature | Description |
| --- | --- |
| Trace visualization | Hierarchical span view |
| Playground | Re-run with modified prompts |
| Datasets | Test against golden examples |
| Monitoring | Real-time dashboards |
| Annotations | Human feedback on traces |

Generic Observability + LLM Extensions

| Tool | Approach |
| --- | --- |
| Datadog | Add LLM-specific metrics/traces |
| Grafana | Custom dashboards for agent metrics |
| Honeycomb | Event-based debugging |

Custom Implementation

For specialized needs:

Application Layer
  → Custom instrumentation (trace spans)
  → Export to:
    - Object storage (S3) for raw traces
    - Time-series DB for metrics
    - Search (Elasticsearch) for log analysis
  → Visualization dashboard

Alert Design for Agents

What to Alert On

| Condition | Priority | Action |
| --- | --- | --- |
| Success rate < 90% | Critical | Page on-call |
| P95 latency > 10s | High | Investigate |
| Tool failure rate > 5% | High | Check tool health |
| Cost spike > 2x | Medium | Review for anomaly |
| Escalation rate > 20% | Medium | Check agent quality |
| Policy violation | Critical | Immediate review |

Alert Fatigue Prevention

| Strategy | Implementation |
| --- | --- |
| Deduplicate | Group related alerts |
| Threshold hysteresis | Require a sustained state change |
| Severity tiers | Critical/High/Medium/Low actions differ |
| Runbooks | Each alert has clear response steps |
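Threshold hysteresis is easy to get subtly wrong, so here is one minimal sketch: the alert changes state only after the breach (or the recovery) persists for `sustain` consecutive checks. The class and parameter names are my own:

```python
class HysteresisAlert:
    """Fire only after a condition holds for `sustain` consecutive checks,
    and clear only after it recovers for `sustain` checks -- this avoids
    flapping alerts on a noisy success-rate metric."""

    def __init__(self, threshold, sustain=3):
        self.threshold = threshold
        self.sustain = sustain
        self.firing = False
        self._streak = 0

    def observe(self, success_rate):
        breached = success_rate < self.threshold
        if breached != self.firing:
            self._streak += 1          # state wants to flip; count evidence
            if self._streak >= self.sustain:
                self.firing = breached  # enough evidence: flip state
                self._streak = 0
        else:
            self._streak = 0            # metric agrees with current state
        return self.firing

alert = HysteresisAlert(threshold=0.90, sustain=3)
for rate in [0.95, 0.85, 0.85, 0.85]:
    state = alert.observe(rate)
print(state)  # True -- three sustained breaches fired the alert
```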

Anomaly Detection

For agents, rule-based alerts aren’t enough. Watch for:

  • Sudden distribution shifts in tool usage
  • Unusual token consumption patterns
  • Confidence score distribution changes
  • New error types appearing

Debugging Workflows

Investigation Flow

Alert: Success rate dropped

1. Check time range: When did it start?

2. Check scope: All workflows or specific?

3. Find failing traces

4. Analyze failure patterns:
   - Same tool failing?
   - Same error type?
   - Same user segment?

5. Replay representative failures

6. Identify root cause:
   - Tool issue
   - Prompt issue
   - Data issue
   - Model issue

7. Fix and verify

8. Post-incident review

Common Failure Patterns

| Pattern | Signs | Investigation |
| --- | --- | --- |
| Tool degradation | Tool call latency/errors spike | Check tool health |
| RAG quality drop | Low relevance scores | Check index freshness |
| Prompt regression | New prompt version correlates | Compare prompt versions |
| Model change | Provider update timeline matches | Check model behavior |
| Data issue | Specific users/inputs affected | Check data quality |

Implementation Checklist

Instrumentation:

  • Add trace spans for all major operations
  • Capture tool calls with parameters and results
  • Log retrieval queries and results
  • Record token usage per call
  • Store validation results

Storage:

  • Set up trace storage with retention policy
  • Implement PII redaction
  • Configure replay capability
  • Set up cost tracking

Metrics:

  • Define primary KPIs (success, latency, cost)
  • Create per-workflow dashboards
  • Implement trend tracking
  • Set up metric export

Alerting:

  • Configure critical alerts (success rate, errors)
  • Set up anomaly detection
  • Create alert runbooks
  • Test alert pathways

Integration:

  • Choose observability platform
  • Implement OpenTelemetry export
  • Connect to existing infrastructure
  • Set up cross-service correlation

FAQ

Are chat transcripts enough?

No. You need structured traces to diagnose tool errors and routing mistakes. Chat transcripts show the conversation but not the internal decisions, tool calls, or retrieval that produced the response.

How much trace data should I store?

| Retention | What to Keep |
| --- | --- |
| 7 days | Full traces, all spans |
| 30 days | Summary traces, sampled full traces |
| 90 days | Metrics and error traces only |
| Permanent | Incidents and golden tests |

Do I need OpenTelemetry?

Not strictly, but it provides:

  • Standard format that works across tools
  • Distributed tracing capability
  • Growing ecosystem support
  • Future-proof investment

How do I trace across multiple agents?

Use distributed tracing with context propagation:

  • Pass trace ID between agents
  • Use parent-child span relationships
  • Correlate by session ID
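Propagation can be as small as a dict handed from caller to callee. A sketch with hypothetical helper names, using a shared `trace_id` plus parent-child span links:

```python
import uuid

def new_trace_context():
    """Root context created by the first agent handling a request."""
    return {"trace_id": uuid.uuid4().hex, "parent_span_id": None}

def child_context(ctx, span_id):
    """Same trace_id, new parent: links the callee's spans to the caller."""
    return {"trace_id": ctx["trace_id"], "parent_span_id": span_id}

# Agent A hands off to agent B; both report spans under one trace_id,
# so the backend can stitch them into a single tree.
ctx_a = new_trace_context()
span_a = uuid.uuid4().hex           # id of A's "delegate-to-B" span
ctx_b = child_context(ctx_a, span_a)
assert ctx_b["trace_id"] == ctx_a["trace_id"]
assert ctx_b["parent_span_id"] == span_a
```

OpenTelemetry's W3C trace-context propagation does the same thing over HTTP headers; the dict here just makes the mechanism visible.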

What’s the overhead of tracing?

| Component | Overhead |
| --- | --- |
| Span creation | < 1 ms typically |
| Context propagation | Negligible |
| Export | Batched, async |
| Storage | Main cost driver |

Total: typically < 5% latency overhead for well-implemented tracing.

How do I balance detail vs. cost?

| Strategy | Trade-off |
| --- | --- |
| Sampling | 10% of traces in full detail |
| Tiered detail | More detail for failures |
| Aggregation | Metrics vs. full traces |
| Retention | Shorter window for full traces |
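Sampling and tiered detail combine naturally: always keep failures in full, sample successes. A minimal sketch (the function name and the 10% default are illustrative):

```python
import random

def should_store_full_trace(failed: bool, sample_rate: float = 0.10) -> bool:
    """Tiered detail: every failure is kept in full; successes are sampled."""
    if failed:
        return True
    return random.random() < sample_rate

# Rough check: about 10% of 10,000 successful runs kept in full detail
random.seed(0)
kept = sum(should_store_full_trace(False) for _ in range(10_000))
print(kept)  # roughly 1000
```

Failures are the traces you will actually open during an incident, so biasing storage toward them costs little and pays off directly.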

