Agent Observability in 2026: Traces, Costs, and Failure Modes
If you can't see why an agent failed, you can't fix it. A practical observability model for tool-using agents: traces, spans, replay, and the metrics that matter.
TL;DR
- You need traces for: model calls, tool calls, retrieval, and UI actions
- Store enough detail to replay failures — not just logs, but reproducible state
- Key metrics: success rate per workflow, escalation rate, p95 latency, cost per successful completion
- OpenTelemetry is emerging as the standard for unified agent observability
- LangSmith and similar platforms provide LLM-specific visualization and debugging
- Chat transcripts are not enough — you need structured traces to diagnose tool errors
Why Agent Observability Is Different
Traditional application monitoring doesn’t capture what makes agents unique:
| Traditional App | Agent System |
|---|---|
| Request → Response | Request → Planning → Tools → Response |
| Deterministic | Non-deterministic |
| Clear error states | "It answered, but incorrectly" |
| Stack traces | Decision traces |
| Code bugs | Prompt/context/tool bugs |
The Debugging Challenge
When an agent fails, you need to answer:
- What was the user’s actual intent?
- What context did the agent have?
- Which tools were called with what parameters?
- What did each tool return?
- How did the agent interpret those results?
- Why did it produce this specific output?
Without observability, you’re guessing.
What an Agent Trace Should Include
The Minimum Trace Model
| Layer | What to Capture |
|---|---|
| Request | User input, session context, user state |
| Intent | Classified intent, confidence, alternatives |
| Planning | Steps planned, reasoning (if available) |
| Tools | Each call: name, inputs, outputs, timing |
| Retrieval | Query, docs retrieved, relevance scores |
| Generation | Prompt assembled, tokens used, response |
| Validation | Checks run, results, failures |
| Output | Final response, schema validation |
Trace Structure
Trace: agent-request-12345
├── Span: intent-classification (15ms)
│ ├── input: "Refund my order 12345"
│ ├── intent: "refund_request"
│ └── confidence: 0.94
├── Span: context-retrieval (120ms)
│ ├── query: "refund policy order 12345"
│ ├── docs_retrieved: 3
│ └── relevance_scores: [0.89, 0.72, 0.68]
├── Span: tool-call-get_order (85ms)
│ ├── tool: "get_order_status"
│ ├── params: {"order_id": "12345"}
│ └── result: {"status": "delivered", ...}
├── Span: tool-call-initiate_refund (150ms)
│ ├── tool: "initiate_refund"
│ ├── params: {"order_id": "12345", "amount": 99.99}
│ └── result: {"success": true, "refund_id": "ref_789"}
├── Span: response-generation (200ms)
│ ├── tokens_in: 1250
│ ├── tokens_out: 180
│ └── model: "gpt-4o"
└── Span: validation (5ms)
├── schema_valid: true
└── policy_compliant: true
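The tree above maps naturally onto a small span model. A minimal, dependency-free sketch (field names are illustrative, not a standard):

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One unit of work in an agent trace."""
    name: str
    attributes: dict = field(default_factory=dict)
    duration_ms: float = 0.0

@dataclass
class Trace:
    trace_id: str
    spans: list = field(default_factory=list)

    def span(self, name: str, duration_ms: float = 0.0, **attributes) -> Span:
        """Append and return a new span under this trace."""
        s = Span(name=name, attributes=attributes, duration_ms=duration_ms)
        self.spans.append(s)
        return s

# Recording two spans from the refund example above:
trace = Trace(trace_id="agent-request-12345")
trace.span("intent-classification", duration_ms=15,
           input="Refund my order 12345",
           intent="refund_request", confidence=0.94)
trace.span("tool-call-get_order", duration_ms=85,
           tool="get_order_status", params={"order_id": "12345"})
```

A real implementation would also record parent-child relationships and timestamps; this shows only the shape of the data.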
What to Store vs. Redact
| Store | Redact |
|---|---|
| Intent and classification | PII in user messages |
| Tool names and timing | Sensitive tool parameters |
| Success/failure status | API keys, secrets |
| Token counts | Full conversation (store reference) |
| Error messages | Credit card numbers |
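Redaction should happen before a span reaches the trace store. A sketch of the idea using simple regex passes (the patterns are illustrative, not production-grade PII detection):

```python
import re

# Each pattern maps a sensitive shape to a placeholder. These are
# deliberately crude examples; real systems use dedicated PII scanners.
REDACTIONS = [
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),        # card-like digit runs
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),  # email addresses
    (re.compile(r"\b(?:sk|api)[-_][A-Za-z0-9]{16,}\b"), "[SECRET]"),  # key-shaped tokens
]

def redact(text: str) -> str:
    """Apply all redaction patterns before writing to the trace store."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

Applying this at the span-write boundary keeps the rest of the pipeline free of raw PII.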
The Metrics That Actually Matter
Don’t drown in data. Focus on what drives decisions.
Primary Metrics
| Metric | Definition | Why It Matters |
|---|---|---|
| Success rate (per workflow) | Completed / Attempted | Core reliability |
| Escalation rate | Escalated / Total | Agent capability |
| P95 latency (per workflow) | 95th percentile duration | User experience |
| Cost per successful completion | Total cost / Successes | Unit economics |
Secondary Metrics
| Metric | Definition | Purpose |
|---|---|---|
| Tool call success rate | Successful calls / Total | Tool reliability |
| Retrieval precision | Relevant docs / Retrieved | RAG quality |
| Schema validation rate | Valid / Total | Output quality |
| Policy violation rate | Violations / Total | Safety |
| Retry rate | Retries / Attempts | Error handling |
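The primary metrics fall out directly from trace records. A sketch assuming each record carries `success`, `escalated`, `latency_ms`, and `cost` fields (the record shape is an assumption for illustration):

```python
def workflow_metrics(traces: list[dict]) -> dict:
    """Compute the primary metrics from simplified trace records."""
    n = len(traces)
    successes = sum(t["success"] for t in traces)
    latencies = sorted(t["latency_ms"] for t in traces)
    # p95 via the nearest-rank method on the sorted latencies
    p95 = latencies[min(n - 1, int(0.95 * n))]
    return {
        "success_rate": successes / n,
        "escalation_rate": sum(t["escalated"] for t in traces) / n,
        "p95_latency_ms": p95,
        "cost_per_success": sum(t["cost"] for t in traces) / max(successes, 1),
    }
```

Computing these per workflow, not just globally, is what makes them actionable: a 90% overall success rate can hide a workflow that fails half the time.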
Metric Hierarchy
North Star: Successful task completion rate
↓
Supporting: Escalation rate, User satisfaction
↓
Diagnostic: Tool success, Latency, Cost
↓
Granular: Token usage, Cache hit rate, Model distribution
Cost Per Outcome
Track not just token cost, but total cost per successful outcome:
Cost per success = (
Model costs +
Tool API costs +
Retrieval costs +
Infrastructure costs
) / Successful completions
This reveals whether your optimizations actually improve unit economics.
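As a concrete sketch of that formula, with model cost derived from token counts (the per-token prices below are made up for illustration; real prices vary by model and provider):

```python
# Illustrative prices, NOT real provider pricing.
PRICE_PER_1K_IN = 0.0025
PRICE_PER_1K_OUT = 0.01

def cost_per_success(llm_spans: list[dict], tool_api_cost: float,
                     retrieval_cost: float, infra_cost: float,
                     successes: int) -> float:
    """Total cost across all components divided by successful completions."""
    model_cost = sum(
        s["tokens_in"] / 1000 * PRICE_PER_1K_IN +
        s["tokens_out"] / 1000 * PRICE_PER_1K_OUT
        for s in llm_spans
    )
    total = model_cost + tool_api_cost + retrieval_cost + infra_cost
    return total / successes
```

The point of folding in tool, retrieval, and infrastructure costs is that a prompt change which halves token spend but doubles tool retries can still make unit economics worse.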
Replay: The Debugging Superpower
When something breaks, you want to reproduce it exactly.
What Enables Replay
| Component | Requirement |
|---|---|
| User input | Exact message, context |
| System state | User info, permissions, prior conversation |
| Tool outputs | Exact responses (cached) |
| Retrieval results | Exact documents (snapshot) |
| Model version | Which model at what time |
| Configuration | Prompt versions, parameters |
Replay Architecture
Production Request
↓
Capture full context
↓
Store in trace store
↓
On failure investigation:
↓
Replay engine loads:
- Original inputs
- Cached tool outputs
- Document snapshots
↓
Run agent with original context
↓
Compare: Same output? Different?
↓
Test fixes safely
Replay vs. Re-run
| Re-run | Replay |
|---|---|
| Call live tools again | Use cached tool outputs |
| May get different results | Exactly reproducible |
| Side effects possible | Side-effect free |
| Tests current state | Tests historical state |
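The core of replay is substituting cached tool outputs for live calls. A minimal sketch (the cache key shape is an assumption for illustration):

```python
class ReplayToolRunner:
    """Serve cached tool outputs from a captured trace instead of calling
    live tools, so a failure can be reproduced without side effects."""

    def __init__(self, cached_outputs: dict):
        # Assumed cache shape: {(tool_name, sorted_params_tuple): result}
        self.cached = cached_outputs

    def call(self, tool_name: str, **params):
        key = (tool_name, tuple(sorted(params.items())))
        if key not in self.cached:
            # The agent diverged from the original run: it is calling a
            # tool (or parameters) that were never captured.
            raise KeyError(f"divergence: no cached output for {key}")
        return self.cached[key]

# Cache captured from the refund trace earlier in this article:
cache = {("get_order_status", (("order_id", "12345"),)): {"status": "delivered"}}
runner = ReplayToolRunner(cache)
```

A cache miss is itself a useful signal: it means the candidate fix changed the agent's tool-calling behavior, which is often exactly what you are testing for.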
OpenTelemetry for Agents
OpenTelemetry (OTel) is emerging as the standard for unified agent observability.
Why OpenTelemetry
| Benefit | Description |
|---|---|
| Unified standard | One format across all components |
| Distributed tracing | Follow requests across services |
| Interoperability | Works with Datadog, Grafana, Jaeger, etc. |
| Context propagation | Link related spans automatically |
| Rich semantic conventions | Standard meanings for common operations |
OpenTelemetry + LangSmith
LangSmith now supports end-to-end OpenTelemetry:
- LangChain instrumentation generates detailed traces
- LangSmith SDK converts to OpenTelemetry format
- Platform ingests and visualizes with LLM-specific features
Semantic Conventions for AI
Emerging standards for AI-specific spans:
| Span Type | Attributes |
|---|---|
| llm.completion | model, tokens_in, tokens_out, temperature |
| tool.call | tool_name, parameters, result_status |
| retrieval.query | query, doc_count, relevance |
| agent.step | step_type, confidence, decision |
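Conventions like these are only useful if instrumentation actually follows them. A small sketch of a convention checker (the attribute names mirror the table above and are illustrative; the official OpenTelemetry GenAI conventions use a `gen_ai.*` namespace):

```python
# Required attributes per span type, mirroring the table above.
SPAN_CONVENTIONS = {
    "llm.completion": {"model", "tokens_in", "tokens_out", "temperature"},
    "tool.call": {"tool_name", "parameters", "result_status"},
    "retrieval.query": {"query", "doc_count", "relevance"},
    "agent.step": {"step_type", "confidence", "decision"},
}

def missing_attributes(span_type: str, attributes: dict) -> list[str]:
    """Return the convention attributes a span failed to set."""
    required = SPAN_CONVENTIONS.get(span_type, set())
    return sorted(required - attributes.keys())
```

Running a check like this in CI or at export time catches instrumentation drift before it shows up as holes in your dashboards.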
Observability Tools Landscape
LangSmith
Best for: LangChain/LangGraph ecosystems, dedicated LLM debugging
| Feature | Description |
|---|---|
| Trace visualization | Hierarchical span view |
| Playground | Re-run with modified prompts |
| Datasets | Test against golden examples |
| Monitoring | Real-time dashboards |
| Annotations | Human feedback on traces |
Generic Observability + LLM Extensions
| Tool | Approach |
|---|---|
| Datadog | Add LLM-specific metrics/traces |
| Grafana | Custom dashboards for agent metrics |
| Honeycomb | Event-based debugging |
Custom Implementation
For specialized needs:
Application Layer
↓
Custom instrumentation (trace spans)
↓
Export to:
- Object storage (S3) for raw traces
- Time-series DB for metrics
- Search (Elasticsearch) for log analysis
↓
Visualization dashboard
Alert Design for Agents
What to Alert On
| Condition | Priority | Action |
|---|---|---|
| Success rate < 90% | Critical | Page on-call |
| P95 latency > 10s | High | Investigate |
| Tool failure rate > 5% | High | Check tool health |
| Cost spike > 2x | Medium | Review for anomaly |
| Escalation rate > 20% | Medium | Check agent quality |
| Policy violation | Critical | Immediate review |
Alert Fatigue Prevention
| Strategy | Implementation |
|---|---|
| Deduplicate | Group related alerts |
| Threshold hysteresis | Require sustained state change |
| Severity tiers | Critical/High/Medium/Low actions differ |
| Runbooks | Each alert has clear response steps |
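Threshold hysteresis from the table above can be sketched as a small state machine: the alert fires only after the metric breaches its threshold for N consecutive evaluation windows, and clears only after N healthy windows (the threshold and window counts here are illustrative):

```python
class HysteresisAlert:
    """Require a sustained state change before firing or clearing,
    to avoid flapping alerts on noisy metrics."""

    def __init__(self, threshold: float, windows: int = 3):
        self.threshold = threshold
        self.windows = windows
        self.breaches = 0
        self.healthy = 0
        self.firing = False

    def observe(self, success_rate: float) -> bool:
        """Feed one evaluation window; return whether the alert is firing."""
        if success_rate < self.threshold:
            self.breaches += 1
            self.healthy = 0
            if self.breaches >= self.windows:
                self.firing = True
        else:
            self.healthy += 1
            self.breaches = 0
            if self.healthy >= self.windows:
                self.firing = False
        return self.firing
```

The asymmetry matters: a single good window should not silence an alert any more than a single bad window should raise one.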
Anomaly Detection
For agents, rule-based alerts aren’t enough. Watch for:
- Sudden distribution shifts in tool usage
- Unusual token consumption patterns
- Confidence score distribution changes
- New error types appearing
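A minimal starting point for the token-consumption case is a z-score against a recent baseline (a deliberately simple heuristic; production systems typically use seasonal or learned baselines):

```python
import statistics

def token_anomaly(history: list[int], current: float,
                  z_cutoff: float = 3.0) -> bool:
    """Flag a token-consumption reading far outside the recent baseline."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > z_cutoff
```

The same shape applies to tool-usage distributions and confidence scores; what changes is the statistic, not the pattern of comparing current behavior to a rolling baseline.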
Debugging Workflows
Investigation Flow
Alert: Success rate dropped
↓
1. Check time range: When did it start?
↓
2. Check scope: All workflows or specific?
↓
3. Find failing traces
↓
4. Analyze failure patterns:
- Same tool failing?
- Same error type?
- Same user segment?
↓
5. Replay representative failures
↓
6. Identify root cause:
- Tool issue
- Prompt issue
- Data issue
- Model issue
↓
7. Fix and verify
↓
8. Post-incident review
Common Failure Patterns
| Pattern | Signs | Investigation |
|---|---|---|
| Tool degradation | Tool call latency/errors spike | Check tool health |
| RAG quality drop | Low relevance scores | Check index freshness |
| Prompt regression | New prompt version correlates | Compare prompt versions |
| Model change | Provider update timeline matches | Check model behavior |
| Data issue | Specific users/inputs affected | Check data quality |
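The pattern-analysis step of the investigation flow is largely a grouping exercise over failed traces. A sketch, assuming each failed trace record carries illustrative `failed_tool`, `error_type`, and `user_segment` fields:

```python
from collections import Counter

def failure_patterns(failed_traces: list[dict]) -> dict:
    """Group failures along the dimensions used during investigation:
    which tool, which error type, which user segment."""
    return {
        "by_tool": Counter(t.get("failed_tool") for t in failed_traces),
        "by_error": Counter(t.get("error_type") for t in failed_traces),
        "by_segment": Counter(t.get("user_segment") for t in failed_traces),
    }
```

If one bucket dominates (say, 90% of failures share an error type), you have a candidate root cause; a flat distribution suggests a broader regression such as a model or prompt change.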
Implementation Checklist
Instrumentation:
- Add trace spans for all major operations
- Capture tool calls with parameters and results
- Log retrieval queries and results
- Record token usage per call
- Store validation results
Storage:
- Set up trace storage with retention policy
- Implement PII redaction
- Configure replay capability
- Set up cost tracking
Metrics:
- Define primary KPIs (success, latency, cost)
- Create per-workflow dashboards
- Implement trend tracking
- Set up metric export
Alerting:
- Configure critical alerts (success rate, errors)
- Set up anomaly detection
- Create alert runbooks
- Test alert pathways
Integration:
- Choose observability platform
- Implement OpenTelemetry export
- Connect to existing infrastructure
- Set up cross-service correlation
FAQ
Are chat transcripts enough?
No. You need structured traces to diagnose tool errors and routing mistakes. Chat transcripts show the conversation but not the internal decisions, tool calls, or retrieval that produced the response.
How much trace data should I store?
| Retention | What to Keep |
|---|---|
| 7 days | Full traces, all spans |
| 30 days | Summary traces, sampled full |
| 90 days | Metrics and error traces only |
| Permanent | Incidents and golden tests |
Do I need OpenTelemetry?
Not strictly, but it provides:
- Standard format that works across tools
- Distributed tracing capability
- Growing ecosystem support
- Future-proof investment
How do I trace across multiple agents?
Use distributed tracing with context propagation:
- Pass trace ID between agents
- Use parent-child span relationships
- Correlate by session ID
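A sketch of what propagation looks like at the call boundary between agents (the header names are simplified illustrations in the spirit of W3C Trace Context, not the real `traceparent` format):

```python
import uuid

def make_headers(trace_id=None, parent_span_id=None) -> dict:
    """Headers one agent attaches when calling another, so both runs
    land in the same distributed trace."""
    return {
        "x-trace-id": trace_id or uuid.uuid4().hex,
        "x-parent-span-id": parent_span_id or "",
    }

def child_context(incoming_headers: dict, span_id: str) -> dict:
    """The receiving agent reuses the caller's trace ID and records
    the caller's span as its parent."""
    return {
        "trace_id": incoming_headers["x-trace-id"],
        "parent_span_id": incoming_headers.get("x-parent-span-id") or None,
        "span_id": span_id,
    }
```

With OpenTelemetry, context propagation of this kind is handled by the SDK's propagators rather than hand-rolled headers; the sketch just shows what is being threaded through.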
What’s the overhead of tracing?
| Component | Overhead |
|---|---|
| Span creation | < 1ms typically |
| Context propagation | Negligible |
| Export | Batched, async |
| Storage | Main cost driver |
Total: typically < 5% latency overhead for well-implemented tracing.
How do I balance detail vs. cost?
| Strategy | Trade-off |
|---|---|
| Sampling | 10% of traces in full detail |
| Tiered detail | More detail for failures |
| Aggregation | Metrics vs. full traces |
| Retention | Shorter window for full traces |
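Sampling and tiered detail combine naturally into a single per-trace decision. A sketch, assuming illustrative `success` and `policy_violation` fields on the trace record:

```python
import random

def sample_decision(trace: dict, base_rate: float = 0.10) -> str:
    """Decide how much of a trace to keep: failures and policy violations
    always keep full detail; successes are sampled at base_rate; the rest
    keep summary metrics only."""
    if not trace.get("success", True) or trace.get("policy_violation"):
        return "full"
    if random.random() < base_rate:
        return "full"
    return "summary"
```

Keeping every failure in full detail is usually cheap, because failures should be rare; the sampling budget is spent on the high-volume success path.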