Agent Observability in 2026: Traces, Costs, and Failure Modes
If you can't see why an agent failed, you can't fix it. A practical observability model for tool-using agents: traces, spans, replay, and the metrics that matter.
TL;DR
- You need traces for: model calls, tool calls, retrieval, and UI actions
- Store enough detail to replay failures — not just logs, but reproducible state
- Key metrics: success rate per workflow, escalation rate, p95 latency, cost per successful completion
- OpenTelemetry is emerging as the standard for unified agent observability
- LangSmith and similar platforms provide LLM-specific visualization and debugging
- Chat transcripts are not enough — you need structured traces to diagnose tool errors
Why Agent Observability Is Different
Traditional application monitoring doesn’t capture what makes agents unique:
| Traditional App | Agent System |
|---|---|
| Request → Response | Request → Planning → Tools → Response |
| Deterministic | Non-deterministic |
| Clear error states | "It answered, but incorrectly" |
| Stack traces | Decision traces |
| Code bugs | Prompt/context/tool bugs |
The Debugging Challenge
When an agent fails, you need to answer:
- What was the user’s actual intent?
- What context did the agent have?
- Which tools were called with what parameters?
- What did each tool return?
- How did the agent interpret those results?
- Why did it produce this specific output?
Without observability, you’re guessing.
What an Agent Trace Should Include
The Minimum Trace Model
| Layer | What to Capture |
|---|---|
| Request | User input, session context, user state |
| Intent | Classified intent, confidence, alternatives |
| Planning | Steps planned, reasoning (if available) |
| Tools | Each call: name, inputs, outputs, timing |
| Retrieval | Query, docs retrieved, relevance scores |
| Generation | Prompt assembled, tokens used, response |
| Validation | Checks run, results, failures |
| Output | Final response, schema validation |
Trace Structure
Trace: agent-request-12345
├── Span: intent-classification (15ms)
│ ├── input: "Refund my order 12345"
│ ├── intent: "refund_request"
│ └── confidence: 0.94
├── Span: context-retrieval (120ms)
│ ├── query: "refund policy order 12345"
│ ├── docs_retrieved: 3
│ └── relevance_scores: [0.89, 0.72, 0.68]
├── Span: tool-call-get_order (85ms)
│ ├── tool: "get_order_status"
│ ├── params: {"order_id": "12345"}
│ └── result: {"status": "delivered", ...}
├── Span: tool-call-initiate_refund (150ms)
│ ├── tool: "initiate_refund"
│ ├── params: {"order_id": "12345", "amount": 99.99}
│ └── result: {"success": true, "refund_id": "ref_789"}
├── Span: response-generation (200ms)
│ ├── tokens_in: 1250
│ ├── tokens_out: 180
│ └── model: "gpt-4o"
└── Span: validation (5ms)
├── schema_valid: true
└── policy_compliant: true
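The tree above maps naturally onto a small span model. A minimal, dependency-free sketch (field names are illustrative, not a standard):

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One unit of work in an agent trace."""
    name: str
    attributes: dict = field(default_factory=dict)
    duration_ms: float = 0.0

@dataclass
class Trace:
    trace_id: str
    spans: list = field(default_factory=list)

    def span(self, name: str, duration_ms: float = 0.0, **attributes) -> Span:
        """Append and return a new span under this trace."""
        s = Span(name=name, attributes=attributes, duration_ms=duration_ms)
        self.spans.append(s)
        return s

# Recording two spans from the refund example above:
trace = Trace(trace_id="agent-request-12345")
trace.span("intent-classification", duration_ms=15,
           input="Refund my order 12345",
           intent="refund_request", confidence=0.94)
trace.span("tool-call-get_order", duration_ms=85,
           tool="get_order_status", params={"order_id": "12345"})
```

A real implementation would also record parent-child relationships and timestamps; this shows only the shape of the data.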
What to Store vs. Redact
| Store | Redact |
|---|---|
| Intent and classification | PII in user messages |
| Tool names and timing | Sensitive tool parameters |
| Success/failure status | API keys, secrets |
| Token counts | Full conversation (store reference) |
| Error messages | Credit card numbers |
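Redaction should happen before a span reaches the trace store. A sketch of the idea using simple regex passes (the patterns are illustrative, not production-grade PII detection):

```python
import re

# Each pattern maps a sensitive shape to a placeholder. These are
# deliberately crude examples; real systems use dedicated PII scanners.
REDACTIONS = [
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),        # card-like digit runs
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),  # email addresses
    (re.compile(r"\b(?:sk|api)[-_][A-Za-z0-9]{16,}\b"), "[SECRET]"),  # key-shaped tokens
]

def redact(text: str) -> str:
    """Apply all redaction patterns before writing to the trace store."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

Applying this at the span-write boundary keeps the rest of the pipeline free of raw PII.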
The Metrics That Actually Matter
Don’t drown in data. Focus on what drives decisions.
Primary Metrics
| Metric | Definition | Why It Matters |
|---|---|---|
| Success rate (per workflow) | Completed / Attempted | Core reliability |
| Escalation rate | Escalated / Total | Agent capability |
| P95 latency (per workflow) | 95th percentile duration | User experience |
| Cost per successful completion | Total cost / Successes | Unit economics |
Secondary Metrics
| Metric | Definition | Purpose |
|---|---|---|
| Tool call success rate | Successful calls / Total | Tool reliability |
| Retrieval precision | Relevant docs / Retrieved | RAG quality |
| Schema validation rate | Valid / Total | Output quality |
| Policy violation rate | Violations / Total | Safety |
| Retry rate | Retries / Attempts | Error handling |
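The primary metrics fall out directly from trace records. A sketch assuming each record carries `success`, `escalated`, `latency_ms`, and `cost` fields (the record shape is an assumption for illustration):

```python
def workflow_metrics(traces: list[dict]) -> dict:
    """Compute the primary metrics from simplified trace records."""
    n = len(traces)
    successes = sum(t["success"] for t in traces)
    latencies = sorted(t["latency_ms"] for t in traces)
    # p95 via the nearest-rank method on the sorted latencies
    p95 = latencies[min(n - 1, int(0.95 * n))]
    return {
        "success_rate": successes / n,
        "escalation_rate": sum(t["escalated"] for t in traces) / n,
        "p95_latency_ms": p95,
        "cost_per_success": sum(t["cost"] for t in traces) / max(successes, 1),
    }
```

Computing these per workflow, not just globally, is what makes them actionable: a 90% overall success rate can hide a workflow that fails half the time.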
Metric Hierarchy
North Star: Successful task completion rate
↓
Supporting: Escalation rate, User satisfaction
↓
Diagnostic: Tool success, Latency, Cost
↓
Granular: Token usage, Cache hit rate, Model distribution
Cost Per Outcome
Track not just token cost, but total cost per successful outcome:
Cost per success = (
Model costs +
Tool API costs +
Retrieval costs +
Infrastructure costs
) / Successful completions
This reveals whether your optimizations actually improve unit economics.
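As a concrete sketch of that formula, with model cost derived from token counts (the per-token prices below are made up for illustration; real prices vary by model and provider):

```python
# Illustrative prices, NOT real provider pricing.
PRICE_PER_1K_IN = 0.0025
PRICE_PER_1K_OUT = 0.01

def cost_per_success(llm_spans: list[dict], tool_api_cost: float,
                     retrieval_cost: float, infra_cost: float,
                     successes: int) -> float:
    """Total cost across all components divided by successful completions."""
    model_cost = sum(
        s["tokens_in"] / 1000 * PRICE_PER_1K_IN +
        s["tokens_out"] / 1000 * PRICE_PER_1K_OUT
        for s in llm_spans
    )
    total = model_cost + tool_api_cost + retrieval_cost + infra_cost
    return total / successes
```

The point of folding in tool, retrieval, and infrastructure costs is that a prompt change which halves token spend but doubles tool retries can still make unit economics worse.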
Replay: The Debugging Superpower
When something breaks, you want to reproduce it exactly.
What Enables Replay
| Component | Requirement |
|---|---|
| User input | Exact message, context |
| System state | User info, permissions, prior conversation |
| Tool outputs | Exact responses (cached) |
| Retrieval results | Exact documents (snapshot) |
| Model version | Which model at what time |
| Configuration | Prompt versions, parameters |
Replay Architecture
Production Request
↓
Capture full context
↓
Store in trace store
↓
On failure investigation:
↓
Replay engine loads:
- Original inputs
- Cached tool outputs
- Document snapshots
↓
Run agent with original context
↓
Compare: Same output? Different?
↓
Test fixes safely
Replay vs. Re-run
| Re-run | Replay |
|---|---|
| Call live tools again | Use cached tool outputs |
| May get different results | Exactly reproducible |
| Side effects possible | Side-effect free |
| Tests current state | Tests historical state |
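The core of replay is substituting cached tool outputs for live calls. A minimal sketch (the cache key shape is an assumption for illustration):

```python
class ReplayToolRunner:
    """Serve cached tool outputs from a captured trace instead of calling
    live tools, so a failure can be reproduced without side effects."""

    def __init__(self, cached_outputs: dict):
        # Assumed cache shape: {(tool_name, sorted_params_tuple): result}
        self.cached = cached_outputs

    def call(self, tool_name: str, **params):
        key = (tool_name, tuple(sorted(params.items())))
        if key not in self.cached:
            # The agent diverged from the original run: it is calling a
            # tool (or parameters) that were never captured.
            raise KeyError(f"divergence: no cached output for {key}")
        return self.cached[key]

# Cache captured from the refund trace earlier in this article:
cache = {("get_order_status", (("order_id", "12345"),)): {"status": "delivered"}}
runner = ReplayToolRunner(cache)
```

A cache miss is itself a useful signal: it means the candidate fix changed the agent's tool-calling behavior, which is often exactly what you are testing for.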
OpenTelemetry for Agents
OpenTelemetry (OTel) is emerging as the standard for unified agent observability.
Why OpenTelemetry
| Benefit | Description |
|---|---|
| Unified standard | One format across all components |
| Distributed tracing | Follow requests across services |
| Interoperability | Works with Datadog, Grafana, Jaeger, etc. |
| Context propagation | Link related spans automatically |
| Rich semantic conventions | Standard meanings for common operations |
OpenTelemetry + LangSmith
LangSmith now supports end-to-end OpenTelemetry:
- LangChain instrumentation generates detailed traces
- LangSmith SDK converts to OpenTelemetry format
- Platform ingests and visualizes with LLM-specific features
Semantic Conventions for AI
Emerging standards for AI-specific spans:
| Span Type | Attributes |
|---|---|
| llm.completion | model, tokens_in, tokens_out, temperature |
| tool.call | tool_name, parameters, result_status |
| retrieval.query | query, doc_count, relevance |
| agent.step | step_type, confidence, decision |
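Conventions like these are only useful if instrumentation actually follows them. A small sketch of a convention checker (the attribute names mirror the table above and are illustrative; the official OpenTelemetry GenAI conventions use a `gen_ai.*` namespace):

```python
# Required attributes per span type, mirroring the table above.
SPAN_CONVENTIONS = {
    "llm.completion": {"model", "tokens_in", "tokens_out", "temperature"},
    "tool.call": {"tool_name", "parameters", "result_status"},
    "retrieval.query": {"query", "doc_count", "relevance"},
    "agent.step": {"step_type", "confidence", "decision"},
}

def missing_attributes(span_type: str, attributes: dict) -> list[str]:
    """Return the convention attributes a span failed to set."""
    required = SPAN_CONVENTIONS.get(span_type, set())
    return sorted(required - attributes.keys())
```

Running a check like this in CI or at export time catches instrumentation drift before it shows up as holes in your dashboards.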
Observability Tools Landscape
LangSmith
Best for: LangChain/LangGraph ecosystems, dedicated LLM debugging
| Feature | Description |
|---|---|
| Trace visualization | Hierarchical span view |
| Playground | Re-run with modified prompts |
| Datasets | Test against golden examples |
| Monitoring | Real-time dashboards |
| Annotations | Human feedback on traces |
Generic Observability + LLM Extensions
| Tool | Approach |
|---|---|
| Datadog | Add LLM-specific metrics/traces |
| Grafana | Custom dashboards for agent metrics |
| Honeycomb | Event-based debugging |
Custom Implementation
For specialized needs:
Application Layer
↓
Custom instrumentation (trace spans)
↓
Export to:
- Object storage (S3) for raw traces
- Time-series DB for metrics
- Search (Elasticsearch) for log analysis
↓
Visualization dashboard
Alert Design for Agents
What to Alert On
| Condition | Priority | Action |
|---|---|---|
| Success rate < 90% | Critical | Page on-call |
| P95 latency > 10s | High | Investigate |
| Tool failure rate > 5% | High | Check tool health |
| Cost spike > 2x | Medium | Review for anomaly |
| Escalation rate > 20% | Medium | Check agent quality |
| Policy violation | Critical | Immediate review |
Alert Fatigue Prevention
| Strategy | Implementation |
|---|---|
| Deduplicate | Group related alerts |
| Threshold hysteresis | Require sustained state change |
| Severity tiers | Critical/High/Medium/Low actions differ |
| Runbooks | Each alert has clear response steps |
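Threshold hysteresis from the table above can be sketched as a small state machine: the alert fires only after the metric breaches its threshold for N consecutive evaluation windows, and clears only after N healthy windows (the threshold and window counts here are illustrative):

```python
class HysteresisAlert:
    """Require a sustained state change before firing or clearing,
    to avoid flapping alerts on noisy metrics."""

    def __init__(self, threshold: float, windows: int = 3):
        self.threshold = threshold
        self.windows = windows
        self.breaches = 0
        self.healthy = 0
        self.firing = False

    def observe(self, success_rate: float) -> bool:
        """Feed one evaluation window; return whether the alert is firing."""
        if success_rate < self.threshold:
            self.breaches += 1
            self.healthy = 0
            if self.breaches >= self.windows:
                self.firing = True
        else:
            self.healthy += 1
            self.breaches = 0
            if self.healthy >= self.windows:
                self.firing = False
        return self.firing
```

The asymmetry matters: a single good window should not silence an alert any more than a single bad window should raise one.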
Anomaly Detection
For agents, rule-based alerts aren’t enough. Watch for:
- Sudden distribution shifts in tool usage
- Unusual token consumption patterns
- Confidence score distribution changes
- New error types appearing
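A minimal starting point for the token-consumption case is a z-score against a recent baseline (a deliberately simple heuristic; production systems typically use seasonal or learned baselines):

```python
import statistics

def token_anomaly(history: list[int], current: float,
                  z_cutoff: float = 3.0) -> bool:
    """Flag a token-consumption reading far outside the recent baseline."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > z_cutoff
```

The same shape applies to tool-usage distributions and confidence scores; what changes is the statistic, not the pattern of comparing current behavior to a rolling baseline.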
Debugging Workflows
Investigation Flow
Alert: Success rate dropped
↓
1. Check time range: When did it start?
↓
2. Check scope: All workflows or specific?
↓
3. Find failing traces
↓
4. Analyze failure patterns:
- Same tool failing?
- Same error type?
- Same user segment?
↓
5. Replay representative failures
↓
6. Identify root cause:
- Tool issue
- Prompt issue
- Data issue
- Model issue
↓
7. Fix and verify
↓
8. Post-incident review
Common Failure Patterns
| Pattern | Signs | Investigation |
|---|---|---|
| Tool degradation | Tool call latency/errors spike | Check tool health |
| RAG quality drop | Low relevance scores | Check index freshness |
| Prompt regression | New prompt version correlates | Compare prompt versions |
| Model change | Provider update timeline matches | Check model behavior |
| Data issue | Specific users/inputs affected | Check data quality |
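The pattern-analysis step of the investigation flow is largely a grouping exercise over failed traces. A sketch, assuming each failed trace record carries illustrative `failed_tool`, `error_type`, and `user_segment` fields:

```python
from collections import Counter

def failure_patterns(failed_traces: list[dict]) -> dict:
    """Group failures along the dimensions used during investigation:
    which tool, which error type, which user segment."""
    return {
        "by_tool": Counter(t.get("failed_tool") for t in failed_traces),
        "by_error": Counter(t.get("error_type") for t in failed_traces),
        "by_segment": Counter(t.get("user_segment") for t in failed_traces),
    }
```

If one bucket dominates (say, 90% of failures share an error type), you have a candidate root cause; a flat distribution suggests a broader regression such as a model or prompt change.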
Implementation Checklist
Instrumentation:
- Add trace spans for all major operations
- Capture tool calls with parameters and results
- Log retrieval queries and results
- Record token usage per call
- Store validation results
Storage:
- Set up trace storage with retention policy
- Implement PII redaction
- Configure replay capability
- Set up cost tracking
Metrics:
- Define primary KPIs (success, latency, cost)
- Create per-workflow dashboards
- Implement trend tracking
- Set up metric export
Alerting:
- Configure critical alerts (success rate, errors)
- Set up anomaly detection
- Create alert runbooks
- Test alert pathways
Integration:
- Choose observability platform
- Implement OpenTelemetry export
- Connect to existing infrastructure
- Set up cross-service correlation
FAQ
Are chat transcripts enough?
No. You need structured traces to diagnose tool errors and routing mistakes. Chat transcripts show the conversation but not the internal decisions, tool calls, or retrieval that produced the response.
How much trace data should I store?
| Retention | What to Keep |
|---|---|
| 7 days | Full traces, all spans |
| 30 days | Summary traces, sampled full |
| 90 days | Metrics and error traces only |
| Permanent | Incidents and golden tests |
Do I need OpenTelemetry?
Not strictly, but it provides:
- Standard format that works across tools
- Distributed tracing capability
- Growing ecosystem support
- Future-proof investment
How do I trace across multiple agents?
Use distributed tracing with context propagation:
- Pass trace ID between agents
- Use parent-child span relationships
- Correlate by session ID
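A sketch of what propagation looks like at the call boundary between agents (the header names are simplified illustrations in the spirit of W3C Trace Context, not the real `traceparent` format):

```python
import uuid

def make_headers(trace_id=None, parent_span_id=None) -> dict:
    """Headers one agent attaches when calling another, so both runs
    land in the same distributed trace."""
    return {
        "x-trace-id": trace_id or uuid.uuid4().hex,
        "x-parent-span-id": parent_span_id or "",
    }

def child_context(incoming_headers: dict, span_id: str) -> dict:
    """The receiving agent reuses the caller's trace ID and records
    the caller's span as its parent."""
    return {
        "trace_id": incoming_headers["x-trace-id"],
        "parent_span_id": incoming_headers.get("x-parent-span-id") or None,
        "span_id": span_id,
    }
```

With OpenTelemetry, context propagation of this kind is handled by the SDK's propagators rather than hand-rolled headers; the sketch just shows what is being threaded through.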
What’s the overhead of tracing?
| Component | Overhead |
|---|---|
| Span creation | < 1ms typically |
| Context propagation | Negligible |
| Export | Batched, async |
| Storage | Main cost driver |
Total: typically < 5% latency overhead for well-implemented tracing.
How do I balance detail vs. cost?
| Strategy | Trade-off |
|---|---|
| Sampling | 10% of traces in full detail |
| Tiered detail | More detail for failures |
| Aggregation | Metrics vs. full traces |
| Retention | Shorter window for full traces |
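Sampling and tiered detail combine naturally into a single per-trace decision. A sketch, assuming illustrative `success` and `policy_violation` fields on the trace record:

```python
import random

def sample_decision(trace: dict, base_rate: float = 0.10) -> str:
    """Decide how much of a trace to keep: failures and policy violations
    always keep full detail; successes are sampled at base_rate; the rest
    keep summary metrics only."""
    if not trace.get("success", True) or trace.get("policy_violation"):
        return "full"
    if random.random() < base_rate:
        return "full"
    return "summary"
```

Keeping every failure in full detail is usually cheap, because failures should be rare; the sampling budget is spent on the high-volume success path.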