Tool Timeouts and Retries in 2026: Building Resilient AI Agent Infrastructure
When AI tools fail, your agent needs a plan. A practical guide to timeout configuration, retry strategies, and graceful degradation for production agents.
TL;DR
- Set timeouts on all remote calls—hanging requests exhaust resources and block users indefinitely.
- Use exponential backoff with jitter for retries—prevents thundering herd when services recover.
- Retry only on transient errors (429, 5xx)—retrying on 400s or 404s wastes resources.
- Implement circuit breakers to fail fast when dependencies are unhealthy.
- Design tools to be idempotent so retries are safe.
- Define fallback behaviors: cache, degrade gracefully, or fail with a clear error.
- Monitor retry rates and timeout frequency—they’re leading indicators of system health.
The Problem: Unbounded Waits
When an AI agent calls a tool, what happens if that tool never responds? Without timeouts:
- The agent hangs indefinitely
- The user stares at a spinner
- Resources (connections, threads) are exhausted
- Other requests queue behind the stuck one
- Eventually, the entire system degrades
Timeouts are the first line of defense. Retries are the second. Together with circuit breakers and fallbacks, they form a resilient system.
Timeout Configuration
Timeout Layers
Configure timeouts at multiple levels:
| Layer | Purpose | Typical Value |
|---|---|---|
| Connection timeout | Time to establish connection | 3–5 seconds |
| Read timeout | Time waiting for first byte | 10–30 seconds |
| Total timeout | End-to-end request limit | 30–60 seconds |
| Agent turn timeout | Total time for agent to complete | 2–5 minutes |
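To make these layers concrete, here is a minimal sketch using Node's undici client (an assumption; map the options onto whatever HTTP client you use). The first three layers live on the client, while the agent turn timeout belongs in your agent loop.
import { Agent, fetch } from 'undici';

// Connection and read timeouts are configured on the agent
const agent = new Agent({
  connect: { timeout: 5_000 }, // connection timeout: 3–5s
  headersTimeout: 30_000,      // read timeout: time until response headers arrive
  bodyTimeout: 30_000,         // max gap between body chunks
});

// The total end-to-end limit is enforced per request via an abort signal
async function fetchWithLayeredTimeouts(url: string) {
  return fetch(url, {
    dispatcher: agent,
    signal: AbortSignal.timeout(60_000), // total timeout: 60s
  });
}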
Setting Appropriate Values
Base timeouts on observed latency + buffer:
// Analyze historical latency
const latencyStats = await getToolLatencyStats('search_api', {
period: '7d',
percentiles: [50, 95, 99],
});
// Result: { p50: 200ms, p95: 800ms, p99: 2000ms }
// Set timeout to p99 + buffer (2x p99 is reasonable)
const timeout = latencyStats.p99 * 2; // 4000ms
Per-Tool Timeout Configuration
Different tools need different timeouts:
const toolTimeouts: Record<string, number> = {
// Fast lookups
'get_user': 5_000,
'check_inventory': 5_000,
// API calls
'search_web': 15_000,
'call_external_api': 30_000,
// Long-running operations
'generate_report': 60_000,
'run_analysis': 120_000,
};
async function callTool(name: string, params: Record<string, unknown>) {
const timeout = toolTimeouts[name] ?? 30_000; // Default 30s
return withTimeout(
executeTool(name, params),
timeout,
`Tool ${name} timed out after ${timeout}ms`
);
}
Timeout Helper
function withTimeout<T>(
promise: Promise<T>,
ms: number,
errorMessage: string
): Promise<T> {
return Promise.race([
promise,
new Promise<never>((_, reject) =>
setTimeout(() => reject(new TimeoutError(errorMessage)), ms)
),
]);
}
class TimeoutError extends Error {
constructor(message: string) {
super(message);
this.name = 'TimeoutError';
}
}
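One caveat: Promise.race rejects the caller, but the losing promise keeps running, so the underlying request is not cancelled. For fetch-based tools, a cancellation-aware variant (a sketch, assuming the underlying call accepts an abort signal) actually stops the work:
async function fetchWithTimeout(url: string, ms: number) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await fetch(url, { signal: controller.signal });
  } catch (error) {
    // Translate aborts into the same TimeoutError the agent already handles
    if (controller.signal.aborted) {
      throw new TimeoutError(`Request to ${url} timed out after ${ms}ms`);
    }
    throw error;
  } finally {
    clearTimeout(timer); // don't leak the timer when the request wins
  }
}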
Retry Strategies
What to Retry
| Error Type | Retry? | Reasoning |
|---|---|---|
| 429 (Rate limit) | Yes, with backoff | Transient, will succeed later |
| 500 (Internal error) | Yes, limited | May be transient |
| 502, 503, 504 | Yes, with backoff | Infrastructure issues often resolve |
| Timeout | Yes, once | Connection may have been interrupted |
| 400 (Bad request) | No | Input is wrong, won’t change |
| 401, 403 | No | Auth issue, won’t change |
| 404 | No | Resource doesn’t exist |
Exponential Backoff with Jitter
Simple backoff:
Attempt 1: immediate
Attempt 2: wait 1 second
Attempt 3: wait 2 seconds
Attempt 4: wait 4 seconds
Problem: If 1,000 clients retry at the same time, they all hit the server together at each interval.
Solution: Add jitter (randomization):
function calculateBackoff(attempt: number, options: {
baseDelay: number;
maxDelay: number;
jitterFactor: number;
}): number {
const { baseDelay, maxDelay, jitterFactor } = options;
// Exponential: 1s, 2s, 4s, 8s...
const exponentialDelay = Math.min(
baseDelay * Math.pow(2, attempt - 1),
maxDelay
);
// Add jitter: ±25% randomization
const jitter = exponentialDelay * jitterFactor * (Math.random() * 2 - 1);
return Math.max(0, exponentialDelay + jitter);
}
// Usage
const backoff = calculateBackoff(attempt, {
baseDelay: 1000, // 1 second base
maxDelay: 30000, // 30 second max
jitterFactor: 0.25, // ±25% jitter
});
Complete Retry Implementation
interface RetryOptions {
maxAttempts: number;
baseDelay: number;
maxDelay: number;
retryableErrors: number[];
onRetry?: (attempt: number, error: Error) => void;
}
// Small helper used by the retry loop below
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
fn: () => Promise<T>,
options: RetryOptions
): Promise<T> {
const {
maxAttempts,
baseDelay,
maxDelay,
retryableErrors,
onRetry,
} = options;
let lastError: Error;
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
try {
return await fn();
} catch (error) {
lastError = error as Error;
// Check if retryable
const statusCode = (error as any).status ?? (error as any).statusCode;
const isRetryable =
error instanceof TimeoutError ||
retryableErrors.includes(statusCode);
if (!isRetryable || attempt === maxAttempts) {
throw error;
}
// Calculate backoff
const delay = calculateBackoff(attempt, { baseDelay, maxDelay, jitterFactor: 0.25 });
onRetry?.(attempt, error as Error);
await sleep(delay);
}
}
throw lastError!;
}
// Usage
const result = await withRetry(
() => callExternalApi(params),
{
maxAttempts: 3,
baseDelay: 1000,
maxDelay: 10000,
retryableErrors: [429, 500, 502, 503, 504],
onRetry: (attempt, error) => {
console.log(`Retry ${attempt} after error: ${error.message}`);
},
}
);
Circuit Breakers
The Pattern
When a dependency fails repeatedly, stop calling it temporarily:
CLOSED (normal)
│
├── Failure threshold exceeded
▼
OPEN (fast fail)
│
├── After timeout period
▼
HALF-OPEN (testing)
│
├── Success → CLOSED
└── Failure → OPEN
Implementation
class CircuitOpenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'CircuitOpenError';
  }
}

class CircuitBreaker {
private state: 'closed' | 'open' | 'half-open' = 'closed';
private failures = 0;
private lastFailure?: Date;
private readonly threshold: number;
private readonly resetTimeout: number;
constructor(options: { threshold: number; resetTimeout: number }) {
this.threshold = options.threshold;
this.resetTimeout = options.resetTimeout;
}
async call<T>(fn: () => Promise<T>): Promise<T> {
// Check if circuit should transition from open to half-open
if (this.state === 'open') {
if (Date.now() - this.lastFailure!.getTime() > this.resetTimeout) {
this.state = 'half-open';
} else {
throw new CircuitOpenError('Circuit is open');
}
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
private onSuccess() {
this.failures = 0;
this.state = 'closed';
}
private onFailure() {
this.failures++;
this.lastFailure = new Date();
if (this.failures >= this.threshold) {
this.state = 'open';
}
}
getState() {
return this.state;
}
}
// Usage
const searchCircuit = new CircuitBreaker({
threshold: 5, // Open after 5 failures
resetTimeout: 30000, // Try again after 30 seconds
});
async function searchWithCircuitBreaker(query: string) {
return searchCircuit.call(() => searchApi(query));
}
Per-Tool Circuit Breakers
const circuits = new Map<string, CircuitBreaker>();
function getCircuit(toolName: string): CircuitBreaker {
if (!circuits.has(toolName)) {
circuits.set(toolName, new CircuitBreaker({
threshold: 5,
resetTimeout: 30000,
}));
}
return circuits.get(toolName)!;
}
const retryOptions: RetryOptions = {
  maxAttempts: 3,
  baseDelay: 1000,
  maxDelay: 10000,
  retryableErrors: [429, 500, 502, 503, 504],
};

async function callTool(name: string, params: Record<string, unknown>) {
  const circuit = getCircuit(name);
  const timeout = toolTimeouts[name] ?? 30_000;
  return circuit.call(() =>
    withRetry(
      () => withTimeout(
        executeTool(name, params),
        timeout,
        `Tool ${name} timed out after ${timeout}ms`
      ),
      retryOptions
    )
  );
}
Fallback Strategies
When all else fails, have a plan:
Strategy 1: Cached Results
async function searchWithFallback(query: string) {
try {
const result = await searchApi(query);
await cache.set(`search:${query}`, result, { ttl: 3600 });
return result;
} catch (error) {
// Try cache
const cached = await cache.get(`search:${query}`);
if (cached) {
return { ...cached, stale: true };
}
throw error;
}
}
Strategy 2: Degraded Functionality
async function getRecommendations(userId: string) {
try {
return await recommendationService.getPersonalized(userId);
} catch (error) {
// Fall back to popular items
return await popularItemsCache.get();
}
}
Strategy 3: Graceful Error
async function toolWithFallback(name: string, params: Record<string, unknown>) {
try {
return await callTool(name, params);
} catch (error) {
if (error instanceof CircuitOpenError) {
return {
success: false,
error_type: 'service_unavailable',
error_message: `${name} is temporarily unavailable`,
retry_after: 30,
fallback_used: true,
};
}
throw error;
}
}
Monitoring and Alerting
Key Metrics
| Metric | What It Shows | Alert Threshold |
|---|---|---|
| Timeout rate | % of requests timing out | >1% |
| Retry rate | % of requests requiring retry | >5% |
| Circuit open events | Dependency failures | Any occurrence |
| P99 latency | Tail latency | >3x baseline |
| Error rate by type | Breakdown of failures | Varies |
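As a sketch of how the first alert could be wired up (metrics.queryRate and alerting.page are hypothetical helpers over your metrics backend):
// Periodic check: page when a tool's 5-minute timeout rate exceeds 1%
async function checkTimeoutRate(tool: string) {
  const timeouts = await metrics.queryRate('tool_timeouts', { tool, window: '5m' });
  const calls = await metrics.queryRate('tool_calls', { tool, window: '5m' });
  if (calls > 0 && timeouts / calls > 0.01) {
    await alerting.page(`Timeout rate for ${tool}: ${((timeouts / calls) * 100).toFixed(1)}%`);
  }
}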
Logging
interface ToolExecutionLog {
tool_name: string;
timestamp: string;
duration_ms: number;
success: boolean;
attempts: number;
timeout_occurred: boolean;
circuit_state: string;
error_type?: string;
fallback_used: boolean;
}
function logToolExecution(log: ToolExecutionLog) {
// Structured logging for analysis
logger.info('tool_execution', log);
// Metrics
metrics.histogram('tool_duration', log.duration_ms, { tool: log.tool_name });
metrics.counter('tool_calls', 1, {
tool: log.tool_name,
success: log.success,
retried: log.attempts > 1,
});
if (log.timeout_occurred) {
metrics.counter('tool_timeouts', 1, { tool: log.tool_name });
}
}
Implementation Checklist
Configuration
- Set connection timeout for all HTTP clients
- Set read timeout for all HTTP clients
- Define per-tool timeout values based on observed latency
- Configure retry policies (max attempts, backoff)
- Define retryable error codes
Implementation
- Implement timeout wrapper for all tool calls
- Implement retry logic with exponential backoff + jitter
- Add circuit breakers for each external dependency
- Design fallback strategies (cache, degrade, error)
- Ensure all tools are idempotent for safe retries
Monitoring
- Track timeout rates per tool
- Track retry rates per tool
- Alert on circuit breaker state changes
- Monitor P99 latency trends
- Log all retry attempts with context
FAQ
What should the default timeout be?
30 seconds is a reasonable default. Adjust based on observed latency. Some tools (long-running reports) need longer; some (cache lookups) should be shorter.
How many retries are appropriate?
2–3 retries for most operations. More retries delay failure notification to users. For critical operations, consider 5 retries with longer backoff.
Should I retry on timeouts?
Yes, once, with a longer timeout. The request may have succeeded but the response was lost. Ensure idempotency to avoid duplicate actions.
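One common way to get that idempotency is a client-generated key that stays constant across retries, assuming the downstream API deduplicates on it. A sketch (the endpoint and header name are illustrative, though many payment and ordering APIs accept Idempotency-Key):
import { randomUUID } from 'node:crypto';

async function postOrder(params: Record<string, unknown>, key: string) {
  const res = await fetch('https://api.example.com/orders', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'Idempotency-Key': key },
    body: JSON.stringify(params),
  });
  // Surface the status code so withRetry can classify the failure
  if (!res.ok) {
    throw Object.assign(new Error(`Order API returned ${res.status}`), { status: res.status });
  }
  return res.json();
}

async function createOrder(params: Record<string, unknown>) {
  const idempotencyKey = randomUUID(); // generated once, reused on every retry
  return withRetry(() => postOrder(params, idempotencyKey), {
    maxAttempts: 3,
    baseDelay: 1000,
    maxDelay: 10000,
    retryableErrors: [429, 500, 502, 503, 504],
  });
}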
When should circuit breakers open?
After 3–5 consecutive failures or >50% failure rate in a time window. Tune based on your reliability requirements and dependency characteristics.
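The CircuitBreaker shown earlier trips on consecutive failures. A failure-rate condition needs a sliding window; a minimal sketch of the tracking piece:
class FailureRateWindow {
  private samples: { at: number; ok: boolean }[] = [];
  constructor(private readonly windowMs: number) {}

  record(ok: boolean) {
    const now = Date.now();
    this.samples.push({ at: now, ok });
    // Drop samples that have aged out of the window
    this.samples = this.samples.filter((s) => now - s.at <= this.windowMs);
  }

  failureRate(): number {
    if (this.samples.length === 0) return 0;
    return this.samples.filter((s) => !s.ok).length / this.samples.length;
  }
}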
How do I handle partial failures?
For batch operations, continue processing remaining items and return a result with both successes and failures. Don’t fail the entire batch for one item.
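With Promise.allSettled this is straightforward. A sketch building on the callTool wrapper above:
async function callToolBatch(name: string, items: Record<string, unknown>[]) {
  const results = await Promise.allSettled(items.map((item) => callTool(name, item)));
  const succeeded: { item: Record<string, unknown>; result: unknown }[] = [];
  const failed: { item: Record<string, unknown>; error: string }[] = [];
  results.forEach((result, i) => {
    if (result.status === 'fulfilled') {
      succeeded.push({ item: items[i], result: result.value });
    } else {
      failed.push({ item: items[i], error: String(result.reason) });
    }
  });
  // Callers see both outcomes instead of an all-or-nothing error
  return { succeeded, failed };
}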
What about rate limiting from external APIs?
Handle 429 responses with the Retry-After header. Implement client-side rate limiting to avoid hitting limits. Consider request queuing for high-volume tools.
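A sketch of honoring Retry-After on a 429, assuming the error object exposes the response headers and the header carries seconds rather than an HTTP date:
async function delayFor429(
  error: { headers?: { get(name: string): string | null } },
  fallbackMs: number
) {
  const header = error.headers?.get('retry-after');
  const seconds = header ? Number(header) : NaN;
  // Fall back to normal backoff when the header is missing or not numeric
  await sleep(Number.isFinite(seconds) ? seconds * 1000 : fallbackMs);
}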