Tool Timeouts and Retries in 2026: Building Resilient AI Agent Infrastructure
When AI tools fail, your agent needs a plan. A practical guide to timeout configuration, retry strategies, and graceful degradation for production agents.
TL;DR
- Set timeouts on all remote calls—hanging requests exhaust resources and block users indefinitely.
- Use exponential backoff with jitter for retries—prevents thundering herd when services recover.
- Retry only on transient errors (429, 5xx)—retrying on 400s or 404s wastes resources.
- Implement circuit breakers to fail fast when dependencies are unhealthy.
- Design tools to be idempotent so retries are safe.
- Define fallback behaviors: cache, degrade gracefully, or fail with a clear error.
- Monitor retry rates and timeout frequency—they’re leading indicators of system health.
The Problem: Unbounded Waits
When an AI agent calls a tool, what happens if that tool never responds? Without timeouts:
- The agent hangs indefinitely
- The user stares at a spinner
- Resources (connections, threads) are exhausted
- Other requests queue behind the stuck one
- Eventually, the entire system degrades
Timeouts are the first line of defense. Retries are the second. Together with circuit breakers and fallbacks, they form a resilient system.
Timeout Configuration
Timeout Layers
Configure timeouts at multiple levels:
| Layer | Purpose | Typical Value |
|---|---|---|
| Connection timeout | Time to establish connection | 3–5 seconds |
| Read timeout | Time waiting for first byte | 10–30 seconds |
| Total timeout | End-to-end request limit | 30–60 seconds |
| Agent turn timeout | Total time for agent to complete | 2–5 minutes |
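To make these layers concrete, here is a minimal sketch using Node's undici client (an assumption; map the options onto whatever HTTP client you use). The first three layers live on the client, while the agent turn timeout belongs in your agent loop.
import { Agent, fetch } from 'undici';

// Connection and read timeouts are configured on the agent
const agent = new Agent({
  connect: { timeout: 5_000 }, // connection timeout: 3–5s
  headersTimeout: 30_000,      // read timeout: time until response headers arrive
  bodyTimeout: 30_000,         // max gap between body chunks
});

// The total end-to-end limit is enforced per request via an abort signal
async function fetchWithLayeredTimeouts(url: string) {
  return fetch(url, {
    dispatcher: agent,
    signal: AbortSignal.timeout(60_000), // total timeout: 60s
  });
}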
Setting Appropriate Values
Base timeouts on observed latency + buffer:
// Analyze historical latency
const latencyStats = await getToolLatencyStats('search_api', {
period: '7d',
percentiles: [50, 95, 99],
});
// Result: { p50: 200ms, p95: 800ms, p99: 2000ms }
// Set timeout to p99 + buffer (2x p99 is reasonable)
const timeout = latencyStats.p99 * 2; // 4000ms
Per-Tool Timeout Configuration
Different tools need different timeouts:
const toolTimeouts: Record<string, number> = {
// Fast lookups
'get_user': 5_000,
'check_inventory': 5_000,
// API calls
'search_web': 15_000,
'call_external_api': 30_000,
// Long-running operations
'generate_report': 60_000,
'run_analysis': 120_000,
};
async function callTool(name: string, params: Record<string, unknown>) {
const timeout = toolTimeouts[name] ?? 30_000; // Default 30s
return withTimeout(
executeTool(name, params),
timeout,
`Tool ${name} timed out after ${timeout}ms`
);
}
Timeout Helper
function withTimeout<T>(
promise: Promise<T>,
ms: number,
errorMessage: string
): Promise<T> {
return Promise.race([
promise,
new Promise<never>((_, reject) =>
setTimeout(() => reject(new TimeoutError(errorMessage)), ms)
),
]);
}
class TimeoutError extends Error {
constructor(message: string) {
super(message);
this.name = 'TimeoutError';
}
}
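One caveat: Promise.race rejects the caller, but the losing promise keeps running, so the underlying request is not cancelled. For fetch-based tools, a cancellation-aware variant (a sketch, assuming the underlying call accepts an abort signal) actually stops the work:
async function fetchWithTimeout(url: string, ms: number) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await fetch(url, { signal: controller.signal });
  } catch (error) {
    // Translate aborts into the same TimeoutError the agent already handles
    if (controller.signal.aborted) {
      throw new TimeoutError(`Request to ${url} timed out after ${ms}ms`);
    }
    throw error;
  } finally {
    clearTimeout(timer); // don't leak the timer when the request wins
  }
}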
Retry Strategies
What to Retry
| Error Type | Retry? | Reasoning |
|---|---|---|
| 429 (Rate limit) | Yes, with backoff | Transient, will succeed later |
| 500 (Internal error) | Yes, limited | May be transient |
| 502, 503, 504 | Yes, with backoff | Infrastructure issues often resolve |
| Timeout | Yes, once | Connection may have been interrupted |
| 400 (Bad request) | No | Input is wrong, won’t change |
| 401, 403 | No | Auth issue, won’t change |
| 404 | No | Resource doesn’t exist |
Exponential Backoff with Jitter
Simple backoff:
Attempt 1: immediate
Attempt 2: wait 1 second
Attempt 3: wait 2 seconds
Attempt 4: wait 4 seconds
Problem: If 1,000 clients retry at the same time, they all hit the server together at each interval.
Solution: Add jitter (randomization):
function calculateBackoff(attempt: number, options: {
baseDelay: number;
maxDelay: number;
jitterFactor: number;
}): number {
const { baseDelay, maxDelay, jitterFactor } = options;
// Exponential: 1s, 2s, 4s, 8s...
const exponentialDelay = Math.min(
baseDelay * Math.pow(2, attempt - 1),
maxDelay
);
// Add jitter: ±25% randomization
const jitter = exponentialDelay * jitterFactor * (Math.random() * 2 - 1);
return Math.max(0, exponentialDelay + jitter);
}
// Usage
const backoff = calculateBackoff(attempt, {
baseDelay: 1000, // 1 second base
maxDelay: 30000, // 30 second max
jitterFactor: 0.25, // ±25% jitter
});
Complete Retry Implementation
interface RetryOptions {
maxAttempts: number;
baseDelay: number;
maxDelay: number;
retryableErrors: number[];
onRetry?: (attempt: number, error: Error) => void;
}
// Small helper used by the retry loop below
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
fn: () => Promise<T>,
options: RetryOptions
): Promise<T> {
const {
maxAttempts,
baseDelay,
maxDelay,
retryableErrors,
onRetry,
} = options;
let lastError: Error;
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
try {
return await fn();
} catch (error) {
lastError = error as Error;
// Check if retryable
const statusCode = (error as any).status ?? (error as any).statusCode;
const isRetryable =
error instanceof TimeoutError ||
retryableErrors.includes(statusCode);
if (!isRetryable || attempt === maxAttempts) {
throw error;
}
// Calculate backoff
const delay = calculateBackoff(attempt, { baseDelay, maxDelay, jitterFactor: 0.25 });
onRetry?.(attempt, error as Error);
await sleep(delay);
}
}
throw lastError!;
}
// Usage
const result = await withRetry(
() => callExternalApi(params),
{
maxAttempts: 3,
baseDelay: 1000,
maxDelay: 10000,
retryableErrors: [429, 500, 502, 503, 504],
onRetry: (attempt, error) => {
console.log(`Retry ${attempt} after error: ${error.message}`);
},
}
);
Circuit Breakers
The Pattern
When a dependency fails repeatedly, stop calling it temporarily:
CLOSED (normal)
│
├── Failure threshold exceeded
▼
OPEN (fast fail)
│
├── After timeout period
▼
HALF-OPEN (testing)
│
├── Success → CLOSED
└── Failure → OPEN
Implementation
class CircuitOpenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'CircuitOpenError';
  }
}

class CircuitBreaker {
private state: 'closed' | 'open' | 'half-open' = 'closed';
private failures = 0;
private lastFailure?: Date;
private readonly threshold: number;
private readonly resetTimeout: number;
constructor(options: { threshold: number; resetTimeout: number }) {
this.threshold = options.threshold;
this.resetTimeout = options.resetTimeout;
}
async call<T>(fn: () => Promise<T>): Promise<T> {
// Check if circuit should transition from open to half-open
if (this.state === 'open') {
if (Date.now() - this.lastFailure!.getTime() > this.resetTimeout) {
this.state = 'half-open';
} else {
throw new CircuitOpenError('Circuit is open');
}
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
private onSuccess() {
this.failures = 0;
this.state = 'closed';
}
private onFailure() {
this.failures++;
this.lastFailure = new Date();
if (this.failures >= this.threshold) {
this.state = 'open';
}
}
getState() {
return this.state;
}
}
// Usage
const searchCircuit = new CircuitBreaker({
threshold: 5, // Open after 5 failures
resetTimeout: 30000, // Try again after 30 seconds
});
async function searchWithCircuitBreaker(query: string) {
return searchCircuit.call(() => searchApi(query));
}
Per-Tool Circuit Breakers
const circuits = new Map<string, CircuitBreaker>();
function getCircuit(toolName: string): CircuitBreaker {
if (!circuits.has(toolName)) {
circuits.set(toolName, new CircuitBreaker({
threshold: 5,
resetTimeout: 30000,
}));
}
return circuits.get(toolName)!;
}
const retryOptions: RetryOptions = {
  maxAttempts: 3,
  baseDelay: 1000,
  maxDelay: 10000,
  retryableErrors: [429, 500, 502, 503, 504],
};

async function callTool(name: string, params: Record<string, unknown>) {
  const circuit = getCircuit(name);
  const timeout = toolTimeouts[name] ?? 30_000;
  return circuit.call(() =>
    withRetry(
      () => withTimeout(
        executeTool(name, params),
        timeout,
        `Tool ${name} timed out after ${timeout}ms`
      ),
      retryOptions
    )
  );
}
Fallback Strategies
When all else fails, have a plan:
Strategy 1: Cached Results
async function searchWithFallback(query: string) {
try {
const result = await searchApi(query);
await cache.set(`search:${query}`, result, { ttl: 3600 });
return result;
} catch (error) {
// Try cache
const cached = await cache.get(`search:${query}`);
if (cached) {
return { ...cached, stale: true };
}
throw error;
}
}
Strategy 2: Degraded Functionality
async function getRecommendations(userId: string) {
try {
return await recommendationService.getPersonalized(userId);
} catch (error) {
// Fall back to popular items
return await popularItemsCache.get();
}
}
Strategy 3: Graceful Error
async function toolWithFallback(name: string, params: Record<string, unknown>) {
try {
return await callTool(name, params);
} catch (error) {
if (error instanceof CircuitOpenError) {
return {
success: false,
error_type: 'service_unavailable',
error_message: `${name} is temporarily unavailable`,
retry_after: 30,
fallback_used: true,
};
}
throw error;
}
}
Monitoring and Alerting
Key Metrics
| Metric | What It Shows | Alert Threshold |
|---|---|---|
| Timeout rate | % of requests timing out | >1% |
| Retry rate | % of requests requiring retry | >5% |
| Circuit open events | Dependency failures | Any occurrence |
| P99 latency | Tail latency | >3x baseline |
| Error rate by type | Breakdown of failures | Varies |
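As a sketch of how the first alert could be wired up (metrics.queryRate and alerting.page are hypothetical helpers over your metrics backend):
// Periodic check: page when a tool's 5-minute timeout rate exceeds 1%
async function checkTimeoutRate(tool: string) {
  const timeouts = await metrics.queryRate('tool_timeouts', { tool, window: '5m' });
  const calls = await metrics.queryRate('tool_calls', { tool, window: '5m' });
  if (calls > 0 && timeouts / calls > 0.01) {
    await alerting.page(`Timeout rate for ${tool}: ${((timeouts / calls) * 100).toFixed(1)}%`);
  }
}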
Logging
interface ToolExecutionLog {
tool_name: string;
timestamp: string;
duration_ms: number;
success: boolean;
attempts: number;
timeout_occurred: boolean;
circuit_state: string;
error_type?: string;
fallback_used: boolean;
}
function logToolExecution(log: ToolExecutionLog) {
// Structured logging for analysis
logger.info('tool_execution', log);
// Metrics
metrics.histogram('tool_duration', log.duration_ms, { tool: log.tool_name });
metrics.counter('tool_calls', 1, {
tool: log.tool_name,
success: log.success,
retried: log.attempts > 1,
});
if (log.timeout_occurred) {
metrics.counter('tool_timeouts', 1, { tool: log.tool_name });
}
}
Implementation Checklist
Configuration
- Set connection timeout for all HTTP clients
- Set read timeout for all HTTP clients
- Define per-tool timeout values based on observed latency
- Configure retry policies (max attempts, backoff)
- Define retryable error codes
Implementation
- Implement timeout wrapper for all tool calls
- Implement retry logic with exponential backoff + jitter
- Add circuit breakers for each external dependency
- Design fallback strategies (cache, degrade, error)
- Ensure all tools are idempotent for safe retries
Monitoring
- Track timeout rates per tool
- Track retry rates per tool
- Alert on circuit breaker state changes
- Monitor P99 latency trends
- Log all retry attempts with context
FAQ
What should the default timeout be?
30 seconds is a reasonable default. Adjust based on observed latency. Some tools (long-running reports) need longer; some (cache lookups) should be shorter.
How many retries are appropriate?
2–3 retries for most operations. More retries delay failure notification to users. For critical operations, consider 5 retries with longer backoff.
Should I retry on timeouts?
Yes, once, with a longer timeout. The request may have succeeded but the response was lost. Ensure idempotency to avoid duplicate actions.
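One common way to get that idempotency is a client-generated key that stays constant across retries, assuming the downstream API deduplicates on it. A sketch (the endpoint and header name are illustrative, though many payment and ordering APIs accept Idempotency-Key):
import { randomUUID } from 'node:crypto';

async function postOrder(params: Record<string, unknown>, key: string) {
  const res = await fetch('https://api.example.com/orders', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'Idempotency-Key': key },
    body: JSON.stringify(params),
  });
  // Surface the status code so withRetry can classify the failure
  if (!res.ok) {
    throw Object.assign(new Error(`Order API returned ${res.status}`), { status: res.status });
  }
  return res.json();
}

async function createOrder(params: Record<string, unknown>) {
  const idempotencyKey = randomUUID(); // generated once, reused on every retry
  return withRetry(() => postOrder(params, idempotencyKey), {
    maxAttempts: 3,
    baseDelay: 1000,
    maxDelay: 10000,
    retryableErrors: [429, 500, 502, 503, 504],
  });
}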
When should circuit breakers open?
After 3–5 consecutive failures or >50% failure rate in a time window. Tune based on your reliability requirements and dependency characteristics.
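The CircuitBreaker shown earlier trips on consecutive failures. A failure-rate condition needs a sliding window; a minimal sketch of the tracking piece:
class FailureRateWindow {
  private samples: { at: number; ok: boolean }[] = [];
  constructor(private readonly windowMs: number) {}

  record(ok: boolean) {
    const now = Date.now();
    this.samples.push({ at: now, ok });
    // Drop samples that have aged out of the window
    this.samples = this.samples.filter((s) => now - s.at <= this.windowMs);
  }

  failureRate(): number {
    if (this.samples.length === 0) return 0;
    return this.samples.filter((s) => !s.ok).length / this.samples.length;
  }
}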
How do I handle partial failures?
For batch operations, continue processing remaining items and return a result with both successes and failures. Don’t fail the entire batch for one item.
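With Promise.allSettled this is straightforward. A sketch building on the callTool wrapper above:
async function callToolBatch(name: string, items: Record<string, unknown>[]) {
  const results = await Promise.allSettled(items.map((item) => callTool(name, item)));
  const succeeded: { item: Record<string, unknown>; result: unknown }[] = [];
  const failed: { item: Record<string, unknown>; error: string }[] = [];
  results.forEach((result, i) => {
    if (result.status === 'fulfilled') {
      succeeded.push({ item: items[i], result: result.value });
    } else {
      failed.push({ item: items[i], error: String(result.reason) });
    }
  });
  // Callers see both outcomes instead of an all-or-nothing error
  return { succeeded, failed };
}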
What about rate limiting from external APIs?
Handle 429 responses with the Retry-After header. Implement client-side rate limiting to avoid hitting limits. Consider request queuing for high-volume tools.
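A sketch of honoring Retry-After on a 429, assuming the error object exposes the response headers and the header carries seconds rather than an HTTP date:
async function delayFor429(
  error: { headers?: { get(name: string): string | null } },
  fallbackMs: number
) {
  const header = error.headers?.get('retry-after');
  const seconds = header ? Number(header) : NaN;
  // Fall back to normal backoff when the header is missing or not numeric
  await sleep(Number.isFinite(seconds) ? seconds * 1000 : fallbackMs);
}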