
Tool Timeouts and Retries in 2026: Building Resilient AI Agent Infrastructure

When AI tools fail, your agent needs a plan. A practical guide to timeout configuration, retry strategies, and graceful degradation for production agents.

14 min · January 6, 2026 · Updated January 27, 2026

TL;DR

  • Set timeouts on all remote calls—hanging requests exhaust resources and block users indefinitely.
  • Use exponential backoff with jitter for retries—prevents thundering herd when services recover.
  • Retry only on transient errors (429, 5xx)—retrying on 400s or 404s wastes resources.
  • Implement circuit breakers to fail fast when dependencies are unhealthy.
  • Design tools to be idempotent so retries are safe.
  • Define fallback behaviors: cache, degrade gracefully, or fail with a clear error.
  • Monitor retry rates and timeout frequency—they’re leading indicators of system health.

The Problem: Unbounded Waits

When an AI agent calls a tool, what happens if that tool never responds? Without timeouts:

  • The agent hangs indefinitely
  • User stares at a spinner
  • Resources (connections, threads) are exhausted
  • Other requests queue behind the stuck one
  • Eventually, the entire system degrades

Timeouts are the first line of defense. Retries are the second. Together with circuit breakers and fallbacks, they form a resilient system.

Timeout Configuration

Timeout Layers

Configure timeouts at multiple levels:

| Layer | Purpose | Typical Value |
| --- | --- | --- |
| Connection timeout | Time to establish connection | 3–5 seconds |
| Read timeout | Time waiting for first byte | 10–30 seconds |
| Total timeout | End-to-end request limit | 30–60 seconds |
| Agent turn timeout | Total time for agent to complete | 2–5 minutes |
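These layers compose: connection and read timeouts usually live in the HTTP client's own configuration, while the total budget can be enforced at the call site. A minimal sketch using the standard `AbortSignal.timeout` (available in modern Node and browsers):

```typescript
// Sketch: enforcing the total (end-to-end) timeout layer at the call site.
// Connection and read timeouts are configured on the HTTP client itself;
// AbortSignal.timeout covers the overall request budget.
function totalTimeoutSignal(totalMs: number): AbortSignal {
  return AbortSignal.timeout(totalMs);
}

// Usage: abort the whole request if it exceeds 30 seconds end to end.
// const res = await fetch(url, { signal: totalTimeoutSignal(30_000) });
```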

Setting Appropriate Values

Base timeouts on observed latency + buffer:

// Analyze historical latency
const latencyStats = await getToolLatencyStats('search_api', {
  period: '7d',
  percentiles: [50, 95, 99],
});

// Result: { p50: 200ms, p95: 800ms, p99: 2000ms }

// Set timeout to p99 + buffer (2x p99 is reasonable)
const timeout = latencyStats.p99 * 2; // 4000ms

Per-Tool Timeout Configuration

Different tools need different timeouts:

const toolTimeouts: Record<string, number> = {
  // Fast lookups
  'get_user': 5_000,
  'check_inventory': 5_000,
  
  // API calls
  'search_web': 15_000,
  'call_external_api': 30_000,
  
  // Long-running operations
  'generate_report': 60_000,
  'run_analysis': 120_000,
};

async function callTool(name: string, params: Record<string, unknown>) {
  const timeout = toolTimeouts[name] ?? 30_000; // Default 30s
  
  return withTimeout(
    executeTool(name, params),
    timeout,
    `Tool ${name} timed out after ${timeout}ms`
  );
}

Timeout Helper

function withTimeout<T>(
  promise: Promise<T>,
  ms: number,
  errorMessage: string
): Promise<T> {
  let timer: ReturnType<typeof setTimeout>;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(errorMessage)), ms);
  });
  // Clear the pending timer once the race settles so it doesn't keep the
  // event loop alive (or fire a stray rejection) after the call completes.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer!));
}

class TimeoutError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'TimeoutError';
  }
}

Retry Strategies

What to Retry

| Error Type | Retry? | Reasoning |
| --- | --- | --- |
| 429 (Rate limit) | Yes, with backoff | Transient, will succeed later |
| 500 (Internal error) | Yes, limited | May be transient |
| 502, 503, 504 | Yes, with backoff | Infrastructure issues often resolve |
| Timeout | Yes, once | Connection may have been interrupted |
| 400 (Bad request) | No | Input is wrong, won’t change |
| 401, 403 | No | Auth issue, won’t change |
| 404 | No | Resource doesn’t exist |
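The table condenses into a small classifier. This is a sketch assuming errors expose a numeric `status` field and timeouts are signaled via the `TimeoutError` convention used earlier; neither is a library API.

```typescript
// Retryable statuses per the table: rate limits and server-side errors.
const RETRYABLE_STATUS = new Set([429, 500, 502, 503, 504]);

function isRetryable(error: { name?: string; status?: number }): boolean {
  if (error.name === 'TimeoutError') return true; // retry once, with care
  return error.status !== undefined && RETRYABLE_STATUS.has(error.status);
}
```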

Exponential Backoff with Jitter

Simple backoff:

Attempt 1: immediate
Attempt 2: wait 1 second
Attempt 3: wait 2 seconds
Attempt 4: wait 4 seconds

Problem: If 1,000 clients retry at the same time, they all hit the server together at each interval.

Solution: Add jitter (randomization):

function calculateBackoff(attempt: number, options: {
  baseDelay: number;
  maxDelay: number;
  jitterFactor: number;
}): number {
  const { baseDelay, maxDelay, jitterFactor } = options;
  
  // Exponential: 1s, 2s, 4s, 8s...
  const exponentialDelay = Math.min(
    baseDelay * Math.pow(2, attempt - 1),
    maxDelay
  );
  
  // Add jitter: ±25% randomization
  const jitter = exponentialDelay * jitterFactor * (Math.random() * 2 - 1);
  
  return Math.max(0, exponentialDelay + jitter);
}

// Usage
const backoff = calculateBackoff(attempt, {
  baseDelay: 1000,      // 1 second base
  maxDelay: 30000,      // 30 second max
  jitterFactor: 0.25,   // ±25% jitter
});

Complete Retry Implementation

interface RetryOptions {
  maxAttempts: number;
  baseDelay: number;
  maxDelay: number;
  retryableErrors: number[];
  onRetry?: (attempt: number, error: Error) => void;
}

async function withRetry<T>(
  fn: () => Promise<T>,
  options: RetryOptions
): Promise<T> {
  const {
    maxAttempts,
    baseDelay,
    maxDelay,
    retryableErrors,
    onRetry,
  } = options;
  
  let lastError: Error;
  
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error as Error;
      
      // Check if retryable
      const statusCode = (error as any).status ?? (error as any).statusCode;
      const isRetryable = 
        error instanceof TimeoutError ||
        retryableErrors.includes(statusCode);
      
      if (!isRetryable || attempt === maxAttempts) {
        throw error;
      }
      
      // Calculate backoff
      const delay = calculateBackoff(attempt, { baseDelay, maxDelay, jitterFactor: 0.25 });
      
      onRetry?.(attempt, error as Error);
      
      await sleep(delay);
    }
  }
  
  throw lastError!;
}

// Minimal sleep helper used above
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Usage
const result = await withRetry(
  () => callExternalApi(params),
  {
    maxAttempts: 3,
    baseDelay: 1000,
    maxDelay: 10000,
    retryableErrors: [429, 500, 502, 503, 504],
    onRetry: (attempt, error) => {
      console.log(`Retry ${attempt} after error: ${error.message}`);
    },
  }
);

Circuit Breakers

The Pattern

When a dependency fails repeatedly, stop calling it temporarily:

CLOSED (normal)
    │ failure threshold exceeded
    ▼
OPEN (fast fail)
    │ after reset timeout elapses
    ▼
HALF-OPEN (testing)
    ├── success → CLOSED
    └── failure → OPEN

Implementation

class CircuitBreaker {
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private failures = 0;
  private lastFailure?: Date;
  private readonly threshold: number;
  private readonly resetTimeout: number;
  
  constructor(options: { threshold: number; resetTimeout: number }) {
    this.threshold = options.threshold;
    this.resetTimeout = options.resetTimeout;
  }
  
  async call<T>(fn: () => Promise<T>): Promise<T> {
    // Check if circuit should transition from open to half-open
    if (this.state === 'open') {
      if (Date.now() - this.lastFailure!.getTime() > this.resetTimeout) {
        this.state = 'half-open';
      } else {
        throw new CircuitOpenError('Circuit is open');
      }
    }
    
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }
  
  private onSuccess() {
    this.failures = 0;
    this.state = 'closed';
  }
  
  private onFailure() {
    this.failures++;
    this.lastFailure = new Date();
    
    if (this.failures >= this.threshold) {
      this.state = 'open';
    }
  }
  
  getState() {
    return this.state;
  }
}

class CircuitOpenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'CircuitOpenError';
  }
}

// Usage
const searchCircuit = new CircuitBreaker({
  threshold: 5,        // Open after 5 failures
  resetTimeout: 30000, // Try again after 30 seconds
});

async function searchWithCircuitBreaker(query: string) {
  return searchCircuit.call(() => searchApi(query));
}

Per-Tool Circuit Breakers

const circuits = new Map<string, CircuitBreaker>();

function getCircuit(toolName: string): CircuitBreaker {
  if (!circuits.has(toolName)) {
    circuits.set(toolName, new CircuitBreaker({
      threshold: 5,
      resetTimeout: 30000,
    }));
  }
  return circuits.get(toolName)!;
}

async function callTool(name: string, params: Record<string, unknown>) {
  const circuit = getCircuit(name);
  const timeout = toolTimeouts[name] ?? 30_000; // Default 30s
  
  return circuit.call(() => 
    withRetry(
      () => withTimeout(
        executeTool(name, params),
        timeout,
        `Tool ${name} timed out after ${timeout}ms`
      ),
      retryOptions
    )
  );
}

Fallback Strategies

When all else fails, have a plan:

Strategy 1: Cached Results

async function searchWithFallback(query: string) {
  try {
    const result = await searchApi(query);
    await cache.set(`search:${query}`, result, { ttl: 3600 });
    return result;
  } catch (error) {
    // Try cache
    const cached = await cache.get(`search:${query}`);
    if (cached) {
      return { ...cached, stale: true };
    }
    throw error;
  }
}

Strategy 2: Degraded Functionality

async function getRecommendations(userId: string) {
  try {
    return await recommendationService.getPersonalized(userId);
  } catch (error) {
    // Fall back to popular items
    return await popularItemsCache.get();
  }
}

Strategy 3: Graceful Error

async function toolWithFallback(name: string, params: Record<string, unknown>) {
  try {
    return await callTool(name, params);
  } catch (error) {
    if (error instanceof CircuitOpenError) {
      return {
        success: false,
        error_type: 'service_unavailable',
        error_message: `${name} is temporarily unavailable`,
        retry_after: 30,
        fallback_used: true,
      };
    }
    throw error;
  }
}

Monitoring and Alerting

Key Metrics

| Metric | What It Shows | Alert Threshold |
| --- | --- | --- |
| Timeout rate | % of requests timing out | >1% |
| Retry rate | % of requests requiring retry | >5% |
| Circuit open events | Dependency failures | Any occurrence |
| P99 latency | Tail latency | >3x baseline |
| Error rate by type | Breakdown of failures | Varies |
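One way to turn the table's thresholds into alert conditions, sketched with placeholder counts (the rolling counters would come from whatever metrics backend you use):

```typescript
// Sketch: deriving alert conditions from rolling counts over a window.
function rate(events: number, total: number): number {
  return total === 0 ? 0 : events / total;
}

const timeoutAlert = rate(12, 1_000) > 0.01; // timeout rate above 1%
const retryAlert = rate(30, 1_000) > 0.05;   // retry rate above 5%
```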

Logging

interface ToolExecutionLog {
  tool_name: string;
  timestamp: string;
  duration_ms: number;
  success: boolean;
  attempts: number;
  timeout_occurred: boolean;
  circuit_state: string;
  error_type?: string;
  fallback_used: boolean;
}

function logToolExecution(log: ToolExecutionLog) {
  // Structured logging for analysis
  logger.info('tool_execution', log);
  
  // Metrics
  metrics.histogram('tool_duration', log.duration_ms, { tool: log.tool_name });
  metrics.counter('tool_calls', 1, { 
    tool: log.tool_name, 
    success: log.success,
    retried: log.attempts > 1,
  });
  
  if (log.timeout_occurred) {
    metrics.counter('tool_timeouts', 1, { tool: log.tool_name });
  }
}

Implementation Checklist

Configuration

  • Set connection timeout for all HTTP clients
  • Set read timeout for all HTTP clients
  • Define per-tool timeout values based on observed latency
  • Configure retry policies (max attempts, backoff)
  • Define retryable error codes

Implementation

  • Implement timeout wrapper for all tool calls
  • Implement retry logic with exponential backoff + jitter
  • Add circuit breakers for each external dependency
  • Design fallback strategies (cache, degrade, error)
  • Ensure all tools are idempotent for safe retries

Monitoring

  • Track timeout rates per tool
  • Track retry rates per tool
  • Alert on circuit breaker state changes
  • Monitor P99 latency trends
  • Log all retry attempts with context

FAQ

What should the default timeout be?

30 seconds is a reasonable default. Adjust based on observed latency. Some tools (long-running reports) need longer; some (cache lookups) should be shorter.

How many retries are appropriate?

2–3 retries for most operations. More retries delay failure notification to users. For critical operations, consider 5 retries with longer backoff.

Should I retry on timeouts?

Yes, once, with a longer timeout. The request may have succeeded but the response was lost. Ensure idempotency to avoid duplicate actions.
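A common way to make a write safe to retry is an idempotency key generated once per logical operation. A sketch (the `Idempotency-Key` header is a widespread convention, but support varies by API):

```typescript
import { randomUUID } from 'node:crypto';

// The key is created ONCE, before any retry loop, so every attempt of the
// same logical operation sends the same key and the server can deduplicate.
function buildWriteRequest(params: Record<string, unknown>) {
  const idempotencyKey = randomUUID();
  return () => ({
    method: 'POST' as const,
    headers: {
      'Content-Type': 'application/json',
      'Idempotency-Key': idempotencyKey, // identical on every attempt
    },
    body: JSON.stringify(params),
  });
}
```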

When should circuit breakers open?

After 3–5 consecutive failures or >50% failure rate in a time window. Tune based on your reliability requirements and dependency characteristics.
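The `CircuitBreaker` shown earlier trips on consecutive failures; a failure-rate trigger over a sliding window could be sketched like this:

```typescript
// Sketch: sliding-window failure rate, an alternative trip condition to the
// consecutive-failure counter used in the CircuitBreaker above.
class FailureWindow {
  private outcomes: { ok: boolean; at: number }[] = [];

  constructor(private readonly windowMs: number) {}

  record(ok: boolean, now = Date.now()): void {
    this.outcomes.push({ ok, at: now });
    // Drop outcomes that have aged out of the window
    this.outcomes = this.outcomes.filter((o) => now - o.at <= this.windowMs);
  }

  failureRate(now = Date.now()): number {
    const recent = this.outcomes.filter((o) => now - o.at <= this.windowMs);
    if (recent.length === 0) return 0;
    return recent.filter((o) => !o.ok).length / recent.length;
  }
}
```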

How do I handle partial failures?

For batch operations, continue processing remaining items and return a result with both successes and failures. Don’t fail the entire batch for one item.
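A sketch of that shape for batch tool calls (`processItem` stands in for any per-item tool invocation):

```typescript
// Process every item, collecting successes and failures instead of
// aborting the whole batch on the first error.
async function processBatch<T, R>(
  items: T[],
  processItem: (item: T) => Promise<R>
): Promise<{
  succeeded: { item: T; result: R }[];
  failed: { item: T; error: string }[];
}> {
  const succeeded: { item: T; result: R }[] = [];
  const failed: { item: T; error: string }[] = [];

  for (const item of items) {
    try {
      succeeded.push({ item, result: await processItem(item) });
    } catch (error) {
      failed.push({ item, error: (error as Error).message });
    }
  }
  return { succeeded, failed };
}
```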

What about rate limiting from external APIs?

Handle 429 responses with the Retry-After header. Implement client-side rate limiting to avoid hitting limits. Consider request queuing for high-volume tools.
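For the Retry-After part, a sketch handling the delay-seconds form of the header (per the HTTP spec it can also be an HTTP date, which this sketch simply falls back on):

```typescript
// Prefer the server's Retry-After hint over our own backoff when present.
function delayFrom429(retryAfter: string | null, fallbackMs: number): number {
  const seconds = Number(retryAfter);
  if (retryAfter !== null && Number.isFinite(seconds) && seconds >= 0) {
    return seconds * 1000;
  }
  return fallbackMs; // header absent or an HTTP-date: use computed backoff
}
```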

