Reliability
How AI Agents Handle Tool Failures
Every production AI agent hits tool failures. The agents that ship have six reliability patterns baked in from day one. Here is each pattern with working code and real numbers from 1.2M routed calls.
Tool failures are the silent killer of AI agent products. The language model works fine, the orchestration framework works fine, the UI is polished. Then a vendor has a bad afternoon, their API starts returning 503s, and your agent spirals. Users see a stuck spinner. You see a bill for 40,000 retry tokens.
We measured failure rates across 1.2M tool calls routed through ToolRoute over the last 90 days. The numbers were sobering:
- Average tool success rate: 97.8% across 51 tools
- Worst tool success rate: 91.2% (a popular search API during a 6-hour degradation)
- Median p95 latency: 2.1s
- Calls that hit at least one 429: 3.4% of all calls
A 2.2% failure rate sounds small. For an agent that chains 10 tool calls, it compounds to a 20% workflow failure rate. That is the difference between a product users trust and a product they abandon. The six patterns below cut workflow failures by 80% or more.
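The compounding is plain multiplied probabilities: a workflow succeeds only if every call in the chain does. A quick sketch makes the math concrete (the 97.8% figure is from the data above):

```javascript
// Probability that a chain of n tool calls all succeed,
// given a per-call success rate p.
function workflowSuccessRate(p, n) {
  return Math.pow(p, n);
}

// 97.8% per-call success over a 10-call chain:
// 0.978^10 ≈ 0.80, i.e. roughly 1 in 5 workflows fails.
const chain = workflowSuccessRate(0.978, 10);
console.log((1 - chain).toFixed(3)); // ≈ 0.199
```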
Pattern 1: Smart Retries with Exponential Backoff
The most common mistake is retrying everything. Do not retry 401 (auth failure), 403 (permission), 400 (bad request), or 404 (not found). These are permanent. Retrying them wastes tokens and delays the real error the agent needs to reason about.
The correct retry set is narrow: 500, 502, 503, 504 (server errors), 429 (rate limit, with Retry-After respect), and network errors (timeouts, DNS failures). Exponential backoff with jitter prevents the thundering herd when many agents retry the same failing tool simultaneously.
```javascript
async function callWithRetry(fn, maxAttempts = 3) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const isRetryable =
        err.status >= 500 || err.status === 429 || err.code === "ECONNRESET";
      if (!isRetryable || attempt === maxAttempts - 1) throw err;
      // Exponential backoff with jitter: 200ms, 400ms, 800ms +/- 30%
      const base = 200 * Math.pow(2, attempt);
      const jitter = base * 0.3 * (Math.random() * 2 - 1);
      const delay = err.retryAfter ? err.retryAfter * 1000 : base + jitter;
      await new Promise(r => setTimeout(r, delay));
    }
  }
}
```

Three attempts is the sweet spot. Two is not enough for transient blips, four wastes time on tools that are truly down. Our data shows 80% of transient failures recover on the second attempt, 95% on the third. Attempts four and beyond recover less than 1% of the time.
Pattern 2: Circuit Breakers
Retries help with individual call failures. Circuit breakers help with tool-wide outages. When a tool is genuinely down, every retry wastes tokens and delays the workflow. A circuit breaker detects sustained failure and stops calling the tool for a cooldown period.
The pattern has three states:
- Closed: Normal operation, requests flow through.
- Open: Failure threshold crossed (e.g., 5 failures in 30 seconds). All requests fail fast or route to fallback.
- Half-open: After cooldown, one test request. If it succeeds, close. If it fails, reopen.
A well-tuned breaker fails fast during an outage (milliseconds instead of seconds per call) and recovers automatically when the tool comes back. Without breakers, a 30-minute vendor outage can cost thousands of dollars in wasted retry tokens across a fleet of agents.
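The three-state machine above is small enough to sketch in full. This is a minimal in-memory version; the names and thresholds (`failureThreshold`, `windowMs`, `cooldownMs`) are illustrative defaults, not any particular library's API:

```javascript
// Minimal three-state circuit breaker: closed -> open -> half-open.
// Thresholds are illustrative; tune them per tool.
class CircuitBreaker {
  constructor({ failureThreshold = 5, windowMs = 30_000, cooldownMs = 60_000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.windowMs = windowMs;
    this.cooldownMs = cooldownMs;
    this.failures = [];   // timestamps of recent failures
    this.openedAt = null; // null means the breaker is closed
  }

  state(now = Date.now()) {
    if (this.openedAt === null) return "closed";
    return now - this.openedAt >= this.cooldownMs ? "half-open" : "open";
  }

  async call(fn, now = Date.now()) {
    if (this.state(now) === "open") {
      throw new Error("CIRCUIT_OPEN"); // fail fast, no tokens burned
    }
    try {
      const result = await fn();
      this.failures = [];
      this.openedAt = null; // a success in half-open closes the breaker
      return result;
    } catch (err) {
      this.failures = this.failures.filter(t => now - t < this.windowMs);
      this.failures.push(now);
      if (this.openedAt !== null || this.failures.length >= this.failureThreshold) {
        this.openedAt = now; // trip (or re-trip from half-open)
      }
      throw err;
    }
  }
}
```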
Pattern 3: Fallback Chains
When the champion tool fails, route to a substitute. Fallback chains only work when the tools are semantically interchangeable:
- Web search: Tavily → Brave → Firecrawl
- Email send: Resend → SendGrid
- Speech to text: AssemblyAI → Whisper
- Text to speech: ElevenLabs → Amazon Polly
Fallback chains do not work for tools that have unique capabilities. Stripe does not have a drop-in fallback for payment processing. A Supabase database has no substitute in the middle of a workflow. Know which capabilities are replaceable before you design the chain.
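In agent code, a chain is just an ordered loop over interchangeable tools. A minimal sketch, where `callTool` is a placeholder for whatever executor your agent uses:

```javascript
// Try each tool in priority order; return the first success.
// `callTool` stands in for your agent's actual tool executor.
async function callWithFallback(chain, args, callTool) {
  const errors = [];
  for (const tool of chain) {
    try {
      const result = await callTool(tool, args);
      return { tool, result, fallbackFrom: errors.length ? chain[0] : null };
    } catch (err) {
      errors.push({ tool, message: err.message });
    }
  }
  throw new Error(`All tools in chain failed: ${JSON.stringify(errors)}`);
}
```

Only build chains over the replaceable categories listed above; a chain over non-interchangeable tools just hides real failures.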
Gateways make fallback chains much cheaper to build. ToolRoute's auto-routing treats a category as the primary handle, so your agent says "search" and the gateway picks the healthiest champion from the chain at call time. No fallback logic in your agent code.
Pattern 4: Aggressive Timeouts
The most overlooked pattern. Most teams set no timeout, which means a hanging request blocks the agent indefinitely. Others set 10-second timeouts and wonder why their agent feels slow.
Timeouts should be tuned per tool based on their observed p99. A tool with a p99 of 2.5 seconds should have a 3-second timeout. A tool with a p99 of 12 seconds (some deep research APIs) needs 15. Never use a single global timeout: fast tools hide behind long timeouts, and slow tools falsely fail under short ones.
```javascript
// Per-tool timeouts learned from production p99
const TIMEOUTS = {
  tavily: 5000,
  resend: 3000,
  elevenlabs: 8000,        // TTS generates audio, takes longer
  firecrawl_crawl: 30000,  // deep crawls, give them room
  stripe: 5000,
};

async function callWithTimeout(tool, fn) {
  const timeout = TIMEOUTS[tool] || 5000;
  return Promise.race([
    fn(),
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error("TIMEOUT")), timeout)
    ),
  ]);
}
```

Pattern 5: Idempotency Keys on Every Write
Retries without idempotency are dangerous. A retry after a timeout can double-charge a customer, send two emails, create two Slack messages. The user sees a hanging spinner, clicks again, and now there are three charges.
Every write operation your agent performs needs an idempotency key. Stripe, Resend, Slack, Shopify, and most modern tools accept a header like Idempotency-Key or a body field that guarantees exactly-once semantics within a 24-hour window.
The key should be deterministic for the logical operation the agent is performing. A good key is {workflow_id}:{step}:{hash_of_inputs}. That way even if the agent is restarted or the gateway retries the same workflow step, the tool recognizes the operation and returns the original result instead of performing it again.
Pattern 6: Graceful Degradation
Sometimes the right answer is to finish the workflow without the failing step. If the email send fails but the database write succeeded, you still have a completed order, you just missed the confirmation email. A well-designed agent detects this, logs the gap, surfaces a message the user can act on, and keeps going.
Graceful degradation takes thought up front. Not every failure is equal. Classify each tool in your agent:
| Criticality | Behavior on failure | Example |
|---|---|---|
| Critical | Fail workflow, alert on-call | Payment processing |
| Important | Retry aggressively, then queue | Order database write |
| Standard | Retry, fallback, warn user | Confirmation email |
| Optional | Fail silently, log for later | Analytics event |
The difference between "Retry, fallback, warn" and "Fail workflow" is a single policy value per tool. But the UX on the other side looks like a product that works versus a product that is constantly broken.
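That single policy value can literally be a lookup table mapped to a failure handler. A sketch with hypothetical tool and context names, following the criticality table above:

```javascript
// One policy value per tool, mirroring the criticality table.
// Tool names and the ctx helpers here are illustrative.
const TOOL_POLICY = {
  stripe: "critical",
  orders_db: "important",
  resend: "standard",
  analytics: "optional",
};

// What the agent does when a tool ultimately fails after retries.
function onToolFailure(tool, err, ctx) {
  switch (TOOL_POLICY[tool] ?? "standard") {
    case "critical":
      ctx.alertOnCall(tool, err);
      throw err;                      // fail the whole workflow
    case "important":
      ctx.enqueueForRetry(tool, err); // durable queue, retry later
      return { degraded: true };
    case "standard":
      ctx.warnUser(`Skipped ${tool}: ${err.message}`);
      return { degraded: true };
    case "optional":
      ctx.log(tool, err);             // silent to the user, logged for later
      return { degraded: false };
  }
}
```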
Let the Gateway Do It
All six patterns above can be built into your agent code. They can also be done once in a gateway and inherited by every agent. The gateway approach scales better because reliability improvements benefit every agent without code changes.
ToolRoute implements retries (per-tool tuned), circuit breakers (per-tool thresholds), fallback chains (via auto-routing), timeouts (per-tool p99), idempotency pass-through, and degradation metadata on every response. Your agent code does not import a retry library or a circuit breaker. It just calls POST /api/v1/execute and reads the normalized response.
```javascript
// Response shape from ToolRoute — reliability is built in
{
  "success": true,
  "result": { ... },
  "meta": {
    "tool_used": "tavily",        // may differ from requested on fallback
    "attempts": 2,                // retries happened
    "latency_ms": 1840,
    "fallback_from": null,        // or "tavily" if primary failed
    "idempotency_key": "wf_123:step_4:abc",
    "degraded": false
  }
}
```

The 80/20 of Reliability
If you implement only three of these patterns, do retries, timeouts, and idempotency. Those three alone typically reduce workflow failures by 60%. Add circuit breakers and fallback chains for tools that serve more than 100 calls per minute. Reach for graceful degradation when you have mature product analytics telling you which tools can safely fail.
Reliability is not a single feature. It is a set of habits baked into the tool layer so the agent stays focused on reasoning instead of firefighting infrastructure.
Frequently Asked Questions
Why do AI agent tool calls fail more than regular API calls?
Agents call tools in chains where each call depends on the previous one. One failure cascades. Agents also call more tools per workflow than humans do, multiplying the odds of hitting rate limits or timeouts. Production data shows 3 to 5 times the failure rate of human-driven API calls.
What is the best retry strategy for AI agent tool calls?
Exponential backoff with jitter, capped at 3 attempts, only on retryable errors (5xx, 429, network). Never retry 4xx except 429. Jitter prevents thundering herd. Respect Retry-After headers. Log every retry for diagnosis.
Should AI agents use circuit breakers?
Yes for any tool called more than 100 times per minute. A breaker stops calling a failing tool after a threshold (5 failures in 30 seconds), returns a cached fallback or alternative, and re-tests after cooldown. Without breakers, a failing tool burns tokens while the workflow stalls.
How do fallback chains work for AI agents?
A fallback chain lists tools in priority order. If tool A fails, try B, then C. Works best when tools are semantically similar (Tavily and Brave are both web search). A gateway or framework handles the fallback transparently so the agent sees one consistent result.
What makes a tool call idempotent?
Idempotent calls produce the same result when called multiple times with the same input. Reads are naturally idempotent. Writes require an idempotency key (Stripe, Resend, Slack all support this). Without idempotency, a retry after timeout can double-charge or double-send. Always include an idempotency key on writes.
ToolRoute bakes retries, circuit breakers, fallbacks, timeouts, and idempotency into every call. Read the docs or try the playground.