Performance
MCP Tool Caching Best Practices: Cut Latency and Cost
A practical guide to caching MCP tool responses without breaking correctness. Cache key design, TTL strategy, invalidation, and what never to cache.
Agents are repetitive. An agent debugging a Next.js hook might call context7.query-docs for the same library three times in five minutes. A research agent will search the web for "MCP Streamable HTTP" on every run. A pricing comparison agent will fetch the same product page ten times across a reporting window. Each of those calls takes 500-3000ms and costs credits.
Good caching turns most of that into zero-cost, single-digit-ms lookups. Bad caching shows stale data to users and creates correctness bugs that take weeks to find. This is the shortest path we know to the first without the second.
The Cacheability Test
Before you touch code, decide whether each tool call is cacheable. Ask three questions:
- Is it a read? Writes are never cacheable on the response side. You can debounce writes, but that is a different pattern.
- Is the output a function of the input? If two identical inputs can produce different outputs (user context, personalization, time-of-day pricing), you need the request-specific context in the cache key or you cannot cache at all.
- What TTL keeps it correct? If your users tolerate a 5-minute lag, you can cache for 5 minutes. If they cannot tolerate any lag, you cannot cache this tool.
Only tool calls that answer yes-yes-nonzero survive to the next step. Everything else goes straight to the upstream.
Cache Key Design: The Part Everyone Gets Wrong
A bad cache key is the root cause of most caching bugs. A cache key has three responsibilities: it must be deterministic (the same semantic request always maps to the same key), unique (different semantic requests map to different keys), and versioned (old cached entries invalidate when the schema or adapter changes).
The working pattern is:
```typescript
key = [
  "mcp",
  SCHEMA_VERSION,       // "v3"
  tool,                 // "tavily"
  operation,            // "search"
  canonicalHash(input)
].join(":")
```

The canonicalHash step does three things: sort object keys alphabetically, normalize strings (lowercase where safe, trim whitespace, strip duplicate spaces), then SHA-256 the result. Two semantically identical queries will hash to the same key. Different queries always diverge.
The schema version prefix is critical. When you change an adapter's output shape, bumping SCHEMA_VERSION invalidates every cached entry under the old version without a single delete call. Without this, you will ship a bug fix that cannot propagate because yesterday's broken cache keeps serving the old response.
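A minimal sketch of the canonicalize-then-hash step and the key builder, using Node's built-in crypto module. The helper names are illustrative, and lowercasing every string is a simplifying assumption; in practice you would only lowercase fields where case is not significant.

```typescript
import { createHash } from "node:crypto";

// Recursively sort object keys so semantically equal inputs serialize identically.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>)
        .sort(([a], [b]) => a.localeCompare(b))
        .map(([k, v]) => [k, canonicalize(v)])
    );
  }
  if (typeof value === "string") {
    // Normalize: trim, collapse duplicate spaces, lowercase (assumed safe here).
    return value.trim().replace(/\s+/g, " ").toLowerCase();
  }
  return value;
}

function canonicalHash(input: unknown): string {
  const json = JSON.stringify(canonicalize(input));
  return createHash("sha256").update(json).digest("hex");
}

const SCHEMA_VERSION = "v3";

function cacheKey(tool: string, operation: string, input: unknown): string {
  return ["mcp", SCHEMA_VERSION, tool, operation, canonicalHash(input)].join(":");
}
```

With this, `{ query: "MCP  Streamable HTTP", limit: 5 }` and `{ limit: 5, query: " mcp streamable http " }` produce the same key, while any real difference in the input diverges.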
TTL Strategy by Tool Type
The right TTL is almost never the provider's default cache-control header. Providers set long TTLs because it helps them. You want TTLs that match how fast your users expect data to update.
| Tool Category | Typical TTL | Why |
|---|---|---|
| Documentation (Context7, DeepWiki) | 24h | Versioned, changes slowly |
| Web search (Tavily, Brave) | 15-60 min | Index lag already exists |
| Web scrape (Firecrawl) | 1-6h | Public pages update rarely |
| Product catalog | 5-15 min | Pricing can change |
| Geocoding, currency | 24h | Essentially static |
| LLM completions | None | Stochastic by design |
| Payment / billing | Never | Correctness critical |
| User inbox, real-time | Never | Freshness critical |
Start with the short end of each range. Extend only after you measure hit rate and have zero complaints about stale data. A 5-minute TTL with 70 percent hit rate is almost always better than a 24-hour TTL with a correctness bug you will find next quarter.
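The table's defaults can be encoded as a small config map. This is a sketch with illustrative category names, seeded at the short end of each range; `null` means "do not cache":

```typescript
// Per-category TTL defaults in seconds; null means never cache.
const TTL_SECONDS: Record<string, number | null> = {
  documentation: 24 * 60 * 60, // versioned, changes slowly
  webSearch: 15 * 60,          // short end of the 15-60 min range
  webScrape: 60 * 60,          // short end of the 1-6h range
  productCatalog: 5 * 60,      // pricing can change
  geocoding: 24 * 60 * 60,     // essentially static
  llmCompletion: null,         // stochastic by design
  payment: null,               // correctness critical
  realtime: null,              // freshness critical
};

function ttlFor(category: string): number | null {
  return TTL_SECONDS[category] ?? null; // unknown category: err on no cache
}
```

Defaulting unknown categories to `null` encodes the rule from the next section: when in doubt, do not cache.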
Five Caching Patterns Worth Using
1. Read-Through Cache
The workhorse pattern. Agent calls the tool. Gateway checks cache. Hit returns immediately. Miss fetches upstream, stores, returns. Ninety percent of MCP caching should be this. Add instrumentation so you can graph hit rate per tool; anything below 30 percent probably is not worth caching.
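A minimal in-process sketch of the read-through flow, with hit/miss counters so the hit rate can be graphed per tool. The class and method names are illustrative:

```typescript
type Fetcher<T> = () => Promise<T>;

interface Entry<T> { value: T; expiresAt: number }

// Read-through cache: hit returns immediately; miss fetches, stores, returns.
class ReadThroughCache<T> {
  private store = new Map<string, Entry<T>>();
  hits = 0;
  misses = 0;

  async get(key: string, ttlMs: number, fetch: Fetcher<T>): Promise<T> {
    const entry = this.store.get(key);
    if (entry && entry.expiresAt > Date.now()) {
      this.hits++;
      return entry.value;
    }
    this.misses++;
    const value = await fetch(); // miss: go upstream
    this.store.set(key, { value, expiresAt: Date.now() + ttlMs });
    return value;
  }

  hitRate(): number {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }
}
```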
2. Stale-While-Revalidate
Two TTLs per entry: a fresh TTL and a stale-tolerated TTL. If the entry is fresh, return it. If it is stale but still within tolerance, return the stale value immediately and kick off an async refresh. If it is past tolerance, treat as a miss. This pattern gives you low p99 latency because users rarely wait for the revalidation.
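The two-TTL logic can be sketched like this; the names are illustrative, and a real implementation would also want per-key locking and error reporting on failed refreshes:

```typescript
interface SwrEntry<T> {
  value: T;
  freshUntil: number; // serve directly before this
  staleUntil: number; // serve stale and refresh in background before this
}

class SwrCache<T> {
  private store = new Map<string, SwrEntry<T>>();
  private refreshing = new Set<string>();

  async get(key: string, freshMs: number, staleMs: number,
            fetch: () => Promise<T>): Promise<T> {
    const now = Date.now();
    const entry = this.store.get(key);
    if (entry && now < entry.freshUntil) return entry.value; // fresh hit
    if (entry && now < entry.staleUntil) {
      // Stale but tolerable: return immediately, refresh off the hot path.
      if (!this.refreshing.has(key)) {
        this.refreshing.add(key);
        fetch()
          .then((v) => this.set(key, v, freshMs, staleMs))
          .catch(() => {}) // keep serving stale if the refresh fails
          .finally(() => this.refreshing.delete(key));
      }
      return entry.value;
    }
    const value = await fetch(); // past tolerance: treat as a miss
    this.set(key, value, freshMs, staleMs);
    return value;
  }

  private set(key: string, value: T, freshMs: number, staleMs: number): void {
    const now = Date.now();
    this.store.set(key, { value, freshUntil: now + freshMs, staleUntil: now + staleMs });
  }
}
```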
3. Negative Caching
Cache failures too, but with a short TTL (30 seconds to 2 minutes). If a tool just returned 404 or "not found," the next identical call probably will too. Negative caching avoids hammering upstreams on repeated bad queries, which matters for rate-limited APIs.
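One way to sketch this is a read-through cache that records the outcome of a call, success or failure, and applies a much shorter TTL to failures. The shape of `Outcome` and the class name are assumptions:

```typescript
type Outcome<T> = { ok: true; value: T } | { ok: false; error: string };

interface NegEntry<T> { outcome: Outcome<T>; expiresAt: number }

// Read-through cache that also remembers failures, briefly.
class NegativeCache<T> {
  private store = new Map<string, NegEntry<T>>();

  constructor(
    private ttlMs: number,        // e.g. 15 min for successes
    private negativeTtlMs: number // e.g. 60s for failures
  ) {}

  async get(key: string, fetch: () => Promise<T>): Promise<Outcome<T>> {
    const entry = this.store.get(key);
    if (entry && entry.expiresAt > Date.now()) return entry.outcome;
    try {
      const value = await fetch();
      const outcome: Outcome<T> = { ok: true, value };
      this.store.set(key, { outcome, expiresAt: Date.now() + this.ttlMs });
      return outcome;
    } catch (e) {
      // Cache the failure briefly so repeated bad queries don't hammer upstream.
      const outcome: Outcome<T> = { ok: false, error: String(e) };
      this.store.set(key, { outcome, expiresAt: Date.now() + this.negativeTtlMs });
      return outcome;
    }
  }
}
```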
4. Request Coalescing (Single-Flight)
When 50 agents ask for the same expensive lookup at once, you want to fetch upstream once, not 50 times. Single-flight pins the first request as the in-flight fetch; the other 49 await the result instead of triggering their own. This pairs beautifully with cold-start traffic and viral agent workflows.
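The pattern fits in a few lines: keep a map of in-flight promises per key, and hand later callers the first caller's promise instead of starting a new fetch. A sketch, with illustrative names:

```typescript
// Coalesce concurrent identical requests into one upstream fetch.
class SingleFlight<T> {
  private inflight = new Map<string, Promise<T>>();

  async run(key: string, fetch: () => Promise<T>): Promise<T> {
    const existing = this.inflight.get(key);
    if (existing) return existing; // piggyback on the in-flight fetch
    const p = fetch().finally(() => this.inflight.delete(key));
    this.inflight.set(key, p);
    return p;
  }
}
```

The `finally` cleanup matters: once the fetch settles, the key is released so the next request after completion starts fresh (or hits the cache, if single-flight sits in front of one).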
5. Layered Caches (Edge + Regional + Memory)
For hot documentation lookups, a three-layer stack is reasonable: process-local LRU (sub-millisecond), regional Redis (5-15ms), and upstream (500-3000ms). The process-local layer absorbs burst traffic; the regional layer survives deploys. Do not add a layer unless you are seeing real cache stampedes.
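The lookup path for a layered stack can be sketched generically: check layers fastest-first, and on a hit, backfill the faster layers above it. The `Layer` interface is an assumption standing in for a real LRU or Redis client; it also assumes `undefined` is never a legitimate cached value:

```typescript
interface Layer<T> {
  get(key: string): Promise<T | undefined>;
  set(key: string, value: T): Promise<void>;
}

// Check layers fastest-first; on a hit, backfill the layers above it.
async function layeredGet<T>(
  layers: Layer<T>[], // e.g. [processLocalLru, regionalRedis]
  key: string,
  upstream: () => Promise<T>
): Promise<T> {
  for (let i = 0; i < layers.length; i++) {
    const hit = await layers[i].get(key);
    if (hit !== undefined) {
      await Promise.all(layers.slice(0, i).map((l) => l.set(key, hit)));
      return hit;
    }
  }
  const value = await upstream(); // every layer missed
  await Promise.all(layers.map((l) => l.set(key, value)));
  return value;
}
```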
What Never to Cache
- Any write operation, ever.
- Auth token issuance, OAuth exchanges, session creation.
- Payment, refund, invoice, subscription calls.
- User-specific inbox, notifications, DMs.
- Real-time market or inventory data.
- LLM completions (sampled, not deterministic).
- Rate-limit probe calls.
- Calls whose response includes a nonce, session ID, or expiring token.
If you are not sure, err on the side of no cache. Users forgive latency more than they forgive wrong answers.
Invalidation Strategies That Actually Work
There are four invalidation strategies worth knowing, in order of preference:
- TTL expiry. Simplest, most reliable. Use it by default.
- Version prefix bump. Ship a schema change, bump the prefix, the old cache is gone with zero deletes.
- Explicit invalidation on write. If an agent writes a new product, invalidate the product catalog cache. Works only when writes and reads share the same key namespace.
- Webhook-driven invalidation. Upstream fires a webhook when data changes; you invalidate the matching key. High complexity; use it only when the other three are not enough.
Write-through caching (the write goes to cache and upstream together) is tempting but brittle. Avoid it unless you have a specific reason.
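Strategies two and three can be sketched together, assuming the versioned key scheme from earlier (`mcp:<schemaVersion>:<tool>:<operation>:<hash>`). The helper names are illustrative:

```typescript
// In-memory stand-in for the real cache store.
const cache = new Map<string, unknown>();

let SCHEMA_VERSION = "v3";

function key(tool: string, operation: string, hash: string): string {
  return ["mcp", SCHEMA_VERSION, tool, operation, hash].join(":");
}

// Strategy 2: bump the version prefix. Old entries become unreachable with
// zero delete calls; TTL expiry or eviction reclaims the memory later.
function bumpSchemaVersion(next: string): void {
  SCHEMA_VERSION = next;
}

// Strategy 3: explicit invalidation on write, scoped to one key namespace.
// Only works because reads and writes share the same tool/operation prefix.
function invalidateNamespace(tool: string, operation: string): void {
  const prefix = ["mcp", SCHEMA_VERSION, tool, operation, ""].join(":");
  for (const k of cache.keys()) {
    if (k.startsWith(prefix)) cache.delete(k);
  }
}
```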
Measuring the Win
You cannot tune what you do not measure. Track four numbers per tool:
- Hit rate: hits / total. Under 30 percent probably means your keys are too granular or your TTL is too short.
- Cache latency vs upstream latency: if your cache is 50ms and upstream is 200ms, you are not winning much. Aim for a 10x gap minimum.
- Staleness complaints: tagged tickets or error reports where users saw outdated data. Zero is the goal; one is a signal to shorten TTL.
- Cost saved: missed upstream calls times upstream cost per call. This is the number you show to finance.
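From raw counters, the headline numbers fall out of simple arithmetic. A sketch, where the `ToolStats` shape and field names are assumptions:

```typescript
interface ToolStats {
  hits: number;
  misses: number;
  cacheMs: number[];          // sampled cache-hit latencies
  upstreamMs: number[];       // sampled upstream latencies
  upstreamCostPerCall: number;
}

// Derive the headline numbers for one tool from raw counters.
function summarize(s: ToolStats) {
  const total = s.hits + s.misses;
  const avg = (xs: number[]) =>
    xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : 0;
  return {
    hitRate: total ? s.hits / total : 0,
    latencyGap: avg(s.cacheMs) ? avg(s.upstreamMs) / avg(s.cacheMs) : Infinity,
    costSaved: s.hits * s.upstreamCostPerCall, // upstream calls not made
  };
}
```

A tool with 70 hits, 30 misses, 5ms cache latency, and 1000ms average upstream latency reports a 0.7 hit rate and a 200x latency gap, comfortably above the 10x minimum.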
Caching at the Gateway Layer
Every app team building their own MCP cache re-solves the same problems: key canonicalization, TTL selection, invalidation, observability. A gateway-level cache lifts this into shared infrastructure. At ToolRoute, the gateway can memoize read-heavy tool calls per API key with operation-aware defaults, which means documentation lookups across 51 adapters share one well-tuned cache instead of 51 leaky ones.
The related articles on what MCP gateways do, debugging MCP tool calls, and tool billing and credits cover how a gateway reduces operational burden overall. The tools directory shows which adapters are cache-friendly by default. For the terms used in this post, see the glossary.
A Cache Sanity Checklist
- Only cache reads with a TTL short enough to stay correct.
- Canonicalize inputs before hashing; sort object keys.
- Prefix keys with a schema version you bump on changes.
- Default to the short end of TTL ranges; extend on evidence.
- Use stale-while-revalidate for hot read paths.
- Add single-flight coalescing before you scale.
- Cache negatives with short TTL on 404s and empty results.
- Never cache writes, payments, auth, or real-time data.
- Measure hit rate, latency delta, staleness complaints, cost saved.
Keep caching boring. Most of the speedup comes from three or four well-chosen TTLs, not from clever invalidation schemes.
Frequently Asked Questions
Should I cache MCP tool responses?
Yes for read-heavy tools with stable outputs (docs, public web search, geocoding, static catalogs). No for writes, personalization, real-time data, or anything tied to a user session. The rule: cache if the same input reliably produces the same output for some minimum TTL.
What is a good default TTL for MCP tool caches?
No universal default. Documentation: 24 hours. Web search: 15-60 minutes. Product pricing: 1-5 minutes. User data: do not cache. Start short, extend on evidence.
How do I build a good cache key for MCP tool calls?
Three parts: tool name, operation name, canonical hash of the input. Canonicalize by sorting object keys, lowercasing where safe, normalizing whitespace. Prefix with a schema version so schema changes invalidate old entries automatically.
What MCP tool responses should I never cache?
Never cache payment and billing calls, OAuth flows, user-specific data, write operations, real-time market data, live inventory, rate-limit probes, or anything returning tokens. When in doubt, do not cache.
Related Articles
Ship one cache configuration across every MCP tool you use. See the ToolRoute docs, browse the tool catalog, or check the FAQ.