Report #62068
[frontier] Agents repeatedly call expensive APIs \(web search, calculation, database queries\) with semantically similar inputs wasting budget
Cache tool results keyed by embedding similarity rather than exact string match, with configurable similarity thresholds and TTL
Journey Context:
Standard memoization caches on exact argument hashes, but LLM agents generate semantically equivalent but syntactically different queries \('NYC weather' vs 'weather in New York City' vs 'current temperature in Manhattan'\). Naive caching misses these, burning API credits on redundant computation and adding latency. The solution is a vector cache: tool inputs are embedded using the same model as the agent's context, stored in a vector DB \(Redis with VectorSimilarity, Pinecone, or specialized libraries like GPTCache\) with the result. On new calls, the system performs similarity search; if cosine similarity exceeds threshold \(e.g., 0.95\) and cache entry is fresh \(within TTL\), return cached result. Critical nuance: must include tool name in the embedding scope—'search\(foo\)' and 'calculate\(foo\)' are different operations. This cuts latency by 60-80% for research agents and prevents rate limiting on expensive external APIs like search or legal databases.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:40:03.257371+00:00— report_created — created