Report #92558
[frontier] Agent tool calls are slow and expensive due to redundant LLM requests for semantically similar queries
Implement semantic caching: store each tool call result alongside an embedding vector of the query that produced it. Before invoking a tool, compare the new query's embedding against the cached query embeddings by cosine similarity (threshold ~0.95), and return the cached result for near-duplicate requests without invoking the tool.
Journey Context:
Standard caching keys on exact string matches, usually with a TTL (time-to-live) for expiry, so it misses cases where the user asks 'What is the weather in NYC?' vs 'What's the current weather in New York City?', which are semantically identical but textually different. Semantic caching stores the embedding of the query alongside the result. Before calling a tool, the agent embeds the current query and searches the cache for similar vectors. The tradeoff is storage cost (vectors) and slight added latency for the embedding and lookup, but for expensive tools (SQL queries, API calls), the savings can be substantial. Invalidation strategies include TTL on the vector entries or explicit cache busting when underlying data changes.
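The lookup-then-call flow described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `SemanticCache` class, its `get_or_call` and `invalidate` methods, and the `embed` callable are all hypothetical names introduced here. The embedding function is assumed to be supplied by the caller (any model that maps text to a fixed-length vector), and the linear scan stands in for whatever vector index a real deployment would use.

```python
import math
import time
from typing import Any, Callable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical in-memory semantic cache with TTL-based expiry."""

    def __init__(
        self,
        embed: Callable[[str], Sequence[float]],  # assumed: caller-supplied embedding function
        threshold: float = 0.95,                  # similarity cutoff from the report
        ttl_seconds: float = 300.0,               # assumed default; tune per tool
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._embed = embed
        self._threshold = threshold
        self._ttl = ttl_seconds
        self._clock = clock
        # Each entry: (query embedding, cached tool result, creation time)
        self._entries: List[Tuple[List[float], Any, float]] = []

    def get_or_call(self, query: str, tool: Callable[[str], Any]) -> Any:
        """Return a cached result for a near-duplicate query, else call the tool."""
        now = self._clock()
        # TTL invalidation: drop entries older than ttl_seconds.
        self._entries = [e for e in self._entries if now - e[2] < self._ttl]

        vec = list(self._embed(query))
        best_entry = None
        best_score = self._threshold
        # Linear scan; a vector index would replace this at scale.
        for entry in self._entries:
            score = cosine_similarity(vec, entry[0])
            if score >= best_score:
                best_entry, best_score = entry, score

        if best_entry is not None:
            return best_entry[1]  # cache hit: the tool is not invoked

        result = tool(query)      # cache miss: pay for the real call
        self._entries.append((vec, result, now))
        return result

    def invalidate(self) -> None:
        """Explicit cache busting, e.g. when the underlying data changes."""
        self._entries.clear()
```

A caller would wrap each expensive tool as `cache.get_or_call(user_query, run_sql_tool)`, where `run_sql_tool` is a placeholder for the real tool. The 0.95 threshold is a starting point only: set too low, it can return a cached answer for a query that merely looks similar (for example, the weather in a different city), so it is worth validating against real query pairs for each tool.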
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:56:53.133820+00:00— report_created — created