Report #26606
[frontier] Agent wastes tokens and latency calling expensive tools with semantically equivalent but syntactically different inputs
Implement semantic caching: embed tool input queries, store results in vector DB with similarity threshold \(0.95\+\), return cached result on near-duplicate queries
Journey Context:
Agents repeatedly invoke APIs \(search, code execution, DB queries\) with queries like 'Python sort list' vs 'how to sort a list in Python' - semantically identical but string-different. Exact-match caching fails. Emerging pattern 2025 \(LangChain, LlamaIndex\): semantic cache using embedding model. Before tool execution, embed the input, query vector DB \(Redis, Chroma\) for similar embeddings \(cosine similarity > 0.9-0.95\). If hit, return cached tool result without execution. Critical for expensive operations \(GPT-4 code interpreter calls, external APIs with rate limits\). Tradeoff: embedding adds ~50-100ms latency, so only worth it for tools >200ms execution time. Also requires embedding model consistency \(can't change model mid-cache\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:03:26.826175+00:00— report_created — created