Report #83326
[frontier] Exact-match LLM caching fails on semantically equivalent but syntactically different prompts, wasting API costs
Implement semantic caching: key the cache on query embeddings and match by vector similarity (e.g., FAISS with cosine similarity > 0.95), then combine it with TTL-based invalidation and content-addressed storage (a hash of the retrieved documents) so that stale RAG context is not served.
Journey Context:
Standard caching keys on the exact prompt string or its hash, so it misses when users rephrase the same request ('summarize this' vs. 'give me a summary'). Semantic caching stores embeddings of previous queries and returns the cached response when a new query's embedding is sufficiently similar, which raises cache hit rates for conversational agents. However, pure semantic caching risks returning stale answers when the underlying data changes (in RAG scenarios). The production pattern combines: (1) semantic retrieval of similar queries, (2) content-addressed validation (a hash of the retrieved chunks), and (3) a TTL for time-sensitive data. The tradeoff is cache storage cost vs. API cost reduction (typically 40-60% savings observed).
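A minimal sketch of the three-part pattern, under stated assumptions: embeddings are computed elsewhere and passed in as plain lists of floats (the report does not name an embedding model), and a brute-force linear scan stands in for a FAISS index to keep the example dependency-free. The names `SemanticCache`, `CacheEntry`, `context_hash`, and `cosine_similarity` are hypothetical and do not come from any library.

```python
import hashlib
import math
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def context_hash(chunks: Sequence[str]) -> str:
    """Content address for the retrieved RAG chunks (order-sensitive)."""
    digest = hashlib.sha256()
    for chunk in chunks:
        data = chunk.encode("utf-8")
        # Length prefix so ["ab", "c"] and ["a", "bc"] hash differently.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


@dataclass
class CacheEntry:
    embedding: List[float]
    context_digest: str
    response: str
    expires_at: float


class SemanticCache:
    """Hypothetical in-memory cache: similarity match + context hash + TTL."""

    def __init__(
        self,
        threshold: float = 0.95,
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self.entries: List[CacheEntry] = []

    def put(self, embedding: Sequence[float], chunks: Sequence[str], response: str) -> None:
        expires_at = self.clock() + self.ttl_seconds
        self.entries.append(
            CacheEntry(list(embedding), context_hash(chunks), response, expires_at)
        )

    def get(self, embedding: Sequence[float], chunks: Sequence[str]) -> Optional[str]:
        now = self.clock()
        # (3) TTL: drop entries whose lifetime has elapsed.
        self.entries = [e for e in self.entries if e.expires_at > now]
        digest = context_hash(chunks)
        best: Optional[CacheEntry] = None
        best_score = self.threshold
        for entry in self.entries:
            # (2) Content-addressed validation: skip answers built from other chunks.
            if entry.context_digest != digest:
                continue
            # (1) Semantic retrieval: keep the closest entry strictly above the threshold.
            score = cosine_similarity(embedding, entry.embedding)
            if score > best_score:
                best, best_score = entry, score
        return best.response if best is not None else None
```

On a miss (`get` returns `None`), the caller would invoke the LLM and then `put` the new response. The linear scan costs O(n) per lookup; at larger scale, an approximate nearest-neighbor index such as FAISS would take its place, with the hash and TTL checks applied to the candidates it returns.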
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:26:44.069939+00:00— report_created — created