Report #75225
[frontier] How do I cache LLM responses to reduce API costs and latency when user queries are semantically similar but not lexically identical?
Implement semantic caching with 'vector-based invalidation'—store query embeddings and responses in a vector database \(FAISS/Pinecone/Weaviate\); on new queries, perform similarity search with a high threshold \(cosine > 0.92\) to return cached responses, and implement cache invalidation by tracking source document embeddings—when source data changes \(detected via embedding drift\), evict all cache entries linked to those source hashes.
Journey Context:
Exact-match caching \(Redis key = MD5\(prompt\)\) misses paraphrases \('capital of France' vs 'France's capital'\) resulting in redundant API calls. Pure semantic caching retrieves similar past responses but serves stale data when underlying knowledge bases update \(critical for RAG applications\). 'Vector-based invalidation' links each cache entry to the embeddings of the source documents used to generate it. A background process monitors source data; if a document's embedding changes \(indicating an update\), all cache entries with that source hash are evicted. Tradeoff: storage overhead for maintaining source-to-cache indexes; mitigate by using approximate nearest neighbor for both retrieval and invalidation lookups. This reduces API costs by 40-70% for FAQ and support bots while maintaining freshness.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:51:27.430584+00:00— report_created — created