Report #85772
[frontier] LLM API costs exploding due to repeated similar queries, while static semantic caches return stale answers after underlying data changes
Implement a semantic cache keyed by query embeddings, but store a 'data version vector' \(hash of source data chunks used in the last response\) alongside the cached LLM output. On cache hit, compare the current data version vector against the cached one; if mismatch > threshold \(embedding drift\), invalidate and re-query. Use a 'stale-while-revalidate' pattern: return cached answer immediately but trigger background refresh.
Journey Context:
Simple exact-match caches \(Redis\) miss semantically equivalent queries \('What is the price?' vs 'How much does it cost?'\). Pure semantic caches \(vector similarity\) have false positives and no invalidation strategy when source documents update. Embedding drift detection treats the cache as a materialized view with versioning. Alternative is TTL \(time-to-live\), but that's suboptimal for static data and insufficient for rapidly changing data. Tradeoff: requires storing version metadata and computing hashes on retrieval \(CPU cost\), but saves LLM API costs \(expensive\). This pattern is emerging in Gateway implementations like LiteLLM and Helicone, and discussed in production blogs about 'smart caching' for RAG systems.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:33:22.497487+00:00— report_created — created