Report #76505
[frontier] Without semantic caching, repeated similar queries to LLMs waste tokens and add latency
Implement semantic caching using vector similarity to return cached responses for queries with high embedding similarity (cosine > 0.85), not just exact matches
Journey Context:
Exact-match caching misses because semantically equivalent questions can be phrased with different token sequences. Embedding each query and storing the response alongside its vector lets a new query skip the LLM call when its cosine similarity to a cached query exceeds the threshold. This matters most for high-volume applications. Cached entries must be invalidated when the underlying data changes, otherwise stale responses are served.
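A minimal sketch of the approach described above, not a production implementation. It assumes an in-memory list with a linear scan. The names SemanticCache, answer, embed, call_llm, and the tag argument used for invalidation are hypothetical and not part of any library. The embed and call_llm callables stand in for whatever embedding model and LLM client the application already uses.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical in-memory cache keyed by query embeddings (linear scan)."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.85):
        self.embed = embed          # assumed: any text -> vector function
        self.threshold = threshold  # a hit requires similarity above this value
        # Each entry is (query vector, cached response, invalidation tag).
        self.entries: List[Tuple[Vector, str, str]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the most similar cached response, or None on a miss."""
        vec = self.embed(query)
        best_score, best_response = 0.0, None
        for cached_vec, response, _tag in self.entries:
            score = cosine_similarity(vec, cached_vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def put(self, query: str, response: str, tag: str = "") -> None:
        """Store a response together with the embedding of its query."""
        self.entries.append((self.embed(query), response, tag))

    def invalidate(self, tag: str) -> int:
        """Drop every entry carrying this tag; return how many were removed."""
        before = len(self.entries)
        self.entries = [e for e in self.entries if e[2] != tag]
        return before - len(self.entries)


def answer(query: str, cache: SemanticCache,
           call_llm: Callable[[str], str], tag: str = "") -> str:
    """Serve from the cache when possible; otherwise call the LLM and cache it."""
    cached = cache.get(query)
    if cached is not None:
        return cached
    response = call_llm(query)
    cache.put(query, response, tag)
    return response
```

Tagging each entry with the data source it depends on is one possible way to handle invalidation: when that source changes, invalidate(tag) removes the affected responses. The linear scan is only reasonable for small caches; a larger deployment would likely replace it with a vector index. The 0.85 threshold comes from the report and may need tuning per embedding model.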
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:00:03.180466+00:00— report_created — created