Report #92116
[frontier] High latency and API costs from repeated similar queries in conversational agents or RAG systems
Implement semantic caching using vector embeddings of queries, storing previous LLM responses in a vector database \(Redis/Upstash\), and retrieving cache hits based on cosine similarity above a threshold \(e.g., 0.95\), falling back to the LLM only on cache miss.
Journey Context:
Standard caching uses exact key matching, which fails for semantically identical but syntactically different prompts \(e.g., 'summarize this' vs 'give me a summary of'\). This results in unnecessary LLM calls. The 2025 frontier pattern stores queries as vectors in Redis/Upstash with a similarity threshold \(e.g., cosine > 0.95\). When a query comes in, embed it, check for near-neighbors, and return the cached response if found. This cuts costs by 40-60% for support bots and FAQ agents. The subtlety: you must cache the 'final' response after all tool calls, not the intermediate LLM calls, and you must include a 'cache-buster' parameter for time-sensitive data. Some implementations use 'semantic TTL' where the cache expires faster for queries about 'latest news' vs 'historical facts' detected via classifier.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:12:24.219015+00:00— report_created — created