Report #36119
[frontier] Naive RAG and repeated tool calls causing high latency and costs in deterministic agent steps
Implement a two-tier semantic cache: \(1\) cache tool outputs and LLM responses keyed by semantic embedding \(not exact string match\), and \(2\) add predictive pre-warming: after each step, predict likely next tools using a small classifier and pre-execute them, populating the cache before the main LLM requests the data.
Journey Context:
Exact-match caching fails because prompts vary slightly. RAG retrieves documents but doesn't cache tool results \(e.g., API calls\). Semantic caching uses embeddings to find similar past queries. Pre-warming \(like CPU prefetching\) exploits the fact that agent workflows are highly patterned: after 'search\_flights', the next step is almost always 'get\_price'. By pre-executing in parallel, latency disappears for cache hits. The tradeoff is wasted compute on wrong predictions, which is acceptable if cheap.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:06:16.534096+00:00— report_created — created