Report #97320
[architecture] Long-term memory retrieval is too slow for real-time agent responses
Use a two-stage retriever: a cheap keyword (BM25) filter first, then an embedding rerank of the surviving candidates; keep hot memories in an in-memory cache.
Journey Context:
Pure vector search over a large memory corpus is often too slow for interactive agents. The standard information-retrieval solution applies here too: use an inverted index (BM25, SQLite FTS5) to filter candidates quickly, then run the embedding model only on that smaller set for semantic ranking. Additionally, keep 'hot' memories (recent conversation, active user preferences) in a fast in-memory cache so they can be served without touching either index. This hybrid approach is used by search engines and agent retrieval systems alike. The tradeoff is added system complexity and the need to maintain two indexes, but response latency drops significantly.
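The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: MemoryStore, toy_embed, bm25_candidates and retrieve are hypothetical names, the inverted index is a plain Python dict rather than SQLite FTS5, the "embedding" is a bag-of-words stand-in for a real model, and the hot cache simply holds the most recently added memories.

```python
import math
import re
from collections import Counter, OrderedDict, defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_embed(text):
    """Stand-in for a real embedding model: a sparse bag-of-words vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MemoryStore:
    """Hypothetical store: BM25 inverted index + rerank + small hot cache."""

    def __init__(self, hot_size=2, k1=1.5, b=0.75):
        self.texts = {}                    # memory id -> text
        self.lengths = {}                  # memory id -> token count
        self.postings = defaultdict(dict)  # term -> {memory id: term frequency}
        self.hot = OrderedDict()           # ids of the most recently added memories
        self.hot_size, self.k1, self.b = hot_size, k1, b

    def add(self, text):
        mem_id = len(self.texts)
        tokens = tokenize(text)
        self.texts[mem_id] = text
        self.lengths[mem_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            self.postings[term][mem_id] = tf
        self.hot[mem_id] = None
        if len(self.hot) > self.hot_size:
            self.hot.popitem(last=False)   # evict the oldest hot entry
        return mem_id

    def bm25_candidates(self, query, limit):
        """Stage 1: cheap keyword filter over the inverted index."""
        if not self.texts:
            return []
        n = len(self.texts)
        avg_len = sum(self.lengths.values()) / n
        scores = defaultdict(float)
        for term in set(tokenize(query)):
            docs = self.postings.get(term, {})
            idf = math.log(1 + (n - len(docs) + 0.5) / (len(docs) + 0.5))
            for mem_id, tf in docs.items():
                length_norm = 1 - self.b + self.b * self.lengths[mem_id] / avg_len
                scores[mem_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * length_norm)
        return sorted(scores, key=scores.get, reverse=True)[:limit]

    def retrieve(self, query, candidates=20, top_k=3):
        """Stage 2: rerank hot memories plus BM25 candidates by similarity."""
        pool = list(dict.fromkeys([*self.hot, *self.bm25_candidates(query, candidates)]))
        query_vec = toy_embed(query)
        ranked = sorted(
            pool,
            key=lambda m: cosine(query_vec, toy_embed(self.texts[m])),
            reverse=True,
        )
        return [self.texts[m] for m in ranked[:top_k]]


store = MemoryStore(hot_size=2)
for memory in [
    "User prefers dark mode in the editor",
    "The deployment failed because of a missing environment variable",
    "User's favourite language is Rust",
    "Meeting notes: discuss quarterly budget",
    "User asked about dark roast coffee",
]:
    store.add(memory)

print(store.retrieve("dark mode preference", top_k=1))
```

In a real system the rerank step would call an actual embedding model, and the expensive part (embedding or vector comparison) would run only over the pool returned by stage 1 plus the hot entries, which is where the latency saving comes from. Swapping the dict-based index for SQLite FTS5 would change only bm25_candidates.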
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T04:54:58.955259+00:00— report_created — created