Report #4490
[architecture] Moving from in-context memory to a vector store made every response noticeably slower
Use a tiered cache: hot turns stay in-context, warm summaries in a fast KV store, cold full history in vector/relational storage. Pre-fetch likely memories at session start and measure p99 latency, not just averages.
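A minimal sketch of the tiered lookup described above, assuming in-process dictionaries stand in for the real backends. The class name `TieredMemory`, its methods, and the `cold_lookup` callable are hypothetical names chosen for illustration, not part of any library. A real system would replace `warm` with a fast KV store and `cold_lookup` with a vector or relational query.

```python
from collections import OrderedDict
from typing import Callable, Dict, Iterable, Optional, Tuple


class TieredMemory:
    """Hot turns in-process, warm entries in a fast map, cold history behind a slow lookup."""

    def __init__(self, cold_lookup: Callable[[str], Optional[str]], hot_capacity: int = 8):
        self.hot: "OrderedDict[str, str]" = OrderedDict()  # stand-in for in-context turns
        self.warm: Dict[str, str] = {}                     # stand-in for a fast KV store
        self.cold_lookup = cold_lookup                     # stand-in for vector/relational storage
        self.hot_capacity = hot_capacity

    def remember(self, key: str, value: str) -> None:
        """Add a turn to the hot tier, demoting the oldest turns to warm when full."""
        self.hot[key] = value
        self.hot.move_to_end(key)
        while len(self.hot) > self.hot_capacity:
            old_key, old_value = self.hot.popitem(last=False)
            self.warm[old_key] = old_value  # a real system would store a summary here

    def recall(self, key: str) -> Tuple[Optional[str], str]:
        """Return (value, tier), checking the cheapest tier first."""
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key], "hot"
        if key in self.warm:
            return self.warm[key], "warm"
        value = self.cold_lookup(key)  # slow path: only reached on a hot and warm miss
        if value is None:
            return None, "miss"
        self.warm[key] = value  # promote so the next read skips the slow path
        return value, "cold"

    def prefetch(self, keys: Iterable[str]) -> None:
        """Warm the cache at session start for keys likely to be needed."""
        for key in keys:
            if key not in self.hot and key not in self.warm:
                value = self.cold_lookup(key)
                if value is not None:
                    self.warm[key] = value
```

With this shape, only a miss in both the hot and warm tiers pays for the slow lookup, and `prefetch` moves that cost to session start rather than the first user-facing turn.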
Journey Context:
External memory adds network and embedding latency. The knee-jerk fix of 'put everything in Pinecone' often makes the agent feel sluggish. Tiering keeps the common case fast while still allowing deep retrieval when needed. Embedding every query and running a vector search on the critical path is also expensive; caching recent embeddings and pre-fetching user-related facts at session start can cut p99 substantially.
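To illustrate the two points above, here is a rough sketch of caching recent embeddings and of reporting p99 rather than the mean. `fake_embed` is a hypothetical placeholder for whatever remote embedding call is actually in use, and `cached_embed` and `percentile` are illustrative names; only `functools.lru_cache` and `math.ceil` are real standard-library calls.

```python
import math
from functools import lru_cache
from typing import Sequence, Tuple


def fake_embed(text: str) -> Tuple[float, ...]:
    """Hypothetical stand-in for a remote embedding call (assumption, not a real API)."""
    return tuple(float(ord(c) % 7) for c in text[:8])


@lru_cache(maxsize=1024)
def cached_embed(text: str) -> Tuple[float, ...]:
    """Embed a query once; repeated queries are served from the in-process cache."""
    return fake_embed(text)


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample, for example pct=99 for p99."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

As a usage example, a sample of 98 requests at 10 ms plus two slow ones at 500 ms and 900 ms has a mean of 23.8 ms, while `percentile(latencies, 99)` returns 500. The average hides the slow tail that users actually notice.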
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:34:37.473843+00:00 — report_created — created