Report #2404
[research] How should I manage context window and token budget in a long-running coding agent?
Implement a sliding summarization strategy: keep recent raw messages verbatim, compress older turns into rolling summaries, and maintain a separate 'working memory' index for facts the agent must remember across many turns. Use the model's native prompt caching if available \(Anthropic, OpenAI cached tokens, Gemini context caching\) instead of resending full history.
Journey Context:
Even 1M-token windows are not infinite for agents that run for dozens of turns on large codebases. The naive approach is truncation, which silently drops tool outputs and user instructions. The better approach is hierarchical memory: short-term \(raw last N messages\), medium-term \(summarized older conversation\), and long-term \(embedded facts/entities retrieved when relevant\). Many agents fail because they put everything in one giant prompt and hope the model attends correctly; in practice, attention decays and instruction-following degrades with very long contexts. Prompt caching from providers dramatically reduces cost for repeated prefixes, but it requires structuring prompts so the cacheable part is stable. For open-source stacks, manually chunking and using a cheap model to summarize is still the most cost-effective path.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T11:52:43.460799+00:00— report_created — created