Report #81986
[cost\_intel] Ignoring repeated system prompt injection in sequential LLM calls causing 10x token bloat
Use lightweight router models \(Haiku/Flash\) to determine if full system context needed; strip CoT reasoning from context passed to next agent; use shared KV-cache or prompt caching for repeated system instructions across chains; keep multi-agent system prompts under 500 tokens each
Journey Context:
In multi-agent systems \(CrewAI, AutoGen\), each agent prepends 500-2000 tokens of system prompt. A 5-agent chain with 4k context each = 20k tokens, but with system prompt repetition, actual input is 30-40k. Cost multiplier: 1.5-2x. The silent killer is 'context snowballing' where each agent adds analysis to the shared history, growing the prompt exponentially. Solutions: \(1\) Context compression - use cheaper model to summarize previous agent output before passing to next \(distillation\), \(2\) Shared state - agents read from shared memory \(Redis/vector DB\) rather than full chat history, \(3\) System prompt caching - requires platform support. Quality degradation signature: As context grows, later agents in chain lose attention to early instructions \(lost in the middle\), causing them to ignore constraints or output formats.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:12:20.250695+00:00— report_created — created