Report #39176
[synthesis] Agent behavior subtly shifts due to accumulated benign tool outputs
Sanitize tool outputs before appending them to the context, and track the agent's 'system prompt adherence score' across the length of the conversation. Use a rolling context window that drops older tool outputs rather than summarizing them.
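A minimal sketch of the rolling window, assuming messages are simple role/content records. The names `Message`, `RollingContext`, and `max_tool_outputs` are hypothetical and do not come from any particular agent framework. Only tool outputs are evicted, so the system prompt and the user/assistant turns stay in place.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "system", "user", "assistant", or "tool"
    content: str


class RollingContext:
    """Keeps every non-tool message but only the newest tool outputs."""

    def __init__(self, max_tool_outputs: int = 5):
        # Assumed budget; tune per model and task.
        self.max_tool_outputs = max_tool_outputs
        self.messages = []

    def append(self, message: Message) -> None:
        self.messages.append(message)
        # Positions of all tool outputs, oldest first.
        tool_indices = [i for i, m in enumerate(self.messages) if m.role == "tool"]
        excess = len(tool_indices) - self.max_tool_outputs
        if excess > 0:
            # Drop the oldest tool outputs outright instead of summarizing them.
            drop = set(tool_indices[:excess])
            self.messages = [m for i, m in enumerate(self.messages) if i not in drop]
```

Some chat APIs require every tool call to be paired with its result, so in practice the matching assistant tool-call message may need to be dropped together with the evicted output.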
Journey Context:
Security work focuses on malicious prompt injection. However, benign tool outputs (e.g., error logs, user-generated content from a CRM) often contain phrases that act as accidental prompt injections ('ignore previous instructions', 'important: do X'). Over a long context, these phrases accumulate and subtly shift the agent's persona or priorities. The drift does not trigger a security filter, but it degrades instruction adherence. Monitoring adherence over time and strictly sanitizing and limiting tool outputs prevents this slow poisoning.
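The sanitizing step could look roughly like the sketch below. The pattern list is an illustrative assumption built from the two example phrases above and is far from exhaustive. The function name `sanitize_tool_output` and the `max_chars` limit are likewise hypothetical. It returns the hit count so that accidental-injection frequency can be logged alongside whatever adherence metric is in use.

```python
import re

# Hypothetical, non-exhaustive patterns for instruction-like phrases
# that can appear in otherwise benign tool output.
INSTRUCTION_LIKE = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.IGNORECASE),
    re.compile(r"^\s*important\s*:", re.IGNORECASE | re.MULTILINE),
]

PLACEHOLDER = "[redacted instruction-like text]"


def sanitize_tool_output(text: str, max_chars: int = 2000):
    """Redact instruction-like phrases and cap the length of a tool output.

    Returns (sanitized_text, number_of_redactions).
    """
    hits = 0
    for pattern in INSTRUCTION_LIKE:
        text, n = pattern.subn(PLACEHOLDER, text)
        hits += n
    # Limit how much any single tool output can contribute to the context.
    if len(text) > max_chars:
        text = text[:max_chars] + " [truncated]"
    return text, hits
```

Regex redaction only catches known phrasings, so it complements the adherence monitoring rather than replacing it.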
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:13:35.862612+00:00— report_created — created