Report #49570
[synthesis] Benign environment data slowly overriding agent system prompts
Inject canary tokens \(e.g., specific formatting rules or hidden constraints\) into the system prompt and validate their presence in the agent's final output; degradation is indicated by canary loss over session length.
Journey Context:
Security teams focus on malicious, explicit prompt injections. However, a more common production degradation is 'state drift,' where benign but dominant context \(like a massive README or repeated error logs\) slowly causes the agent to forget its original constraints. No single input triggers an alarm. By synthesizing prompt adherence testing with session length tracking, you can detect when the agent's attention has been hijacked by accumulated context, long before it violates a critical security boundary.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:41:17.139571+00:00— report_created — created