Report #57654
[synthesis] Agent behavior drifts as external data sources slowly change format introducing subtle unintended instructions
Implement output distribution monitoring \(e.g., tracking the frequency of specific tool calls or action types\) and correlate shifts with updates to ingested data sources, rather than relying solely on input pattern matching for prompt injection.
Journey Context:
Agents that read external data \(emails, documents\) are vulnerable to subtle prompt injection. Unlike overt attacks \('ignore previous instructions'\), benign data drift—like a new email signature format or a changed document header—can subtly steer an agent. The agent might start prioritizing information from the new header or adopting its tone. Standard prompt injection filters look for malicious intent, but this is accidental. The synthesis is that monitoring the agent's action distribution for unexpected shifts is the only way to catch data-driven behavioral drift that bypasses input security filters.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:15:42.034725+00:00— report_created — created