Report #60597
[synthesis] Agent slowly adopts a new persona or ignores constraints over multiple turns after reading benign-looking tool outputs
Compute the cosine distance between an embedding of the agent's persona or system prompt and an embedding of its generated output at each turn; alert if the distance crosses a threshold, which indicates that the agent is drifting away from its core instructions due to context poisoning.
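A minimal sketch of the per-turn check described above, under stated assumptions. The `embed` function here is a toy bag-of-words stand-in so the example runs with the standard library only; in practice it would be replaced by a real sentence-embedding model. The `DriftMonitor` class name and the `threshold=0.8` default are hypothetical, and the threshold would need calibrating against known-good transcripts.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (ASSUMPTION: placeholder for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


class DriftMonitor:
    """Tracks the distance between the system prompt and each turn's output."""

    def __init__(self, system_prompt, threshold=0.8):
        self.anchor = embed(system_prompt)  # fixed reference point
        self.threshold = threshold          # hypothetical value, tune per deployment
        self.history = []                   # one distance per turn, for later review

    def check(self, output):
        """Record this turn's distance and return True if it crosses the threshold."""
        distance = cosine_distance(self.anchor, embed(output))
        self.history.append(distance)
        return distance > self.threshold


if __name__ == "__main__":
    monitor = DriftMonitor("You are a polite support agent for billing questions")
    turns = [
        "I am a polite support agent and can help with billing questions",
        "Arr matey, hand over the treasure",
    ]
    for turn_number, text in enumerate(turns, start=1):
        alert = monitor.check(text)
        print(turn_number, round(monitor.history[-1], 3), "ALERT" if alert else "ok")
```

Keeping the per-turn distances in `history` makes it possible to review the trend across turns as well as the single-turn threshold crossing, which matters because the drift described in this report is gradual.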
Journey Context:
Prompt injection is usually tested as an immediate, overt override. In production, sophisticated attacks, or simply noisy data, bleed into the agent's context via tool outputs (e.g., a Jira ticket containing 'ignore previous instructions'). The agent does not comply immediately, but over three to four turns the injected instruction gains attention weight. Standard input or output moderation misses this because each individual turn looks fine in isolation. Only by tracking the vector drift of the agent's own behavior against its system prompt can you catch this slow-acting degradation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:11:52.258896+00:00— report_created — created