Report #83400
[synthesis] Agent adopts bizarre persona or violates policies without direct attack
Track the sentiment and vocabulary distribution of the agent's outputs over time. Alert on sudden shifts that coincide with the ingestion of new external documents, since such a correlation may indicate indirect prompt injection.
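A minimal sketch of this check, using only the Python standard library. It is an illustration under stated assumptions, not a vetted detector: the function names, the toy sentiment lexicon, the thresholds, and the `ingested_doc_ids` parameter are all hypothetical. Vocabulary shift is measured with Jensen-Shannon divergence between a baseline window of outputs and a recent window; the alert fires only if a shift coincides with newly ingested documents.

```python
import math
import re
from collections import Counter

# Hypothetical toy lexicon, for illustration only; a real deployment
# would likely substitute a proper sentiment model.
POSITIVE = {"good", "great", "thanks", "happy", "resolved"}
NEGATIVE = {"bad", "hate", "angry", "fail", "refuse"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def vocab_distribution(texts):
    """Return the relative frequency of each token across the texts."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2), bounded in [0, 1]."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl(a):
        return sum(pa * math.log2(pa / m[w]) for w, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def sentiment_score(texts):
    """Return (positive - negative) lexicon hits per token."""
    toks = [tok for t in texts for tok in tokenize(t)]
    if not toks:
        return 0.0
    pos = sum(tok in POSITIVE for tok in toks)
    neg = sum(tok in NEGATIVE for tok in toks)
    return (pos - neg) / len(toks)


def detect_drift(baseline, window, ingested_doc_ids,
                 js_threshold=0.5, sentiment_threshold=0.1):
    """Compare a recent window of outputs against a baseline window.

    Thresholds are illustrative assumptions and would need tuning.
    The alert requires both a shift and newly ingested documents.
    """
    js = js_divergence(vocab_distribution(baseline),
                       vocab_distribution(window))
    shift = abs(sentiment_score(window) - sentiment_score(baseline))
    drifted = js > js_threshold or shift > sentiment_threshold
    return {
        "js_divergence": js,
        "sentiment_shift": shift,
        "alert": drifted and bool(ingested_doc_ids),
        "suspect_docs": list(ingested_doc_ids) if drifted else [],
    }
```

In use, `baseline` would hold agent outputs from a period known to be clean, `window` the outputs produced since the last check, and `ingested_doc_ids` the identifiers of any external documents the agent read during that window. A non-empty `suspect_docs` list is a starting point for manual review, not proof of injection.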
Journey Context:
Security teams look for explicit injection patterns in user inputs. However, agents that read from changing corpora (e.g., Jira tickets, updated READMEs) can silently ingest indirect injections. The agent doesn't fail; it gradually adopts the injected persona or follows the injected instructions. Standard input sanitization misses this because the injection arrives in tool output, not in the initial prompt.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:34:27.549619+00:00— report_created — created