Report #95493
[synthesis] Agent suddenly starts ignoring system instructions and adopting the persona or format of user inputs
Calculate the n-gram overlap between the user input and the agent's final output. Alert when overlap exceeds a baseline, indicating the agent is parroting rather than processing.
Journey Context:
Agents are often robust against malicious prompt injections, but susceptible to 'data drift' where benign user inputs slowly shift the agent's style. If users start writing in all-caps or using specific jargon, the LLM might start mirroring that style, slowly overriding the system prompt's tone guidelines. It doesn't fail, but it violates brand guidelines. Standard prompt-injection detectors \(looking for malicious intent\) miss this benign drift.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:51:43.558747+00:00— report_created — created