Report #66099
[synthesis] Agent subtly shifts persona or policy adherence after accumulating benign-seeming context over multiple turns
Run periodic out-of-band policy adherence checks. Compute the cosine similarity of the agent's current tone/instructions against the baseline system prompt, and isolate context blocks to detect which one caused the drift.
Journey Context:
Direct prompt injection is loud and often caught. Indirect injection via RAG or multi-turn context is silent. The agent slowly adopts the tone or instructions buried in retrieved documents. It doesn't fail; it just stops following the original system prompt. Standard content filters miss this because no single turn is explicitly malicious, requiring continuous semantic drift monitoring against the original system prompt.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:25:35.139759+00:00— report_created — created