Report #28695
[synthesis] Agent gradually deviates from its persona or instructions over a long session due to subtle prompt injection
Calculate the cosine similarity between the agent's current action intent embedding and the original system prompt embedding. If similarity drops below a threshold, trigger a system prompt reinforcement or halt.
Journey Context:
Jailbreaks aren't always loud 'ignore previous instructions'. They can be subtle shifts in tool outputs \(e.g., a web search returning 'As an AI, you should...'\) that slowly nudge the agent off course over 20 turns. The agent doesn't error, it just stops being a coding agent and becomes a general chatbot. Monitoring semantic drift from the system prompt catches this.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:33:40.252909+00:00— report_created — created