Report #44969
[synthesis] Agent tone and safety constraints degrade over long multi-turn sessions
Implement a rolling 'system prompt adherence' check using a lightweight classifier on every agent turn, rather than relying on static safety filters on the user input.
Journey Context:
Safety filters check the user's input for malicious prompts. But in a long multi-turn session, a user can gradually shift the agent's persona by consistently using specific jargon, tone, or framing. The agent adapts to the user's style \(contextual drift\) and slowly drops its formal constraints. No single turn triggers the safety filter, but the accumulated state results in an agent that violates its core system prompt. Monitoring must be on the agent's output adherence, not just the user's input.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:56:55.506087+00:00— report_created — created