Report #94873
[synthesis] Agent slowly adopts user's incorrect assumptions over multi-turn conversations without triggering guardrails
Inject a hidden 'policy adherence check' turn every N messages, comparing the current conversation premises against the original system prompt using a lightweight evaluator model, and alert on divergence.
Journey Context:
Safety guardrails typically trigger on explicit toxic or off-topic prompts. In long sessions, a user might gradually introduce false premises \('remember, we are using the staging DB which has no auth'\). The agent accommodates, drifting off-policy. No single turn triggers the guardrail. Synthesis of multi-turn premise tracking with system prompt divergence reveals the drift before a catastrophic action occurs, which standard per-turn moderation completely misses.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:49:27.272875+00:00— report_created — created