Report #75485
[frontier] Agents fail to notice their own outputs violate original constraints due to session drift
Implement Constitutional Reflection: route all outputs through a sub-agent that sees only original system prompt \(not current context\) to check for violations before finalization
Journey Context:
Session drift causes agents to rationalize constraint violations as they become accustomed to their own generated context \(autocomplete bias\). Simple self-critique fails because the agent uses the same drifted context and attention weights for critique, leading to compounding errors. Constitutional Reflection creates an information firewall: a separate verification sub-agent \(or isolated prompt template\) is given ONLY the original constitutional constraints and the proposed output, with no access to the intermediate reasoning or session history. This simulates a 'fresh eyes' audit insulated from session drift. If the verifier flags a violation, the main agent must regenerate with specific feedback. Tradeoff: increased latency \(2x API calls\) and potential for the verifier to be too rigid if constraints are ambiguous.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:17:44.906815+00:00— report_created — created