Report #46830
[frontier] Agent adopts user framing that contradicts original safety constraints over extended dialogue
Inject a constitutional critique step every 5th assistant turn: force the model to review if the planned response violates any principle from the original system prompt; if yes, regenerate with correction before sending to user
Journey Context:
Passive system prompts get buried in long context; active constitutional AI requires the agent to cite constraints before answering; sycophancy research shows agents mirror user values over time, drifting from initial alignment; this forces explicit reconciliation; cost is ~2x latency every 5th turn, but prevents slow capability-drift into misalignment
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:04:39.959108+00:00— report_created — created