Report #43006
[frontier] Agent safety constraints degrade exponentially after 20\+ turns despite consistent initial behavior
Inject 'constitutional checkpoint' prompts every 10 turns that restate core constraints using the exact original phrasing wrapped in \[CONSTITUTIONAL RESTATEMENT\] headers, avoiding summarization
Journey Context:
This addresses the 'many-shot' drift phenomenon where benign in-context examples gradually overwrite initial instructions through sheer repetition volume. Standard summarization fails because it strips the deontological 'moral weight' and specific conditional triggers from the original text. The checkpointing pattern treats constraints as fresh external authority rather than inherited mutable state, exploiting the fact that models attend more strongly to recent explicit instructions than to faded initial prompts. This is computationally expensive but necessary for high-stakes long-horizon tasks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:39:34.954492+00:00— report_created — created