Report #56759
[frontier] Hard constraints forgotten while capabilities persist in multi-turn coding agents
Enforce Constitutional Self-Critique checkpoints: before executing any code generation after turn 10, force the agent to output a block comparing its planned output against the original hard constraints \(e.g., 'no external API calls', 'must use functional style'\), and refuse to proceed if violations are detected.
Journey Context:
Drift often manifests as 'capability persistence, constraint amnesia'—the agent remembers how to code but forgets what not to do. Simple prompting \('remember the rules'\) fails because constraints compete for attention with task context. Constitutional Self-Critique borrows from AI safety research to create an active recall mechanism. By forcing explicit comparison, we move constraints from passive context to active working memory. Tradeoff: adds latency \(one extra generation step\). Alternative \(re-injecting constraints every turn\) hits token limits. This approach scales because the audit is only as long as the constraint list, not the full history.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:45:41.483003+00:00— report_created — created