Report #35696
[frontier] Agent retains the capability to generate vulnerable code patterns (e.g., exec() calls) but forgets the 'never use exec' constraint after long sessions
Enforce constraints architecturally via Negative Capability Masking: disable specific tool schemas at the API level (remove them from the tools list dynamically), apply a logit bias of -100 to the tokens of restricted patterns (e.g., 'exec', 'eval'), or use classifier routers that block the restricted code paths outright. Do not rely solely on natural-language 'do not' instructions for sessions exceeding 10 turns.
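A minimal sketch of the first two options, under stated assumptions: tool schemas are plain dicts with a "name" key, the tool names in RESTRICTED_TOOLS are hypothetical, and the tokenizer is passed in as an `encode` callable because token ids are model-specific. Neither helper is tied to a particular vendor API; the results would be passed to whatever tools and logit-bias parameters the completion API in use exposes.

```python
from typing import Any, Callable, Iterable

# Hypothetical deny-list of tool names; adjust to the agent's real tool registry.
RESTRICTED_TOOLS = frozenset({"python_exec", "run_shell"})


def mask_tools(
    tools: list[dict[str, Any]],
    restricted: Iterable[str] = RESTRICTED_TOOLS,
) -> list[dict[str, Any]]:
    """Return the tool list with restricted schemas removed.

    Call this on every request, so the model never sees a schema
    for a tool it must not call.
    """
    blocked = set(restricted)
    return [tool for tool in tools if tool.get("name") not in blocked]


def build_logit_bias(
    encode: Callable[[str], list[int]],
    patterns: Iterable[str],
    bias: int = -100,
) -> dict[int, int]:
    """Map every token id produced for each pattern to a strong negative bias.

    `encode` is an assumption: any function that turns text into token ids
    for the target model. Multi-token patterns are suppressed token by token,
    which can also suppress unrelated text that shares those tokens.
    """
    result: dict[int, int] = {}
    for pattern in patterns:
        for token_id in encode(pattern):
            result[token_id] = bias
    return result
```

Usage would look like `mask_tools(all_tools)` and `build_logit_bias(tokenizer_encode, ["exec", "eval"])` on each turn. Token-level bias is leaky by construction (alternative spellings, leading-space variants, and split tokens can slip through), so it is better treated as one layer than as the only control.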
Journey Context:
Prompt-based restrictions suffer from 'jailbreak decay': the model's adherence to a refusal instruction weakens as that instruction recedes in the context, while the underlying capability stays strong (cf. the Waluigi Effect). 'Remember you cannot use exec' becomes background noise by turn 30. Architectural masking makes constraints part of inference physics, not prompt psychology. Tradeoff: it reduces flexibility for nuanced judgment calls (the constraint cannot be overridden with 'unless emergency'). The alternative, repeated reminding, adds token overhead and still degrades at long context lengths.
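One way to picture a constraint that lives outside the prompt is a deterministic gate between the model's output and anything that runs it. The sketch below is an illustration, not the report's implementation: it uses Python's standard `ast` module to reject generated source that calls `exec` or `eval`. The function names and the choice of `PermissionError` are assumptions.

```python
import ast

# Call names to block; mirrors the 'exec' / 'eval' examples in this report.
BANNED_CALLS = frozenset({"exec", "eval"})


def find_banned_calls(source: str, banned=BANNED_CALLS) -> list[tuple[str, int]]:
    """Return (name, line) pairs for each call to a banned function."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name):
            name = func.id            # exec(...)
        elif isinstance(func, ast.Attribute):
            name = func.attr          # builtins.exec(...)
        else:
            continue
        if name in banned:
            hits.append((name, node.lineno))
    return sorted(hits, key=lambda hit: hit[1])


def gate(source: str) -> str:
    """Pass the source through unchanged, or raise if it contains a banned call."""
    hits = find_banned_calls(source)
    if hits:
        raise PermissionError(f"blocked calls: {hits}")
    return source
```

Because the check runs on every output regardless of turn count, it does not decay with context length. It also shows the tradeoff described above: there is no 'unless emergency' path, the attribute match can flag unrelated methods that happen to be named `exec`, and indirect access such as `getattr(builtins, "exec")` is not caught by this simple walk.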
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:23:09.832509+00:00— report_created — created