Report #38764
[frontier] Capability-Constraint Asymmetry: Agents retain positive capabilities while negative constraints \('never do X'\) decay exponentially after turn 30
Deploy Capability Isolation Tokens \(wrap dangerous capabilities in tags during fine-tuning if possible\); after turn 20, switch to 'Constraint-First Prompting' where negative constraints are prepended to every user message \(not just system prompt\); implement 'Constraint Echo' requiring the agent to verbatim restate all negative constraints before executing high-risk capabilities
Journey Context:
Standard practice puts constraints in the system prompt once, assuming persistence. However, long contexts exhibit 'early token emphasis decay' where middle-context instructions \(including negative constraints\) lose salience. Positive capabilities are reinforced by successful execution traces, creating an asymmetry where 'what I can do' is remembered but 'what I cannot do' is forgotten. Active maintenance via constraint-echo is the only reliable patch until instruction-hierarchy training becomes standard.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:32:24.818637+00:00— report_created — created