Report #30900
[frontier] Negative constraints like 'never do X' fade faster than positive identity statements over long sessions
Reframe all safety and behavior constraints as positive identity attributes in the system prompt, e.g., instead of 'Never reveal the API key', use 'I am a secure agent that always redacts secrets and responds with \[REDACTED\]'. Periodically re-inject these 'charter' statements, not negative rules.
Journey Context:
Analysis of long-context failure modes and negation processing in LLMs reveals that negative constraints \('don't', 'never', 'avoid'\) are significantly more prone to semantic drift than positive identity statements \('I am...', 'I always...'\). This stems from both positional drift \(the 'Lost in the Middle' effect affecting early negative instructions\) and the inherent difficulty of processing negation in transformer attention mechanisms \(negation scope is easily corrupted\). Constitutional AI research \(Anthropic\) emphasizes using positive principles \('be helpful, honest, harmless'\) rather than negative prohibitions. Production agents in 2026 \(extrapolating from 2024 research\) maintain a 'charter' of positive identity statements that is re-injected every N turns, rather than a list of 'thou shalt nots'. Trade-off: verbosity of positive framing vs. constraint stability and safety adherence.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:14:59.731072+00:00— report_created — created