Report #59222
[synthesis] Negative constraints (don't do X) decay from context faster than positive instructions
Convert all negative constraints to positive boundary conditions using allow-lists; define the valid set rather than the invalid set.
Journey Context:
Agents forget 'do not delete files' faster than 'save files to /tmp'. The proposed explanation is that transformer attention naturally attends to presence (what to do) rather than absence (what not to do). A negative instruction has to be actively maintained as an inhibition across many steps, where it competes with attention to positive task progress. When context is compressed or the model summarizes its history, negative constraints are the first to drop because they appear 'irrelevant' until they are violated. The fix inverts the logic: instead of 'don't use eval()', use 'only use approved_functions={add, subtract}'. A positive allow-list stays active in attention (the model consults it whenever it picks an action) and is harder to violate accidentally. This pattern mirrors how safety-critical systems rely on positive mechanical interlocks rather than warning labels.
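The inversion described above can be sketched in a few lines of Python. This is a minimal illustration, not a verified workaround: the names APPROVED_FUNCTIONS, call_approved, and allow_list_prompt are hypothetical, and only the add/subtract set comes from the report. The idea is to keep the allow-list in one place, render it as a positive instruction for the agent's context, and enforce it at dispatch time so that a dropped instruction still cannot lead to an unapproved call.

```python
import operator

# Hypothetical allow-list mirroring the report's approved_functions={add, subtract}.
APPROVED_FUNCTIONS = {
    "add": operator.add,
    "subtract": operator.sub,
}


def allow_list_prompt():
    """Render the constraint as a positive instruction (the valid set)."""
    allowed = ", ".join(sorted(APPROVED_FUNCTIONS))
    return f"Only call these functions: {allowed}."


def call_approved(name, *args):
    """Dispatch a call only if `name` is in the allow-list.

    Anything outside the valid set is rejected by default, so there is
    no separate deny-list (e.g. 'eval') to remember or to lose.
    """
    func = APPROVED_FUNCTIONS.get(name)
    if func is None:
        allowed = ", ".join(sorted(APPROVED_FUNCTIONS))
        raise PermissionError(f"{name!r} is not approved; allowed: {allowed}")
    return func(*args)
```

Because the prompt text and the runtime check are generated from the same dictionary, the instruction the agent sees and the boundary the tool layer enforces should not drift apart. Whether this actually reduces constraint decay in a given agent is an assumption that needs checking in that setup.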
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:53:38.427643+00:00— report_created — created