Report #92034
[frontier] Negative instructions like 'don't use var' or 'never explain' get ignored after many turns
Reframe all negative constraints as positive actions. 'Don't use var' becomes 'use const for immutable bindings, let for mutable ones.' 'Never explain' becomes 'respond with code only, no prose.' Pair each positive reframing with a procedural self-check the model runs before replying, for example 'confirm every variable is declared with const or let.'
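A minimal sketch of how the reframed rules and their self-checks might be kept together and rendered into a system prompt. The `Constraint` class, the `CONSTRAINTS` list, `render_constraints`, and the exact checklist wording are all hypothetical illustrations, not part of any particular tool; the two rules are the reframings given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    """One positively framed rule plus the self-check the model runs before replying."""
    action: str      # what to DO: a behavior that leaves a visible signal in the output
    self_check: str  # procedural check, phrased as a question


# Hypothetical encoding of the two reframings from this report.
CONSTRAINTS = [
    Constraint(
        action="Use const for immutable bindings and let for mutable ones.",
        self_check="Is every variable declared with const or let?",
    ),
    Constraint(
        action="Respond with code only, no prose.",
        self_check="Is the entire reply a single code block?",
    ),
]


def render_constraints(constraints: list[Constraint]) -> str:
    """Render numbered rules followed by a pre-send checklist as system-prompt text."""
    lines = ["Rules:"]
    lines += [f"{i}. {c.action}" for i, c in enumerate(constraints, start=1)]
    lines.append("Before sending each reply, verify:")
    lines += [f"- {c.self_check}" for c in constraints]
    return "\n".join(lines)


print(render_constraints(CONSTRAINTS))
```

Keeping each action and its check in one record is a design choice meant to make it hard to add a rule without also adding the check that goes with it.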
Journey Context:
Negative instructions are uniquely fragile in long contexts for three reasons: (1) they require active suppression rather than generation: the model must remember NOT to do something, and nothing in the output reinforces that; (2) they only matter at the moment a violation would occur, which may come many turns after the instruction was given, when attention to it has already weakened; (3) RLHF training creates a strong prior toward being comprehensive and helpful, which negative constraints such as 'never explain' directly oppose. When attention on the negative constraint decays, the RLHF prior fills the gap.

Positive reframing works because it gives the model an active behavior to perform, something to generate rather than suppress. 'Use const' produces a positive signal in the output that reinforces the constraint. 'Don't use var' produces nothing when it succeeds (the absence of var is invisible), so there is no reinforcement loop.

This is one of the highest-impact, lowest-cost fixes for instruction drift: audit your system prompt for negative instructions and reframe every one.
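The audit step can be partly automated. The sketch below flags system-prompt lines containing common negative phrasings so they can be reframed by hand. The `NEGATIVE` pattern, `audit_prompt`, and the sample prompt are hypothetical; the keyword list is a heuristic assumption, not an exhaustive definition of a negative instruction.

```python
import re

# Hypothetical heuristic: common negative-instruction phrasings (straight or curly apostrophe).
NEGATIVE = re.compile(r"\b(?:don['’]t|do not|never|avoid|must not)\b", re.IGNORECASE)


def audit_prompt(prompt: str) -> list[tuple[int, str]]:
    """Return (1-based line number, stripped text) for each line with negative phrasing."""
    return [
        (n, line.strip())
        for n, line in enumerate(prompt.splitlines(), start=1)
        if NEGATIVE.search(line)
    ]


prompt = """You are a JavaScript assistant.
Don't use var.
Never explain your code.
Use 2-space indentation."""

for n, line in audit_prompt(prompt):
    print(f"line {n}: {line}")
```

A keyword scan like this will miss some phrasings (it does not flag the 'no prose' fragment in the reframing above) and can flag legitimate text, so treat its output as a review list rather than a verdict.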
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:04:18.925835+00:00 - report_created - created