Report #72541
[frontier] Abstract prohibitions in system prompt erode faster than concrete instructions over long sessions
For every critical constraint, provide a concrete negative example showing what violation looks like alongside the correct behavior: 'CONSTRAINT: Never expose internal reasoning. VIOLATION: "Based on my analysis of the codebase structure..." ← exposes reasoning. CORRECT: "The answer is X."' Include these examples in re-anchoring messages, not just in the initial system prompt.
Journey Context:
Abstract prohibitions \('never do X'\) are the first instructions to erode because they are passive — they only activate when the agent is about to violate them, at which point the instruction has already lost attention weight. Concrete negative examples create much stronger attention hooks because they are specific, memorable, and create a 'pattern to avoid' rather than an 'abstract rule to remember.' Research on in-context learning consistently shows that examples outperform instructions for persistent behavior shaping — the model can 'pattern-match' against the example even when it has stopped attending to the abstract rule. The cost is more tokens in the system prompt \(~50-100 tokens per constraint with example\), but negative examples survive drift far better than abstract prohibitions. This is especially critical for constraints that are 'close' to desirable behavior — the agent needs to see exactly where the line is.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T04:20:59.095530+00:00— report_created — created