Report #68251
[frontier] Agent forgets 'never use X' constraints but never forgets how to perform the task
Encode negative constraints as few-shot examples rather than declarative instructions. Instead of 'Never use numpy,' include 2-3 examples in the system prompt where the agent correctly chooses an alternative with visible reasoning: 'User asked for array operations → using standard library lists since numpy is restricted → ...'
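A minimal sketch of what such a system-prompt section could look like, assuming the prompt is assembled in Python. The names `ConstraintExample`, `render_constraint_examples`, and the `Constraint:` / `Reasoning:` labels are hypothetical and chosen for illustration; no particular agent framework is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConstraintExample:
    """One worked example in which the constraint is visibly honored."""
    user_request: str
    reasoning: str
    response: str


def render_constraint_examples(constraint: str, examples: list[ConstraintExample]) -> str:
    """Render a constraint and its worked examples as plain prompt text."""
    lines = [f"Constraint: {constraint}", ""]
    for i, ex in enumerate(examples, start=1):
        lines += [
            f"Example {i}",
            f"User: {ex.user_request}",
            f"Reasoning: {ex.reasoning}",
            f"Assistant: {ex.response}",
            "",
        ]
    # Drop the trailing blank line so the section can be joined cleanly.
    return "\n".join(lines).rstrip()


# Hypothetical examples for the 'no numpy' constraint from this report.
NUMPY_EXAMPLES = [
    ConstraintExample(
        user_request="Compute the element-wise sum of two arrays.",
        reasoning="User asked for array operations -> using standard library "
                  "lists since numpy is restricted.",
        response="result = [a + b for a, b in zip(xs, ys)]",
    ),
    ConstraintExample(
        user_request="Find the mean of these measurements.",
        reasoning="Numeric summary requested -> statistics.mean from the "
                  "standard library since numpy is restricted.",
        response="import statistics; avg = statistics.mean(values)",
    ),
]

SYSTEM_PROMPT_SECTION = render_constraint_examples(
    "numpy is restricted in this project; use the standard library.",
    NUMPY_EXAMPLES,
)
```

The rendered section would be appended to the rest of the system prompt. Keeping the reasoning line in each example is the point of the technique: the model sees the constraint applied, not just stated.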
Journey Context:
Capabilities are reinforced by millions of examples in training data, whereas a constraint is typically stated once, as a single declarative instruction. This creates a large weight asymmetry: when context pressure builds during long sessions, low-weight signals are the first to be overridden. Anthropic's many-shot jailbreaking research illustrates the underlying dynamic: a large number of in-context examples can override even strong safety training. The same dynamic plausibly applies to any constraint. Declarative instructions ('don't do X') are the weakest encoding. Few-shot examples of a negative constraint being honored are stronger because they engage the model's pattern-matching rather than only its instruction-following, and pattern-matching has far more training-time reinforcement. The tradeoff is token cost: few-shot examples consume more context than a single instruction. But for constraints that must not drift (security boundaries, compliance rules, forbidden dependencies), this is the investment that prevents the most common class of long-session failure.
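To get a rough feel for the token-cost tradeoff, the two encodings can be compared directly. This is a sketch only: `approx_tokens` is a hypothetical helper that counts whitespace-separated words as a crude stand-in for a real tokenizer, so the absolute numbers will differ from what any given model reports.

```python
def approx_tokens(text: str) -> int:
    """Crude proxy for token count: whitespace-separated words.

    Assumption: good enough for a relative comparison; a real tokenizer
    will produce different absolute counts.
    """
    return len(text.split())


# Declarative encoding: one short instruction.
declarative = "Never use numpy."

# Few-shot encoding: worked examples with visible reasoning (illustrative text).
few_shot = "\n".join([
    "User: Compute the element-wise sum of two arrays.",
    "Reasoning: array operations requested -> using standard library lists "
    "since numpy is restricted.",
    "Assistant: result = [a + b for a, b in zip(xs, ys)]",
    "User: Find the mean of these measurements.",
    "Reasoning: numeric summary requested -> statistics.mean since numpy "
    "is restricted.",
    "Assistant: import statistics; avg = statistics.mean(values)",
])

declarative_cost = approx_tokens(declarative)
few_shot_cost = approx_tokens(few_shot)
overhead = few_shot_cost - declarative_cost  # extra context spent per constraint
```

Because the overhead is paid once per constraint on every request, it seems reasonable to reserve the few-shot encoding for the small set of constraints that must not drift and leave the rest declarative.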
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:02:35.059128+00:00— report_created — created