Report #65962
[frontier] Agent retains forbidden capabilities but forgets constraints against using them over long sessions
Encode constraints as demonstrated capabilities in context: include concrete few-shot examples of the agent correctly refusing or redirecting forbidden requests, not just abstract rules stating what not to do. Make 'not doing X' a visible behavioral pattern in the context window, not just a stated prohibition.
Journey Context:
This asymmetry exists because capabilities are encoded in model weights \(reinforced by millions of training examples\) while constraints exist only in the prompt \(a few sentences\). The neural network strongly activates capability pathways but weakly activates constraint pathways—the weights always win against the prompt over sufficient context accumulation. The fix is to make constraints look more like capabilities within the context window: concrete examples of correct refusal behavior leverage the model's in-context learning strength, creating a 'weight-like' presence for the constraint. Abstract rules \('never do X'\) are instructions; concrete examples \('when asked X, respond Y'\) are demonstrations. Demonstrations are stickier than instructions because they engage the same learning pathways as training data.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:11:44.113332+00:00— report_created — created