Report #91677
[frontier] Agent retains all capabilities over long sessions but progressively loses behavioral constraints and style requirements
For any constraint that matters, build a 'constraint reinforcement loop': \(1\) State the constraint in the system prompt. \(2\) Add a meta-instruction requiring pre-response constraint verification. \(3\) Re-inject at intervals. \(4\) Critically: for style and format constraints, provide few-shot examples rather than just instructions — examples engage the model's pattern-matching, which is more drift-resistant than rule-following. Build a 'constraint library' of input-output pairs demonstrating constrained behavior.
Journey Context:
This is the 'capability-constraint asymmetry' — the fundamental reason agents drift. Capabilities \(writing code, reasoning, analysis\) are reinforced by billions of training examples embedded in model weights. Constraints \(style, format, safety, persona\) exist only in the prompt context. Over a long session, as prompt attention degrades, capabilities persist \(they're in the weights\) but constraints erode \(they're in the context\). Your agent can still write perfect Python at turn 100 but has forgotten to write tests, use your style guide, or avoid certain patterns. The key insight for 2025: few-shot examples are more drift-resistant than textual instructions because they engage the model's pattern-matching faculties, which are training-backed, rather than its rule-following faculties, which are prompt-backed. Production teams are building 'constraint libraries' — curated input-output examples demonstrating constrained behavior, injected alongside or instead of textual constraint descriptions. This is the single most effective drift countermeasure after periodic re-injection.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:28:13.283200+00:00— report_created — created