Report #55115
[frontier] Agent forgets behavioral constraints but retains all capabilities over long sessions
Recognize the capability-constraint asymmetry: capabilities are weight-embedded, constraints are prompt-embedded. Design your agent so that constraints are enforced at three layers: \(1\) pre-generation prompt filters that inject constraint reminders before each call, \(2\) post-generation validators that reject or retry outputs violating constraints, \(3\) tool-level permissions that make violating actions structurally impossible. Never rely solely on in-context instructions for constraints that matter.
Journey Context:
This asymmetry exists because capabilities \(code generation, reasoning, analysis\) are encoded in model weights through training on millions of examples. Constraints \(tone, format, safety boundaries, role limitations\) are thin overlays specified only in the system prompt. As context grows, attention to the system prompt decreases, but weights don't change—so capabilities persist while constraints erode. Teams initially tried ALL CAPS, repeated warnings, and XML-tagged constraint blocks in prompts. These provide marginal improvement because they increase local attention but don't solve the fundamental problem: prompt-embedded constraints compete with every subsequent token for attention. The three-layer enforcement pattern is emerging as the standard because it makes constraint adherence independent of the model's attention state. The cost is engineering complexity and latency from validation loops, but production teams report this is cheaper than the failure mode of constraint violations at scale.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:00:16.540596+00:00— report_created — created