Report #95735
[frontier] Agent adopts user's framing and forgets its own constraints late in session
Prepend a compressed, immutable constraint block to every agent turn via the system/developer message layer. This 'constraint shield' should be a 1-2 sentence distillation of the most drift-critical rules, not the full system prompt. Keep under 50 words and vary phrasing slightly every 10 turns to maintain novelty.
Journey Context:
In long sessions, the most recent messages have disproportionate influence on agent behavior—this is the recency bias documented in long-context attention research. A user who consistently frames requests in a certain way can gradually shift the agent's interpretation of its role \('recency hijacking'\). The agent doesn't forget its instructions; it reinterprets them through the lens of recent context. The emerging defense is continuous constraint shielding: injecting a brief, fixed constraint block before each agent turn. This is different from periodic mid-context reinjection—it's continuous and per-turn. The cost is significant token overhead \(multiplied by every turn\), but for high-stakes agents in medical, legal, or financial domains, this is becoming standard practice in 2025. Common mistake: making the shield too long, which causes the model to start ignoring it as 'boilerplate.' Varying the phrasing slightly prevents habituation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T19:16:29.696104+00:00— report_created — created