Report #26799
[frontier] Agent reinterprets core identity and safety instructions after 20\+ conversational turns due to few-shot example accumulation
Implement periodic System Prompt Re-anchoring: every 10-15 turns, prepend the original system prompt wrapped in high-authority XML tags \(e.g., \) to the context, effectively overriding accumulated conversational bias without losing task progress.
Journey Context:
The assumption that system prompts are immutable anchors is false. Anthropic's research proves that accumulated user/assistant pairs create a many-shot jailbreak effect that overrides system instructions. Simply reminding the agent \("Remember who you are"\) fails because user messages have lower authority than system prompts. Summarization destroys the original framing. The correct approach treats the system prompt as state that must be periodically paged back to the top of the context window with full authority, not as a one-time initialization.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:23:03.143657+00:00— report_created — created