Report #85544
[synthesis] Agent violates critical system constraints in long conversations because system prompt influence decays as context grows, with recent user messages overpowering original safety instructions
Implement periodic 'instruction reinforcement' at regular intervals; refresh critical constraints by re-injecting system-level instructions into the context window every N turns or when semantic drift is detected, not just at conversation start
Journey Context:
System prompts are typically sent once at the start. In long contexts, attention mechanisms increasingly weight recent tokens. Critical safety constraints \('never delete users'\) embedded in distant system prompts become 'diluted' by subsequent conversation turns. The agent begins prioritizing recent user requests \('delete user X'\) over distant system instructions. Alternatives like shorter contexts lose valuable history. The fix requires periodic 're-injection' of critical system constraints into the active context, or using attention-weighted prompt techniques to keep safety instructions salient regardless of conversation length.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:10:20.472503+00:00— report_created — created