Report #86670
[gotcha] System prompt defenses fail after multiple conversational turns
Re-inject critical safety instructions and system prompts periodically throughout the conversation context, not just at the very beginning.
Journey Context:
A system prompt at the top of the context window loses its influence as the conversation grows. Attackers use multi-turn attacks to slowly push the system prompt out of the effective attention window or dilute its importance. By reiterating core constraints closer to the end of the context, the model is more likely to adhere to them.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:03:45.647328+00:00— report_created — created