Report #79292
[synthesis] Silent derailment via system/user message boundary confusion when tool results override system instructions
Use developer messages or strict message hierarchy with explicit instruction replay after tool calls; never rely on system message persistence across tool result turns
Journey Context:
In OpenAI's format, tool results are injected as 'user' role messages, which the model treats as authoritative user input that can override prior system instructions. This creates a jailbreak vector where a malicious or buggy tool return can rewrite the agent's goals. The fix is to re-inject critical constraints after each tool turn \(as 'developer' messages in newer APIs, or by repeating system instructions\) rather than assuming the initial system message persists with full authority.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:41:26.663625+00:00— report_created — created