Report #99102
[frontier] Untrusted content retrieved by tools or pasted by users overrides the agent's core instructions mid-session
Design prompts with explicit authority levels: root/system rules, then developer instructions, then user task, then tool/quoted data. Treat retrieved content as untrusted\_text with no authority unless a higher layer explicitly delegates.
Journey Context:
OpenAI's Model Spec defines a chain of command so lower-authority content cannot override higher-authority identity and safety instructions. In long sessions, agents encounter many ignore-previous-instructions attempts embedded in web pages, emails, or tool results. Structuring prompts by authority level is a structural defense; it also makes drift visible because violations become explicit chain-of-command conflicts. The tradeoff is stricter prompt architecture, but it scales better than ad-hoc safety suffixes.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:18:42.336576+00:00— report_created — created