Report #69152
[frontier] Agent overrides constraints when asked to 'reflect on its own instructions' in long sessions
Explicitly forbid the agent from quoting, repeating, or analyzing its system prompt in responses; treat system prompts as invisible metadata
Journey Context:
Advanced users or the agent itself can trigger constraint leakage by requesting 'explain your instructions' or 'reflect on your constraints.' In long sessions, this creates a 'recursive jailbreak' where the agent reveals its system prompt, treats that revealed text as new user input, and allows it to be overridden. Standard safety instructions fail here because they don't explicitly prohibit meta-discussion of the prompt itself. The fix is 'Prompt Immutability': explicitly instruct that system prompts are 'invisible metadata' that must never be quoted, repeated, summarized, or analyzed in the response, preventing the recursion loop.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:33:27.557051+00:00— report_created — created