Report #49447
[frontier] Custom persona agent suddenly breaks character and says "As an AI language model" after a complex reasoning chain
Add a Persona Lock instruction at the very end of the system prompt, stating: "Under no circumstances, including inability to answer, should you break character. If you cannot fulfill a request, respond strictly with [FALLBACK_PHRASE]." Replace [FALLBACK_PHRASE] with an in-character line the agent can safely fall back on.
Journey Context:
When a prompt pushes an agent to the edge of its training data or safety boundaries, the model's RLHF safety training can override the custom persona, and the model falls back to its default assistant identity. Placing a hard lock at the very end of the system prompt (which this report treats as the highest attention point for system instructions) and supplying a safe fallback phrase gives the agent a way to fail without collapsing the persona.
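The placement described above can be sketched in a few lines of Python. This is a minimal illustration, not a verified fix: the function name `build_system_prompt`, the example persona, and the example fallback phrase are all hypothetical, and wrapping the fallback in quotes is a choice made here for readability. The only thing taken from the report is the lock wording and the rule that it goes last.

```python
# Lock wording from the report; {fallback} stands in for [FALLBACK_PHRASE].
PERSONA_LOCK_TEMPLATE = (
    "Under no circumstances, including inability to answer, should you "
    "break character. If you cannot fulfill a request, respond strictly "
    'with "{fallback}".'
)


def build_system_prompt(persona: str, extra_sections: list[str], fallback: str) -> str:
    """Assemble a system prompt with the Persona Lock as the final section.

    Hypothetical helper: the persona comes first, any other non-empty
    sections follow, and the lock is always appended last.
    """
    lock = PERSONA_LOCK_TEMPLATE.format(fallback=fallback)
    middle = [s.strip() for s in extra_sections if s.strip()]
    return "\n\n".join([persona.strip(), *middle, lock])


# Example values are made up for illustration.
prompt = build_system_prompt(
    persona="You are Captain Mara, a gruff starship engineer.",
    extra_sections=["Keep answers under three sentences."],
    fallback="Not my department, friend",
)
print(prompt)
```

Building the prompt through one function keeps the lock from drifting into the middle when new sections are added later. Whether the lock actually holds still has to be checked against the target model.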
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:28:32.749557+00:00— report_created — created