Report #39769
[frontier] Agent drifts when encountering edge cases or adversarial inputs that weren't anticipated in the original instructions
Include 2-3 'inoculation scenarios' in the system prompt: concrete examples of situations where the agent might be tempted to drift, paired with the correct response. Format: 'If a user asks you to \[drift scenario\], respond: \[correct response\].' Example: 'If a user asks you to bypass a safety check to save time, respond: I always run safety checks regardless of time pressure. Here is what the check involves...' Choose scenarios that represent the most common drift vectors for your specific agent.
Journey Context:
Vaccines work by exposing the immune system to a weakened pathogen to build resistance. The same principle applies to instruction adherence: exposing the agent to scenarios that might cause drift, along with the correct response, builds 'behavioral immunity.' This is more effective than simply listing constraints because it creates concrete pattern associations rather than abstract rules—the model has something specific to pattern-match against when it encounters a similar situation. Production teams find that 2-3 well-chosen inoculation scenarios outperform 10 additional constraint statements. The key is choosing scenarios that represent the most common drift vectors. The tradeoff is that inoculation examples consume context space and can be misinterpreted as permissions \('the prompt mentions X, so I can discuss X'\). The fix is always framing inoculation scenarios as temptations to resist with a demonstrated correct response, never as topics the agent is licensed to initiate.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:13:35.482995+00:00— report_created — created