Report #57566
[gotcha] My model's safety training and system prompt prevent harmful outputs
Cap the number of user-provided few-shot examples allowed in context. Implement output-level content filtering, not just input-level. Monitor for unusually long contexts dominated by user-supplied examples. Treat context window length as a security parameter, not just a performance one.
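The checks above can be sketched as a thin guard around the model call. This is a minimal illustration, not a vetted defense. Every name and threshold here (`Message`, `inspect_context`, `guarded_generate`, `MAX_USER_SHOTS`, and so on) is hypothetical. It assumes the application already tags which turns are user-supplied examples, and that `generate` and `output_is_allowed` are callables you provide (your model client and your own output classifier).

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical limits, for illustration only; tune them for your deployment.
MAX_USER_SHOTS = 8          # cap on user-supplied few-shot example turns
MAX_CONTEXT_CHARS = 20_000  # context length treated as a security limit
MAX_SHOT_FRACTION = 0.5     # flag contexts dominated by user examples
REFUSAL = "Request blocked by context policy."


@dataclass
class Message:
    role: str                            # "system", "user", or "assistant"
    content: str
    user_supplied_example: bool = False  # assumed to be set by the application


@dataclass
class ContextReport:
    shot_count: int
    total_chars: int
    shot_fraction: float
    violations: List[str]


def inspect_context(messages: List[Message]) -> ContextReport:
    """Count user-supplied examples and measure how much of the context they fill."""
    shots = [m for m in messages if m.user_supplied_example]
    total_chars = sum(len(m.content) for m in messages)
    shot_chars = sum(len(m.content) for m in shots)
    shot_fraction = shot_chars / total_chars if total_chars else 0.0

    violations: List[str] = []
    if len(shots) > MAX_USER_SHOTS:
        violations.append("too_many_shots")
    if total_chars > MAX_CONTEXT_CHARS:
        violations.append("context_too_long")
    if shot_fraction > MAX_SHOT_FRACTION:
        violations.append("dominated_by_examples")
    return ContextReport(len(shots), total_chars, shot_fraction, violations)


def guarded_generate(
    messages: List[Message],
    generate: Callable[[List[Message]], str],
    output_is_allowed: Callable[[str], bool],
) -> str:
    """Apply the input-side limits first, then filter the model's output."""
    report = inspect_context(messages)
    if report.violations:
        return REFUSAL  # the model is never called on a flagged context
    reply = generate(messages)
    # Output-level check: runs even when the input looked acceptable.
    if not output_is_allowed(reply):
        return REFUSAL
    return reply
```

In practice the `violations` list would also be logged, so that long, example-heavy contexts show up in monitoring even when the request is blocked. Character counts stand in for token counts here only to keep the sketch dependency-free.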
Journey Context:
Safety training was performed on relatively short conversation contexts. When the context is stuffed with many examples of the model complying with harmful requests, in-context learning overwhelms the model's safety training. With 100+ shots, models comply with requests they firmly refuse at 0-5 shots. No system prompt reliably overrides this, because the model is not 'choosing' to be unsafe; it is pattern-matching the dominant signal in its context window. This is a fundamental property of in-context learning, not a patchable bug, and it scales with context length.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:06:49.130364+00:00— report_created — created