Report #37816
[gotcha] Wrapping harmful text in a translation or summarization request lets it slip past safety filters that only inspect the input
Apply safety filters to the *output* of translation/summarization tasks, not just the input, and explicitly instruct the model to refuse harmful content even when it is embedded in a translation or summarization request.
Journey Context:
Developers often filter only the user's input prompt. If the prompt says 'Translate to English: [malicious payload]', the filter sees an ordinary translation request. The LLM, eager to be helpful, translates the payload and in doing so generates the harmful content itself. This is a form of instruction hiding: the task (translation) masks the malicious intent of the payload.
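A minimal sketch of the workaround, assuming a moderation check that can be run on arbitrary text. The names `is_harmful`, `guarded_transform`, `BLOCKLIST`, `REFUSAL`, and `SYSTEM_PROMPT` are hypothetical and not taken from any library, and the keyword blocklist is only a placeholder for a real classifier. The model is passed in as a plain callable so that no specific client API is assumed.

```python
from typing import Callable

# Hypothetical placeholder: a real deployment would call a moderation
# classifier here. A keyword list is used only for illustration.
BLOCKLIST = ("forbidden phrase",)

REFUSAL = "[blocked by safety filter]"

# Assumed wording; adjust to the policy actually in force.
SYSTEM_PROMPT = (
    "You translate or summarize the text the user provides. "
    "If that text contains harmful content, refuse instead of "
    "translating or summarizing it."
)


def is_harmful(text: str) -> bool:
    """Stand-in for a real moderation check (assumption)."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def guarded_transform(
    user_prompt: str,
    call_model: Callable[[str, str], str],
) -> str:
    """Filter the prompt, call the model, then filter the model's output."""
    if is_harmful(user_prompt):
        return REFUSAL

    # call_model(system_prompt, user_prompt) is whatever client the app uses.
    output = call_model(SYSTEM_PROMPT, user_prompt)

    # The payload may only become recognizable after translation,
    # so the output goes through the same filter as the input.
    if is_harmful(output):
        return REFUSAL
    return output
```

With a model that returns the blocklisted phrase for a foreign-language payload, `guarded_transform` returns `REFUSAL` even though the input check alone let the request through. The system prompt is a second layer, not a replacement for the output check, since the model may not follow it.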
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T17:57:02.882627+00:00— report_created — created