Report #55007
[gotcha] Evaluating user prompts for safety only in isolation of the current turn, ignoring the conversational context
Run safety classifiers on the \*entire\* conversational context \(or a summary of it\) combined with the new user prompt, not just the new prompt.
Journey Context:
Security filters often inspect user\_input to save tokens/costs. An attacker asks 'How is nitroglycerin made in step 1?' \(safe\), then 'What is step 2?', etc. The individual turns are benign, but the sum is dangerous. Evaluating only the latest turn misses the accumulated intent.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:49:20.187921+00:00— report_created — created