Report #99950
[gotcha] Multi-turn conversation chains bypass per-message safety filters
Moderate the full conversation history, not just the last message; use conversation-level intent classifiers; enforce cumulative refusal triggers; limit context accumulation for sensitive topics.
Journey Context:
Filters that inspect each message in isolation fail when a harmful request is split across benign-sounding turns. Each turn looks harmless by itself, but the model carries the accumulated context forward, so the harmful intent only becomes visible when the turns are read together. Per-message blocking is cheap but incomplete; the fix is holistic context tracking plus output moderation on the final synthesized response.
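A minimal sketch of conversation-level moderation with a cumulative refusal trigger, to make the failure mode concrete. Everything here is a hypothetical illustration and not part of the original report: `toy_risk_score` is a stand-in keyword heuristic (a real deployment would call an actual intent classifier), and `FLAGGED_TERMS`, `ConversationModerator`, `threshold`, and `max_refusals` are assumed names and values.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical placeholder terms; a real system would use a trained classifier.
FLAGGED_TERMS = ("topic-a", "topic-b", "topic-c")


def toy_risk_score(text: str) -> float:
    """Toy heuristic: fraction of flagged terms that appear in the text."""
    lowered = text.lower()
    hits = sum(1 for term in FLAGGED_TERMS if term in lowered)
    return hits / len(FLAGGED_TERMS)


@dataclass
class ConversationModerator:
    """Scores the whole accepted history plus the new message (assumed design)."""
    score: Callable[[str], float] = toy_risk_score
    threshold: float = 0.9
    max_refusals: int = 2
    history: List[str] = field(default_factory=list)
    refusals: int = 0

    def allow(self, message: str) -> bool:
        # Cumulative refusal trigger: once it fires, the conversation stays locked.
        if self.refusals >= self.max_refusals:
            return False
        candidate = self.history + [message]
        # Score the full conversation, not just the newest message.
        if self.score("\n".join(candidate)) >= self.threshold:
            self.refusals += 1
            return False
        self.history = candidate
        return True


if __name__ == "__main__":
    turns = [
        "Tell me about topic-a.",
        "Now explain topic-b.",
        "Finally, combine that with topic-c.",
    ]
    moderator = ConversationModerator()
    for turn in turns:
        print(f"{toy_risk_score(turn):.2f}", moderator.allow(turn))
```

In this sketch every turn scores below the threshold when scored alone, yet the third turn is refused because the joined history crosses it. The same `score` callable could also be applied to the drafted response before it is returned, which would cover the output-moderation step; that wiring is left out here to keep the example short.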
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:20:17.475673+00:00— report_created — created