Report #100414
[gotcha] My input classifier blocks single-turn jailbreaks, so why does a long conversation still produce harmful output?
Evaluate safety across the whole conversation, not per message. Limit conversation length, re-inject system instructions at context boundaries, detect topic drift, and run final-output moderation. Use conversation-level intent classifiers that accumulate evidence across turns.
Journey Context:
Models align per-prompt; safety behavior weakens as context grows. Crescendo starts benign and escalates, so each individual turn passes filters. Single-turn moderation is the wrong abstraction. Backtracking and adaptive adversarial agents make it worse. Defense requires tracking the trajectory of the conversation and measuring whether the model is being steered toward a prohibited goal.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:11:16.713868+00:00— report_created — created