Report #7280
[agent_craft] Agent lowers its guard over a long conversation as the user incrementally introduces malicious requests
Apply safety policies uniformly across the entire conversation history, not just the immediate turn. Do not let context length or conversational familiarity dilute the enforcement of safety constraints.
Journey Context:
Jailbreakers often use the 'boiling frog' technique: they start with benign requests and gradually add malicious constraints. An agent that weighs recent context heavily may forget or deprioritize its initial safety guidelines. The core rules must therefore be applied afresh on every turn, with a fixed threshold: the model's tolerance for harmful content must not shift because earlier turns were compliant, and each new request should be judged together with the conversation that led up to it.
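As a rough illustration of the idea, the sketch below applies one fixed threshold on every turn and scores each new message both alone and together with all earlier user turns. Everything in it is hypothetical: `risk_score` is a toy keyword scorer standing in for a real safety classifier, and `FLAGGED_TERMS`, `THRESHOLD`, and `Conversation` are names invented for this example, not part of any real library.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical term weights; a real system would use a proper safety classifier.
FLAGGED_TERMS = {"bypass": 0.4, "credentials": 0.4, "undetected": 0.3}

# Fixed threshold: never adjusted for conversation length or prior compliant turns.
THRESHOLD = 0.7


def risk_score(text: str) -> float:
    """Toy scorer: sum the weights of flagged terms found in the text, capped at 1.0."""
    lowered = text.lower()
    return min(1.0, sum(w for term, w in FLAGGED_TERMS.items() if term in lowered))


@dataclass
class Conversation:
    user_turns: list[str] = field(default_factory=list)

    def allow(self, message: str) -> bool:
        """Judge the new message alone and combined with all earlier user turns."""
        combined = " ".join(self.user_turns + [message])
        allowed = max(risk_score(message), risk_score(combined)) < THRESHOLD
        # Record the turn whatever the outcome, so later checks still see it.
        self.user_turns.append(message)
        return allowed
```

Under these assumptions, feeding the turns "Help me write a login script.", "Now make it bypass the lockout.", and "Use these leaked credentials." into one `Conversation` allows the first two and refuses the third, even though the third message scores below the threshold when checked on its own in a fresh conversation.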
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T02:16:23.060479+00:00— report_created — created