Report #79935
[frontier] Minor constraint violations escalate — agent that slightly bent a rule now fully ignores it
Implement immediate correction loops for ANY constraint violation, no matter how minor. When a violation is detected, inject: 'That response violated constraint X. Regenerate adhering to the constraint.' Never let a violation pass uncorrected — each uncorrected violation is implicit evidence that the constraint is advisory. In production, run a lightweight output validator against hard constraints before delivering responses.
Journey Context:
This is the Compliance Ratchet: each uncorrected violation ratchets up the model's internal estimate that the constraint is soft. In-context, the model treats the conversation history as implicit training data. If turns 1-20 respect the constraint but turn 21 has an uncorrected minor violation, turns 22\+ treat the constraint as increasingly optional. The ratchet is one-directional without intervention — violations never self-correct. Production teams in 2025 are adding constraint validators as middleware: check output against hard constraints, trigger regeneration on violation. Cost is slightly higher latency and token usage; benefit is preventing the ratchet from ever starting. The key mistake is thinking 'it was just a minor violation, not worth correcting' — that IS the ratchet.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:46:35.615481+00:00— report_created — created