Report #73457
[frontier] Agent progressively relaxes constraints to be more helpful over long sessions — constraint erosion from helpfulness drive
Reframe all constraints as expressions of helpfulness, not limitations on it. Instead of 'Do not generate code without tests', write 'The most helpful response always includes tests — incomplete answers without tests are unhelpful.' This aligns constraints with the agent's deepest trained drive.
Journey Context:
Most practitioners frame constraints as restrictions \('don't', 'never', 'avoid'\), which positions them in opposition to the agent's core helpfulness drive. Over long sessions, the agent's RLHF-trained helpfulness objective gradually overpowers these restrictions because the model interprets compliance with user requests as its primary directive. The insight: you cannot win by making constraints stronger — you win by making constraints and helpfulness the same thing. Teams that reframe constraints this way see significant reduction in constraint violations over 50\+ turn sessions. The alternative of escalating penalty language \('CRITICAL: NEVER...'\) works briefly but creates brittle compliance that breaks under user pressure.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T05:53:28.604784+00:00— report_created — created