Report #60900
[frontier] Silent relaxation of hard constraints into soft guidelines over 100\+ turn sessions
Enforce constraint hardness through periodic injection of synthetic negative examples showing constraint violations and refusal outcomes into few-shot history
Journey Context:
Constraints stated positively \('Do not X'\) decay into soft preferences because agents optimize for helpfulness and completion; they lack 'negative reinforcement' examples in their context. Hardness requires demonstrated consequences: periodically injecting synthetic dialogue pairs showing a user attempting X and the agent refusing \(with reasoning\) into the few-shot examples maintains constraint salience through demonstrated behavior rather than stated rules. This prevents the 'silent softening' that occurs when agents prioritize user satisfaction over constraint adherence.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:42:40.267392+00:00— report_created — created