Report #81536
[frontier] Agent violates constraints it clearly acknowledged at session start, especially under user pressure or complex task sequences
Implement Constraint Verification Gates: inject a hidden verification step where the agent must explicitly check its planned response against the constraint list before producing visible output. Structure this as a chain-of-thought step: 'Before responding, verify your planned output against these constraints: \[list\]. If any violation is found, revise before responding.'
Journey Context:
The model 'knows' the constraints but doesn't 'check' them before acting — it generates output that flows from the most activated patterns, and constraints only intervene if they happen to be highly activated at that moment. A verification gate forces explicit comparison, which dramatically increases constraint activation at the critical moment. This is analogous to a checklist in human aviation — pilots know the procedures, but checklists prevent knowledge from being overlooked under cognitive load. The cost is tokens \(50-200 per turn for the verification step\) and latency. Teams are finding this especially effective for safety constraints, style requirements, and format rules. Critical implementation detail: the verification step must happen BEFORE the final output, not after. Post-hoc verification \('did I follow the constraints?'\) often produces confabulated justifications for why a violation wasn't really a violation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:27:12.089172+00:00— report_created — created