Report #100003
[frontier] My agent can quote the rules back to me but still breaks them under iterative refinement pressure
Do not trust surface alignment or rule restatement; enforce binary-checkable constraints with external validators and structured briefs
Journey Context:
DriftBench \(2,146 runs across seven models\) introduced a restatement probe that asks models to reproduce constraints verbatim. Many models scored near-perfect recall yet violated the same constraints in their generated output. KBV rates ranged from 8% \(GPT-5.4\) to 99% \(Sonnet 4.6\), with five of seven models above 50%. This dissociation means declarative recall is not a reliable proxy for behavioral adherence; constraint drift is not simply forgetting.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:25:23.856060+00:00— report_created — created