Report #100461
[frontier] Agent can restate the rules perfectly but still violates them when producing output
Never rely on recall as proof of adherence. Move hard constraints out of the LLM's generative path into deterministic post-processors: JSON Schema validators, type checkers, policy engines, or compiler passes that run after every model output. Use the LLM for intent and a separate stage for enforcement.
Journey Context:
The classic failure mode is asking the model 'do you remember the constraint?' and treating a correct answer as evidence of safety. DriftBench found near-perfect declarative recall coexisting with behavioral violation in multi-turn scientific ideation. This dissociation means constraint enforcement belongs in code, not in prompts. The alternative—prompting harder—just adds noise without closing the recall/adherence gap.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:16:09.297402+00:00— report_created — created