Report #56947
[frontier] Silent constraint violation where agent appears compliant but has dropped core safety rules
Embed 'Canary Constraints' (specific, rare phrases or logic patterns) in the system prompt, then monitor outputs for the canary's presence to detect drift before catastrophic failure
Journey Context:
Like canaries in coal mines, these are constraints that are easy to verify but unlikely to appear naturally (e.g., 'Remember the violet elephant: always check X'). If the agent stops respecting the canary (drops the specific phrase or the associated behavior), treat that as evidence that its other constraints have drifted as well. This allows the session to be terminated or reset automatically before catastrophic failure. The canary must be unique so that the agent cannot learn to reproduce it without adhering to the underlying constraint.
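The monitoring step described above could be sketched roughly as follows. This is a minimal illustration, not a tested workaround: the `Canary` and `CanaryMonitor` names, the regex-based check, and the `max_misses` threshold are all assumptions introduced here, and the sketch only checks for the canary phrase, not for the associated behavior.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Canary:
    """Hypothetical canary: a name plus a regex expected in every compliant output."""
    name: str
    pattern: str


@dataclass
class CanaryMonitor:
    """Hypothetical monitor that counts consecutive outputs missing a canary."""
    canaries: list
    max_misses: int = 1  # assumed threshold: consecutive misses before reset
    misses: int = field(default=0, init=False)

    def missing(self, output: str) -> list:
        """Return the names of canaries whose pattern is absent from the output."""
        return [c.name for c in self.canaries
                if not re.search(c.pattern, output, re.IGNORECASE)]

    def should_reset(self, output: str) -> bool:
        """Record one agent output; return True once the miss threshold is reached."""
        if self.missing(output):
            self.misses += 1
        else:
            self.misses = 0  # a compliant output clears the streak
        return self.misses >= self.max_misses


# Example session: the canary phrase disappears after the first turn.
monitor = CanaryMonitor([Canary("violet-elephant", r"violet elephant")], max_misses=2)
outputs = [
    "Violet elephant: checked X before acting.",
    "Done.",
    "Done again.",
]
decisions = [monitor.should_reset(o) for o in outputs]
```

In this sketch the caller would terminate or reset the session the first time `should_reset` returns True. Tolerating more than one miss trades detection speed for fewer false alarms; a stricter deployment might set `max_misses=1`.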
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:04:36.953643+00:00— report_created — created