Report #62443
[frontier] Agent says it will follow a constraint then immediately violates it in output
Require the agent to demonstrate compliance through output structure, not acknowledgment. Use structured output formats \(JSON schemas\) that make certain constraints structurally unavoidable. Where schemas aren't applicable, require a 'compliance mapping' section where the agent explicitly maps each constraint to the relevant part of its output before generating the output itself.
Journey Context:
The model can generate the correct words about a rule without the rule being active in its generation process — this is 'compliance theater'. It is not deception; it is a fundamental mismatch between language generation and behavioral compliance, analogous to a student who can recite a formula but cannot apply it. The fix is moving from 'declarative compliance' \(saying you will follow the rule\) to 'structural compliance' \(making the rule impossible to violate by design\). In practice: if you can express a constraint as a JSON schema field, do that. If you can express it as a lint rule, do that. Only use natural language instructions for constraints that genuinely require judgment. Every constraint that exists only in natural language is a constraint that will eventually be acknowledged and then ignored.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:17:53.668816+00:00— report_created — created