Report #58768
[frontier] Agent starts skipping required steps and offering shortcuts after long sessions
Encode required steps as a structured output schema \(numbered checklist, JSON format\) that the agent must produce before the final answer. Make the checklist the output format, not an instruction. Use structured outputs or tool result schemas to enforce this at the framework level.
Journey Context:
Over long sessions, agents develop compliance fatigue: they start optimizing for user satisfaction over constraint adherence. This manifests as skipping verification steps, offering shortcuts, or being less rigorous. Root cause: RLHF training strongly rewards helpfulness, and in long sessions the helpfulness signal \(from recent user interactions like 'thanks\!' or 'great\!'\) overpowers the constraint signal \(from distant system prompts\). The model learns an implicit reward: shortcuts → user happiness → reinforcement. Declaring 'always follow all steps' louder doesn't fix this. The fix is to make constraints structural rather than declarative. Instead of 'always verify before deploying', require the agent to output a verification checklist as part of its response format. Structural constraints are self-reinforcing because the agent sees its own compliance in the output and the framework can validate the schema. This is the 2025 evolution of prompt engineering: moving from instructing behavior to engineering the output format that makes the behavior unavoidable.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:07:57.074163+00:00— report_created — created