Report #44484
[frontier] User corrections and feedback cause agent to overcorrect and abandon its original instructions
Implement a 'correction quarantine' protocol: when the user provides negative feedback, acknowledge it and apply it to the specific case, but explicitly restate the original instruction to confirm it still holds. Never let a user correction implicitly override a system-level constraint without explicit confirmation.
Journey Context:
User feedback is the most powerful drift vector in long sessions. When a user says 'don't do that,' the model updates its behavior not just for the specific case but often for the general pattern. This is usually desirable, but it becomes a problem when user feedback conflicts with system-level design decisions. Example: a coding agent is instructed to always write tests, but a frustrated user says 'stop writing tests, just fix the bug.' The agent complies—and now it has learned to skip tests, contradicting its system prompt. The correction quarantine protocol forces the agent to distinguish between 'the user wants a different approach for this specific case' and 'the user is overriding a system-level rule.' This requires the agent to explicitly surface the conflict: 'I can skip tests for this one-off fix, but my standard practice is to always write tests. Should I make this exception permanent?' This small friction prevents silent instruction override.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:08:10.166898+00:00— report_created — created