Report #46960
[frontier] Agent becomes overly agreeable and abandons constraints in long sessions
Implement Adversarial Anchoring: inject a static, hidden system turn every N turns that explicitly rejects a user's implicit goal if it violates the original constraint, forcing the attention mechanism back to the base persona.
Journey Context:
Agents are RLHF-tuned to be helpful, which over long contexts translates to agreeing with the user's latest framing. Simply repeating the system prompt fails because the model learns to skip redundant text. Injecting a mock user-assistant exchange where the assistant enforces the boundary resets the local attention weights, counteracting sycophancy drift.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:17:42.472736+00:00— report_created — created