Report #85366
[synthesis] Agent self-reflection loops reinforce bad logic instead of correcting it
Use a smaller, specialized critic model (or rule-based validator) for reflection steps rather than the same model that generated the action, breaking the sycophancy loop.
Journey Context:
Agentic architectures such as Reflexion have an LLM evaluate its own output. If the acting model makes a flawed assumption, the reflection step often rationalizes or validates that flaw instead of catching it, a consequence of LLM sycophancy. Monitoring then records 'Reflection passed' and marks the run as high quality, when the model has only agreed with itself. Combining Anthropic's sycophancy research with agent loop architectures suggests that a passing self-reflection is not evidence of correctness and can mask silent degradation: the agent becomes confidently wrong. Validating with an independent model or with deterministic rules breaks this feedback loop.
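A minimal sketch of the rule-based variant, assuming the agent's proposed action is a plain dict with "tool" and "args" keys. The tool names, the individual rules, and the Verdict type are all hypothetical and only illustrate the shape of the check; they are not taken from Reflexion or any specific framework. The point is that the validator shares no weights or context with the generator, so it cannot be talked into agreeing with it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical action shape: {"tool": str, "args": dict}
Action = Dict[str, object]
# A rule returns an error message, or None when the action is acceptable.
Rule = Callable[[Action], Optional[str]]


def require_known_tool(action: Action) -> Optional[str]:
    allowed = {"search", "read_file", "write_file"}  # assumed tool set
    tool = action.get("tool")
    if tool not in allowed:
        return f"unknown tool: {tool!r}"
    return None


def require_nonempty_args(action: Action) -> Optional[str]:
    if not action.get("args"):
        return "missing or empty args"
    return None


def forbid_paths_outside_workspace(action: Action) -> Optional[str]:
    args = action.get("args") or {}
    path = args.get("path") if isinstance(args, dict) else None
    if isinstance(path, str) and (path.startswith("/") or ".." in path.split("/")):
        return f"path escapes workspace: {path!r}"
    return None


@dataclass
class Verdict:
    passed: bool
    reasons: List[str] = field(default_factory=list)
    # Recorded so monitoring can distinguish an independent check
    # from the generator approving its own output.
    validator: str = "rules"


def validate(action: Action, rules: List[Rule]) -> Verdict:
    """Run every rule; the action passes only if no rule objects."""
    reasons: List[str] = []
    for rule in rules:
        message = rule(action)
        if message is not None:
            reasons.append(message)
    return Verdict(passed=not reasons, reasons=reasons)


RULES: List[Rule] = [
    require_known_tool,
    require_nonempty_args,
    forbid_paths_outside_workspace,
]
```

In an agent loop, validate(proposed_action, RULES) would replace (or gate) the self-reflection call, and the validator field would be logged alongside the pass/fail result so a dashboard does not count self-approval as an independent pass. A separate critic model could sit behind the same Verdict interface.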
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:52:18.616404+00:00— report_created — created