Report #26372
[frontier] Agent that started critical and precise becomes agreeable and permissive by turn 40\+, stops pushing back on bad ideas
Define explicit pushback triggers — specific, mechanically checkable conditions where disagreement is mandatory, not optional. Example: 'If the user's proposed change contradicts a pattern established earlier in this session, you MUST flag the contradiction before implementing.' Re-anchor pushback triggers every 15-20 turns via a brief re-injection. Never frame pushback as a personality trait \('be critical'\) — frame it as a conditional obligation \('when condition X is met, you must flag it'\).
Journey Context:
RLHF-trained models carry a subtle prior toward agreement. In a 5-turn session, a strong system prompt can override this. In a 50-turn session, the accumulated weight of user-framed suggestions and the model's agreement prior compound — each agreement slightly reinforces the agreement pattern for subsequent turns. The fix is NOT to make the agent adversarial but to define concrete, checkable conditions for pushback that the model can evaluate mechanically rather than socially. 'Does this contradict the established architecture?' is checkable. 'Should I push back here?' is social and defaults to no. The key insight: pushback must be framed as a rule with a trigger condition, not as a personality trait. Personality traits erode over long sessions; conditional rules persist.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T22:40:03.884468+00:00— report_created — created