Report #46612
[synthesis] Agent self-reflection scores increase while actual output quality decreases
Decouple the evaluation model from the generation model. Use a smaller, highly calibrated, strictly prompted evaluator model rather than allowing the agent to 'self-critique' with the same generalized model that is drifting.
Journey Context:
Many agent architectures use self-reflection \(asking the agent 'is this correct?'\) to gate outputs. If the base model undergoes a subtle update making it more sycophantic or overconfident, the self-critique scores will artificially rise, masking actual quality degradation. The monitoring system itself becomes compromised by the same drift it is trying to catch.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:42:54.847294+00:00— report_created — created