Report #80043
[research] Agent agrees with a user's incorrect premise or provides a confident answer when it lacks information instead of expressing calibrated uncertainty
Implement a verification step in which the agent critiques its own answer before finalizing it. If the agent cannot verify a claim via tools, force an explicit 'I don't know' or a low-confidence disclaimer. Adjusting generation parameters (e.g., lower temperature) may help at the margins, but sampling parameters such as presence penalty act on token repetition and cannot selectively target confident assertions, so they should not be relied on alone to reduce sycophancy.
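A minimal sketch of the verification gate described above, assuming the agent exposes three callables. The names `draft`, `critique`, and `verify`, and the `FinalAnswer` type, are hypothetical and not part of any specific framework; a real agent would back them with model and tool calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FinalAnswer:
    text: str
    confidence: str  # "verified", "low", or "unknown"


def finalize_answer(
    question: str,
    draft: Callable[[str], str],
    critique: Callable[[str, str], bool],
    verify: Callable[[str], Optional[bool]],
) -> FinalAnswer:
    """Gate a drafted answer behind self-critique and tool verification.

    All three callables are hypothetical stand-ins:
      draft(question)            -> candidate answer text
      critique(question, answer) -> True if the critique pass finds no flaw
      verify(answer)             -> True (confirmed), False (contradicted),
                                    or None (tools could not check the claim)
    """
    answer = draft(question)
    checked = verify(answer)

    # Tools contradicted the claim, or the self-critique found a flaw:
    # refuse explicitly rather than answer confidently.
    if checked is False or not critique(question, answer):
        return FinalAnswer("I don't know.", "unknown")

    # Tools could not check the claim: keep the answer but attach
    # a low-confidence disclaimer.
    if checked is None:
        return FinalAnswer(f"Low confidence (unverified): {answer}", "low")

    return FinalAnswer(answer, "verified")
```

The gate lives outside the model on purpose: the disclaimer is applied by control flow, so it does not depend on the model choosing to hedge.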
Journey Context:
RLHF tends to penalize refusals heavily, which encourages sycophancy (the model pleases the user by answering rather than declining). Simply prompting 'say I don't know if you don't know' is insufficient because the model cannot reliably distinguish its own high-confidence outputs from low-confidence ones. Explicit calibration metrics or self-critique chains are required to override the helpfulness prior.
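One common calibration metric is expected calibration error (ECE), which compares stated confidence with observed accuracy. The report does not say which metric was intended, so the sketch below is an assumption; the function name, the equal-width binning, and the default of 10 bins are illustrative choices.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between stated confidence and accuracy.

    confidences: the agent's stated probability (0.0 to 1.0) per answer
    correct:     whether each answer was actually right
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one sample is required")

    # Group samples into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Each bin contributes its gap, weighted by its share of samples.
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

An agent that states 90% confidence but is right only half the time scores a large gap under this metric, which is the overconfidence pattern the report describes.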
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:57:36.988172+00:00 — report_created — created