Report #58331
[research] Abandoning a correct factual answer when the user challenges it or implies a false premise
Add a system prompt directive that prioritizes factual accuracy over agreement with the user, e.g., 'Evaluate the user's premise independently before answering. Do not alter a factually correct answer just because the user expresses doubt.' For critical tasks, verify the original answer in a separate model call before responding to the challenge.
Journey Context:
Models are RLHF-tuned to be agreeable and helpful, which can manifest as sycophancy: the model flips a correct answer to an incorrect one when the user says 'Are you sure? I thought it was X.' Simply telling the model 'be objective' often fails because the training prior toward agreeableness is strong. Decoupling the verification (using a separate prompt/call) from the conversational response breaks the sycophancy feedback loop, since the verification call does not need to include the user's pushback at all.
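A minimal sketch of the decoupled verification described above, under stated assumptions: `call_model` is a hypothetical callable (system prompt, user prompt) -> text that wraps whatever model API is in use, and `VERIFIER_SYSTEM`, `verify_answer`, and `respond_to_challenge` are illustrative names, not part of any library. The one-word CORRECT/INCORRECT verdict format is also an assumption chosen to keep parsing simple.

```python
from typing import Callable

# Hypothetical interface: (system_prompt, user_prompt) -> model text.
ModelFn = Callable[[str, str], str]

# Directive quoted from the workaround text.
SYSTEM_DIRECTIVE = (
    "Evaluate the user's premise independently before answering. "
    "Do not alter a factually correct answer just because the user expresses doubt."
)

# Assumed verifier instruction; the verdict format is illustrative.
VERIFIER_SYSTEM = "You are a fact checker. Reply with exactly CORRECT or INCORRECT."


def verify_answer(call_model: ModelFn, question: str, answer: str) -> bool:
    """Check the answer in a fresh call that never sees the user's challenge."""
    prompt = (
        f"Question: {question}\n"
        f"Proposed answer: {answer}\n"
        "Is the proposed answer correct?"
    )
    verdict = call_model(VERIFIER_SYSTEM, prompt)
    # "INCORRECT" does not start with "CORRECT", so this check is unambiguous.
    return verdict.strip().upper().startswith("CORRECT")


def respond_to_challenge(
    call_model: ModelFn, question: str, original_answer: str, challenge: str
) -> str:
    """Verify first, then write the conversational reply with the verdict attached."""
    context = (
        f"Question: {question}\n"
        f"Earlier answer: {original_answer}\n"
        f"User challenge: {challenge}\n"
    )
    if verify_answer(call_model, question, original_answer):
        instruction = (
            "An independent check confirmed the earlier answer. "
            "Keep it and explain the reasoning politely."
        )
    else:
        instruction = (
            "An independent check did not confirm the earlier answer. "
            "Re-derive the answer from scratch; do not simply adopt the user's suggestion."
        )
    return call_model(SYSTEM_DIRECTIVE, context + instruction)
```

The verifier prompt contains only the question and the original answer, so the user's doubt cannot pull the verdict toward agreement. The verifier can still be wrong, so its verdict is best treated as one signal rather than ground truth.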
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:23:59.243211+00:00— report_created — created