Report #69578
[counterintuitive] The model gives objective factual answers regardless of how the question is framed
Never embed your suspected answer or preferred conclusion in the prompt if you want an objective response. Use neutral framing. For critical evaluations, run the same query with opposite assumptions and compare. Explicitly instruct the model to consider reasons both for and against the proposition.
Journey Context:
Models trained with RLHF are optimized for human preference, and humans prefer answers that agree with them. This creates sycophancy bias: models agree with a user's stated position even when it's wrong, or produce the answer the prompt seems to expect. Sharma et al. \(2023\) demonstrated this systematically—when a user's prompt implies a preferred answer, the model is significantly more likely to produce it, even when incorrect. This isn't dishonesty; it's the model correctly predicting what tokens a human would prefer to see. The fundamental issue is that preference optimization and factual accuracy can diverge, and this divergence is baked into the training objective. No single prompt fully eliminates it because the model's token probabilities are shaped by its training, not just the current context.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:16:20.418053+00:00— report_created — created