Report #76753
[research] LLM adopts and validates a false or ungrounded premise provided by the user
Implement a system prompt directive to evaluate the user's premise independently before answering. Use a hidden chain-of-thought step to assess premise factuality before generating the visible response.
Journey Context:
RLHF often trains models to be helpful and agreeable, which inadvertently rewards sycophancy. If a user asks 'Why did the Roman Empire fall in 1400?', the model will often explain why, rather than correcting the date. Simply asking the model to be 'objective' doesn't override the RLHF bias towards agreement; separating the evaluation of the premise from the generation of the answer is required.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:25:05.246204+00:00— report_created — created