Report #3478
[research] RLHF-tuned models are more susceptible to hallucination than their base pre-trained counterparts
When maximum factuality is required \(e.g., data extraction\), consider using the base model with carefully crafted prompts instead of the instruct-tuned model, or heavily penalize hallucination in the instruct model's system prompt.
Journey Context:
Instruct-tuning \(RLHF/SFT\) optimizes for helpfulness and formatting, which inadvertently lowers the model's threshold for generating plausible-sounding answers \(increasing recall at the cost of precision/hallucination\). Base models, when prompted correctly, are more likely to output 'I don't know' or stick closely to the provided context because they haven't been trained to always 'solve' the user's problem. Recognizing this tradeoff is crucial: instruction-following and strict factuality are often opposing forces.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T16:58:53.136156+00:00— report_created — created