Report #40574
[research] LLM attempts to answer obscure or highly specific technical questions it has no knowledge of, instead of abstaining
Implement selective question answering (abstention) by setting a probability threshold on the model's logprobs, or explicitly train or prompt the model to output a standardized 'I don't know' token when internal retrieval yields low relevance scores.
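A minimal sketch of the logprob-threshold variant, assuming the serving stack can return per-token log probabilities for the generated answer. The names `sequence_confidence`, `answer_or_abstain`, and `ABSTAIN_TEXT`, and the default threshold of 0.75, are hypothetical and chosen for illustration only. The confidence score used here is the geometric mean of the token probabilities, which is one common choice and not necessarily the best one for a given model.

```python
import math
from typing import Sequence

# Hypothetical standardized abstention string; pick whatever your system expects.
ABSTAIN_TEXT = "I don't know"


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Return exp(mean logprob), the geometric mean of the token probabilities.

    An empty sequence is treated as zero confidence so that it always abstains.
    """
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def answer_or_abstain(
    answer: str,
    token_logprobs: Sequence[float],
    threshold: float = 0.75,  # assumed value; must be tuned on held-out data
) -> str:
    """Return the answer only if its confidence clears the threshold."""
    if sequence_confidence(token_logprobs) < threshold:
        return ABSTAIN_TEXT
    return answer
```

For example, `answer_or_abstain("Paris", [-0.01, -0.02])` keeps the answer, while `answer_or_abstain("X", [-1.2, -0.9])` abstains. Length-normalizing by the mean avoids penalizing long answers purely for their length, but raw token likelihood is not a calibrated measure of factual correctness, so the threshold should be validated against labeled examples.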
Journey Context:
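Standard RLHF trains models to be helpful, which implicitly penalizes 'I don't know' responses and leads to hallucinations on out-of-distribution queries. Tuning the abstention threshold is a precision/recall tradeoff (more precisely, accuracy on answered questions versus coverage): a lower confidence threshold raises the answer rate but also the hallucination risk, while a higher threshold produces more 'I don't know' responses but improves the factuality of the answers that are given. Benchmarks such as TruthfulQA, which is built around questions that invite common misconceptions, suggest that models often give a confident false answer on trick questions unless an explicit abstention mechanism is in place.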
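The threshold tradeoff described above can be made concrete by sweeping candidate thresholds over a labeled evaluation set and reporting coverage (the fraction of questions answered) alongside accuracy on the answered subset. The sketch below assumes each evaluation record is a `(confidence, was_correct)` pair. The function name `sweep_thresholds` and the record layout are hypothetical.

```python
from typing import Iterable, List, Optional, Sequence, Tuple

# Each record: (confidence score in [0, 1], whether the model's answer was correct).
Record = Tuple[float, bool]


def sweep_thresholds(
    records: Sequence[Record],
    thresholds: Iterable[float],
) -> List[Tuple[float, float, Optional[float]]]:
    """For each threshold, return (threshold, coverage, accuracy_on_answered).

    The model "answers" a question when its confidence is >= threshold and
    abstains otherwise. Accuracy is None when nothing was answered.
    """
    rows = []
    for t in thresholds:
        answered = [correct for conf, correct in records if conf >= t]
        coverage = len(answered) / len(records) if records else 0.0
        accuracy = sum(answered) / len(answered) if answered else None
        rows.append((t, coverage, accuracy))
    return rows


if __name__ == "__main__":
    # Toy values for illustration only, not measurements from any model.
    toy = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
    for t, cov, acc in sweep_thresholds(toy, [0.0, 0.5, 0.85]):
        print(f"threshold={t:.2f} coverage={cov:.2f} accuracy={acc}")
```

On the toy data, raising the threshold from 0.0 to 0.85 shrinks coverage from 1.0 to 0.2 while accuracy on answered questions rises from 0.6 to 1.0. Real data will rarely be this clean, so the operating point is best chosen from a sweep over a representative held-out set.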
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:34:38.356616+00:00— report_created — created