Report #7006
[research] LLM answers questions it should abstain from, leading to high hallucination rates on out-of-distribution queries
Implement a 'selective prediction' threshold: if the model's internal confidence (via self-consistency or logit probability) falls below a calibrated threshold, force the output to 'I don't know' or route to a human/search engine.
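A minimal sketch of the logit-probability variant, assuming the inference API returns a log-probability for each generated token. The names `sequence_confidence` and `answer_or_abstain` and the threshold value are hypothetical and not from the report; the threshold would need to be calibrated on held-out data.

```python
import math
from typing import Sequence

ABSTAIN = "I don't know"


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability, i.e. exp(mean log-prob).

    Length-normalised so long answers are not penalised just for being long.
    An empty sequence is treated as zero confidence.
    """
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def answer_or_abstain(answer: str,
                      token_logprobs: Sequence[float],
                      threshold: float) -> str:
    """Return the answer only if its confidence reaches the threshold.

    `threshold` is an assumed, externally calibrated value in [0, 1].
    """
    if sequence_confidence(token_logprobs) >= threshold:
        return answer
    return ABSTAIN


if __name__ == "__main__":
    # Hypothetical log-probs: a confident generation and an uncertain one.
    print(answer_or_abstain("Paris", [-0.01, -0.02], threshold=0.8))
    print(answer_or_abstain("Atlantis", [-2.0, -3.0], threshold=0.8))
```

Instead of returning the abstention string, the low-confidence branch could hand the query to a human or a search engine.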
Journey Context:
The default behavior of an LLM is to always attempt an answer, even for nonsensical or out-of-scope questions. Tuning the system prompt to 'say I don't know if you aren't sure' is notoriously unreliable because the model's own sense of uncertainty is not accurate enough to trigger the instruction when it should. Forcing abstention based on an external metric, such as the variance across repeated generations, is a more robust, programmatic guardrail because the decision is made outside the model.
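The self-consistency variant can be sketched as follows: sample the same question several times and abstain when the samples disagree too much. This is an illustration under stated assumptions, not a tested implementation: `sample_fn` is a hypothetical callable that returns one sampled answer per call (temperature above zero), and exact match after light normalisation is a crude agreement measure that suits short factual answers better than free-form text.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

ABSTAIN = "I don't know"


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, drop surrounding punctuation."""
    return " ".join(text.lower().split()).strip(" .!?")


def self_consistency(samples: List[str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples agreeing with it."""
    if not samples:
        return "", 0.0
    first_raw: Dict[str, str] = {}
    for s in samples:
        # Remember the first original spelling of each normalised answer.
        first_raw.setdefault(normalize(s), s)
    counts = Counter(normalize(s) for s in samples)
    key, votes = counts.most_common(1)[0]
    return first_raw[key], votes / len(samples)


def selective_answer(sample_fn: Callable[[str], str],
                     question: str,
                     n: int = 5,
                     threshold: float = 0.6) -> str:
    """Answer only when at least `threshold` of `n` samples agree.

    `n` and `threshold` are assumed defaults; calibrate them on held-out data.
    """
    samples = [sample_fn(question) for _ in range(n)]
    answer, agreement = self_consistency(samples)
    return answer if agreement >= threshold else ABSTAIN


if __name__ == "__main__":
    # Stand-in for a real model: replays canned samples.
    canned = iter(["Paris", "paris.", "Paris", "Lyon", "Paris"])
    print(selective_answer(lambda q: next(canned), "Capital of France?"))
```

The cost is `n` generations per query, so the sample count trades latency and spend against how reliable the agreement estimate is.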
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T01:37:37.944694+00:00— report_created — created