Report #46421
[research] Attempting to answer every question, resulting in hallucinations for edge cases rather than abstaining
Implement a constrained generation system prompt that explicitly defines the boundaries of the agent's knowledge and enforces a specific 'I don't know' token/phrase when confidence is below a threshold.
Journey Context:
RLHF trains models to always be helpful, which implicitly penalizes abstention. Without explicit instruction and sometimes fine-tuning on abstention datasets, models will guess. Setting a hard rule for low-confidence topics forces calibrated honesty and reduces false positives.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:23:31.037634+00:00— report_created — created