Report #85107
[research] Model refuses to answer factual queries about sensitive but permissible topics, claiming it lacks knowledge
Use a two-model architecture: a factual retrieval model without over-refusal filters gathers context, and a separate safety classifier evaluates the final generated output, instead of relying on the generator's internal safety triggers.
Journey Context:
Post-alignment models (RLHF for safety) often exhibit 'false refusals', also called over-safety: benign factual queries (e.g., 'How does a computer virus replicate?') trigger a refusal, which in practice looks like the model answering 'I don't know' to something it does know. Forcing the generator to answer risks violating safety guidelines. Decoupling knowledge retrieval from safety evaluation lets the system reach the factual grounding without tripping the generator's refusal heuristics, while the generated output is still checked by the separate classifier.
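A minimal sketch of the decoupled flow, assuming the retriever, generator, and safety classifier are each exposed as plain callables. The names `answer_with_decoupled_safety`, `toy_retrieve`, `toy_generate`, and `toy_is_unsafe` are hypothetical stand-ins and not part of any real library; the keyword check only marks where a real classifier would go and is not a usable safety filter.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PipelineResult:
    answer: str
    blocked: bool


def answer_with_decoupled_safety(
    query: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    is_unsafe: Callable[[str], bool],
    fallback: str = "This response was withheld by the output safety check.",
) -> PipelineResult:
    """Gather context, draft an answer, then judge only the final draft."""
    # Step 1: factual retrieval, with no refusal logic of its own.
    context = retrieve(query)
    # Step 2: generation grounded in the retrieved context.
    draft = generate(query, context)
    # Step 3: a separate classifier evaluates the generated output.
    if is_unsafe(draft):
        return PipelineResult(answer=fallback, blocked=True)
    return PipelineResult(answer=draft, blocked=False)


# Toy stand-ins (assumptions for illustration only).
def toy_retrieve(query: str) -> List[str]:
    return ["A computer virus replicates by attaching copies of its code to other programs or files."]


def toy_generate(query: str, context: List[str]) -> str:
    return " ".join(context)


def toy_is_unsafe(text: str) -> bool:
    # Placeholder check; a real deployment would call a trained classifier here.
    markers = ("working malware source", "step-by-step exploit")
    return any(marker in text.lower() for marker in markers)


result = answer_with_decoupled_safety(
    "How does a computer virus replicate?",
    toy_retrieve,
    toy_generate,
    toy_is_unsafe,
)
print(result.blocked, result.answer)
```

Because the safety decision is made on the draft rather than on the query, a benign question that merely sounds sensitive can still be answered, and a draft that does cross the line is replaced by the fallback message.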
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:26:15.388445+00:00— report_created — created