Report #62926
[research] LLM refuses to answer factual, benign queries because overly aggressive safety filters misclassify the prompt
Implement a two-step pipeline: first classify whether the intent is genuinely malicious or harmful; if it is not, proceed with grounded retrieval. Use a separate, smaller model for intent classification rather than relying on the generation model's built-in refusals.
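The two steps above can be sketched roughly as follows. This is a minimal illustration, not a tested workaround: `classify_intent` is a hypothetical keyword heuristic standing in for the separate, smaller classifier model, and `CORPUS`, `retrieve`, `generate`, and `answer_query` are placeholder names invented for this example. The point it tries to show is that a sensitive topic alone does not trigger a refusal, and that "no evidence found" is reported separately from "refused".

```python
from dataclasses import dataclass
from typing import List

# Hypothetical marker lists. A real deployment would call a small trained
# intent classifier here instead of matching substrings.
HOWTO_MARKERS = ("how do i build", "how do i make", "instructions to make")
SENSITIVE_TOPICS = ("weapon", "explosive", "poison")

# Toy document store, assumed for illustration only.
CORPUS = {
    "hastings": "The Battle of Hastings was fought in 1066.",
}


@dataclass
class Result:
    status: str  # "refused", "no_evidence", or "answered"
    text: str


def classify_intent(query: str) -> str:
    """Step 1: judge intent, not topic. Sensitive topic alone stays benign."""
    q = query.lower()
    asks_for_howto = any(marker in q for marker in HOWTO_MARKERS)
    touches_sensitive = any(topic in q for topic in SENSITIVE_TOPICS)
    return "harmful" if (asks_for_howto and touches_sensitive) else "benign"


def retrieve(query: str) -> List[str]:
    """Step 2a: placeholder retrieval by keyword lookup in the toy corpus."""
    q = query.lower()
    return [text for key, text in CORPUS.items() if key in q]


def generate(query: str, passages: List[str]) -> str:
    """Step 2b: placeholder generation that only restates retrieved text."""
    return " ".join(passages)


def answer_query(query: str) -> Result:
    # The safety decision is made once, up front, by the classifier.
    if classify_intent(query) == "harmful":
        return Result("refused", "Declined by the intent classifier.")
    passages = retrieve(query)
    # Missing knowledge is surfaced as its own status, not as a refusal.
    if not passages:
        return Result("no_evidence", "No supporting source was found.")
    return Result("answered", generate(query, passages))
```

With these placeholders, a question such as "What weapons were used at the Battle of Hastings?" passes the intent check and is answered from the corpus, while a benign question with no matching source comes back as `no_evidence` rather than as a refusal.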
Journey Context:
Post-RLHF models often exhibit exaggerated safety behavior, or false refusals: they treat standard factual queries (e.g., about historical weapons or medical terms) as dangerous. This hurts factuality by blocking access to true information. Decoupling the safety check from the generation step allows more nuanced intent parsing and keeps the model from hiding behind a refusal when it simply does not know the answer.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T12:06:14.147840+00:00— report_created — created