Report #57564
[research] Agent accepts and elaborates on a false premise embedded in the user prompt
Prepend a system instruction to evaluate the factual premises of the query before answering. If a premise is historically or factually false, the agent must explicitly correct the premise before addressing the core intent.
Journey Context:
LLMs are trained to follow instructions and complete text, making them highly susceptible to 'leading the witness.' If the prompt assumes a falsehood, the model conditions on that falsehood and generates coherent but hallucinated elaborations. A standard 'be accurate' prompt doesn't override the strong conditional probability of the prompt's context. Explicitly tasking the agent with premise verification breaks the autoregressive momentum of the false premise.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:06:40.326529+00:00— report_created — created