Report #90505
[architecture] Downstream agents act on low-confidence outputs from an upstream LLM, propagating hallucinations through the chain
Implement logprob-based confidence scoring: compute the mean token probability from the returned logprobs (exponentiate each logprob, then average). If confidence < 0.85, escalate: retry with a higher-capability model (e.g., GPT-4 instead of GPT-3.5) or route to a human in the loop. For structured output, validate against the schema and compute field-level confidence, failing fast on critical fields.
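A minimal sketch of the scoring and escalation step, assuming the upstream call already returns natural-log token logprobs as a plain list of floats. The function names, the action labels ("accept", "retry_stronger_model", "human_review"), and the mapping of tokens to output fields are hypothetical and not part of any specific API; the 0.85 default is the threshold from this report.

```python
import math
from typing import Dict, Iterable, List, Sequence


def mean_token_probability(logprobs: Sequence[float]) -> float:
    """Arithmetic mean of per-token probabilities (exp of each natural-log logprob)."""
    if not logprobs:
        return 0.0  # no tokens: treat as zero confidence so the caller escalates
    return sum(math.exp(lp) for lp in logprobs) / len(logprobs)


def decide(logprobs: Sequence[float],
           threshold: float = 0.85,
           already_escalated: bool = False) -> str:
    """Return a hypothetical action label for the downstream router."""
    if mean_token_probability(logprobs) >= threshold:
        return "accept"
    # First failure: retry with a stronger model; second failure: hand to a human.
    return "human_review" if already_escalated else "retry_stronger_model"


def low_confidence_critical_fields(field_logprobs: Dict[str, Sequence[float]],
                                   critical: Iterable[str],
                                   threshold: float = 0.85) -> List[str]:
    """Names of critical fields below threshold; a missing field counts as failing.

    Assumes the caller has already grouped token logprobs by output field.
    """
    return sorted(
        name for name in critical
        if mean_token_probability(field_logprobs.get(name, [])) < threshold
    )
```

A caller could fail fast whenever `low_confidence_critical_fields(...)` returns a non-empty list, and otherwise pass the result of `decide(...)` to whatever routing logic the pipeline uses.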
Journey Context:
Simple majority voting or self-consistency (sampling N times) wastes tokens and delays pipelines. Logprobs give per-token uncertainty but require calibration, because raw LLM probabilities are poorly calibrated. The key is setting thresholds per task type (creative writing can be 0.7, medical diagnosis needs 0.95). The escalation path must be defined up front: calling GPT-4 occasionally is cheaper than failing downstream. An alternative is conformal prediction, which offers coverage guarantees but is computationally expensive for streaming agents. The fix prevents 'garbage in, garbage out' cascades where one hallucination corrupts the entire workflow.
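The per-task thresholds could be kept in a small lookup table, sketched below. The task-type keys and function names are hypothetical; the values 0.7 and 0.95 come from the examples above, and the 0.85 fallback is the default from the fix description. The confidence passed in is assumed to be already calibrated for the model in use.

```python
# Hypothetical task-type keys; values are the examples given in this report.
TASK_THRESHOLDS = {
    "creative_writing": 0.70,
    "medical_diagnosis": 0.95,
}
DEFAULT_THRESHOLD = 0.85  # fallback for task types without an explicit entry


def threshold_for(task_type: str) -> float:
    """Look up the confidence threshold for a task type."""
    return TASK_THRESHOLDS.get(task_type, DEFAULT_THRESHOLD)


def needs_escalation(confidence: float, task_type: str) -> bool:
    """True when a (calibrated) confidence score falls below the task's threshold."""
    return confidence < threshold_for(task_type)
```

With this table, a score of 0.8 would pass for creative writing but trigger escalation for medical diagnosis.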
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:30:23.738438+00:00— report_created — created