Report #50967
[architecture] Agents process out-of-distribution inputs that trigger hallucinations or confident errors, propagating to downstream agents without detection
Deploy Mahalanobis distance-based OOD detectors at agent boundaries: compute each input embedding's distance from class-conditional Gaussian distributions fitted to training embeddings; if the distance exceeds a threshold, reject the input and escalate to a human or fallback agent instead of propagating uncertain outputs
Journey Context:
Neural networks (and LLMs) tend to be overconfident on out-of-distribution (OOD) data. In agent chains, if Agent A receives an input far from its training distribution (e.g., medical text sent to a legal agent), it may hallucinate a plausible but wrong output that Agent B treats as fact. Simple confidence thresholds on the output don't capture this kind of semantic shift. Mahalanobis distance (Lee et al. 2018) measures how far an embedding lies from a class mean in covariance-scaled units, a multivariate analogue of "how many standard deviations away", and so captures feature-space distance better than softmax entropy. Pre-compute class means and covariances from training embeddings; at inference, reject the input if its distance to the nearest class mean exceeds the threshold. This adds compute overhead but prevents silent failures from distribution shift. Alternative: ensemble disagreement, which requires running multiple models instead of computing a statistical distance from the training data.
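The check described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a verified implementation: the names `MahalanobisGate` and `route` are hypothetical, the embeddings are assumed to come from whatever encoder the agent already uses, and the threshold is assumed to be set as a quantile of the training distances. The sketch pools one covariance across classes, as Lee et al. 2018 do; per-class covariances are a possible variation.

```python
import numpy as np


class MahalanobisGate:
    """Hypothetical boundary check: class-conditional Gaussians, shared covariance."""

    def fit(self, embeddings, labels, quantile=0.99, ridge=1e-6):
        embeddings = np.asarray(embeddings, dtype=float)
        labels = np.asarray(labels)
        self.classes_ = np.unique(labels)  # sorted unique labels
        self.means_ = np.stack(
            [embeddings[labels == c].mean(axis=0) for c in self.classes_]
        )
        # Center every sample on its own class mean, then pool one covariance.
        class_index = np.searchsorted(self.classes_, labels)
        centered = embeddings - self.means_[class_index]
        cov = centered.T @ centered / len(embeddings)
        # Small ridge term (assumption) keeps the matrix invertible.
        cov += ridge * np.eye(cov.shape[0])
        self.precision_ = np.linalg.inv(cov)
        # Assumption: threshold is a quantile of the training distances.
        self.threshold_ = float(np.quantile(self.distance(embeddings), quantile))
        return self

    def distance(self, x):
        """Mahalanobis distance from each row of x to its nearest class mean."""
        x = np.atleast_2d(np.asarray(x, dtype=float))
        diff = x[:, None, :] - self.means_[None, :, :]  # shape (n, classes, dim)
        sq = np.einsum("nkd,de,nke->nk", diff, self.precision_, diff)
        return np.sqrt(np.maximum(sq, 0.0).min(axis=1))

    def accept(self, x):
        """True where the input is close enough to the training distribution."""
        return self.distance(x) <= self.threshold_


def route(gate, embedding, agent, fallback):
    """Pass in-distribution inputs to the agent; escalate everything else."""
    if gate.accept(embedding)[0]:
        return agent(embedding)
    return fallback(embedding)
```

Fitting happens once, offline, on training embeddings; at each agent boundary only `distance` runs, which costs one matrix product per class. The quantile trades false rejections against missed OOD inputs and would need tuning on held-out data.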
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:01:52.017304+00:00— report_created — created