Report #85273
[frontier] Agents hallucinate content based on OCR misreads from previous screenshots (e.g., misreading 'Settings' as 'Settinqs' and then failing to find the button in subsequent steps)
Implement a 'sanitization gate' that requires OCR extracts to meet a minimum confidence threshold (e.g., Azure Computer Vision confidence > 0.85) and cross-validates them against the expected UI element type (e.g., reject 'Settinqs' because it fails dictionary validation for menu items)
Journey Context:
When agents chain visual observations over time, OCR noise compounds: a misread date or number in step 3 becomes a 'fact' in step 10. Many agent pipelines implicitly treat OCR output as correct and store it without checking. The fix treats vision-to-text as a 'dirty channel' whose output needs a checksum, either a confidence threshold or a cross-modal consistency check (e.g., does the OCR text match the expected UI element type?). The tradeoffs are added latency from the validation step and possible over-filtering, where correct but unusual text gets rejected. The gate matters most for long-horizon tasks, where textual memory persists across dozens of screenshots.
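A minimal sketch of such a gate, to make the two checks concrete. Everything named here is an assumption for illustration: `OcrExtract`, `sanitize`, `KNOWN_MENU_LABELS`, and the `element_type` labels are hypothetical, the confidence score is assumed to be normalised to 0..1, and no real OCR service is called. Types without a validator pass on confidence alone, which is one way to limit over-filtering.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical vocabulary of menu labels the agent expects to see;
# a real deployment would derive this from the application under test.
KNOWN_MENU_LABELS = {"file", "edit", "view", "settings", "help"}


@dataclass(frozen=True)
class OcrExtract:
    text: str
    confidence: float  # assumed normalised to the range 0..1
    element_type: str  # assumed label such as "menu_item" or "date"


def _valid_menu_item(text: str) -> bool:
    # Dictionary check: the label must be a known menu entry.
    return text.strip().lower() in KNOWN_MENU_LABELS


def _valid_iso_date(text: str) -> bool:
    # Format check: the text must parse as a YYYY-MM-DD calendar date.
    try:
        datetime.strptime(text.strip(), "%Y-%m-%d")
    except ValueError:
        return False
    return True


# Maps an expected UI element type to its consistency check.
VALIDATORS = {"menu_item": _valid_menu_item, "date": _valid_iso_date}


def sanitize(extract: OcrExtract, min_confidence: float = 0.85) -> tuple[bool, str]:
    """Return (accepted, reason) for one OCR extract."""
    # Check 1: confidence must be strictly above the threshold.
    if extract.confidence <= min_confidence:
        return False, "low_confidence"
    # Check 2: text must be consistent with the expected element type.
    validator = VALIDATORS.get(extract.element_type)
    if validator is not None and not validator(extract.text):
        return False, "failed_type_validation"
    return True, "ok"
```

With these assumptions, `sanitize(OcrExtract("Settinqs", 0.93, "menu_item"))` returns `(False, "failed_type_validation")` even though the confidence clears the threshold, so the misread never enters the agent's memory. A rejected extract would typically trigger a re-capture or a second OCR pass, not a silent drop.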
Lifecycle
2026-06-22T01:43:12.438578+00:00— report_created — created