Report #29398
[frontier] Agent misreads 'Confirm' as 'Cancel' in a low-contrast screenshot, causing task failure
Use a dedicated OCR pipeline (Tesseract or AWS Textract) to extract text with bounding boxes, then feed the structured OCR output (JSON with text + bbox) to the LLM alongside the screenshot. Never rely on the LLM's implicit vision OCR for high-stakes text extraction.
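A minimal sketch of the extraction step, assuming Tesseract is reached through the `pytesseract` wrapper and Pillow. The helper names (`ocr_dict_to_words`, `ocr_screenshot`, `build_prompt`) and the `text`/`bbox`/`conf` field names are illustrative choices, not part of any library. The conversion helper only needs the column-oriented dict that `pytesseract.image_to_data(..., output_type=Output.DICT)` returns, so it can be exercised without Tesseract installed.

```python
import json


def ocr_dict_to_words(data):
    """Turn a Tesseract-style column dict into a list of word records.

    Rows with empty text or a negative confidence are layout rows
    (page, block, line), not recognised words, so they are skipped.
    """
    words = []
    for i, raw_text in enumerate(data["text"]):
        text = raw_text.strip()
        conf = float(data["conf"][i])  # some versions return conf as strings
        if not text or conf < 0:
            continue
        words.append({
            "text": text,
            # bbox layout is an assumption of this sketch: [left, top, width, height]
            "bbox": [data["left"][i], data["top"][i],
                     data["width"][i], data["height"][i]],
            "conf": conf,
        })
    return words


def ocr_screenshot(path):
    """Run Tesseract on an image file and return word records."""
    # Imported lazily so ocr_dict_to_words stays usable without these packages.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(path), output_type=pytesseract.Output.DICT
    )
    return ocr_dict_to_words(data)


def build_prompt(words):
    """Serialise the OCR words as JSON to send alongside the screenshot."""
    header = "OCR words (treat as the authoritative text; bbox = [left, top, width, height]):\n"
    return header + json.dumps(words, ensure_ascii=False)


# Hand-written sample in the same shape as Tesseract's DICT output.
sample = {
    "text": ["", "Confirm", "Cancel"],
    "conf": ["-1", "91.5", "88.0"],
    "left": [0, 120, 260],
    "top": [0, 400, 400],
    "width": [640, 88, 80],
    "height": [480, 32, 32],
}
words = ocr_dict_to_words(sample)
prompt = build_prompt(words)
```

Keeping the confidence value in each record lets a later step decide how much to trust a word instead of discarding low-confidence text up front, which matters for low-contrast screenshots.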
Journey Context:
Vision models are not reliable OCR engines: they struggle with stylized fonts, low contrast, and small text, and they can hallucinate text that isn't there. For reliable automation, use an OCR engine to get the text content and bounding boxes, then present that to the LLM as structured data (JSON). The LLM can still see the screenshot for context, but the 'ground truth' text comes from OCR.
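One way to enforce the "ground truth comes from OCR" rule is to check the label the LLM says it wants to click against the OCR words before acting. The sketch below assumes the OCR output is already a list of dicts with `text`, `bbox` (`[left, top, width, height]`) and `conf` fields; that shape, the function name `resolve_click_target`, and the threshold of 80 are hypothetical choices for illustration.

```python
def resolve_click_target(llm_label, ocr_words, min_conf=80.0):
    """Return the centre (x, y) of the OCR word matching llm_label, or None.

    The click is refused when the label is absent from the OCR output,
    was recognised with low confidence, or matches more than one word.
    """
    wanted = llm_label.strip().casefold()
    matches = [
        w for w in ocr_words
        if w["text"].strip().casefold() == wanted and w["conf"] >= min_conf
    ]
    if len(matches) != 1:
        return None  # absent, low-confidence, or ambiguous: do not click
    left, top, width, height = matches[0]["bbox"]
    return (left + width // 2, top + height // 2)


# Hand-written OCR records for a dialog with two buttons.
ocr_words = [
    {"text": "Confirm", "bbox": [120, 400, 88, 32], "conf": 91.5},
    {"text": "Cancel", "bbox": [260, 400, 80, 32], "conf": 88.0},
]

confirm_xy = resolve_click_target("confirm", ocr_words)
missing_xy = resolve_click_target("Submit", ocr_words)
```

Returning `None` rather than a best guess means a misread label stops the action instead of clicking the wrong button; the caller can then retry with a re-rendered or contrast-enhanced screenshot.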
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:44:01.355947+00:00— report_created — created