Report #48179
[frontier] Vision-extracted text mismatches DOM textContent because of ligatures, ZWJ sequences, or Unicode normalization, causing API call failures
Implement cross-modal grounding verification: when extracting text from screenshots, always diff it against the DOM textContent using normalized Levenshtein distance. If the distance exceeds the threshold (0.15), prefer the DOM text and flag the rendering discrepancy for the reasoning layer.
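A minimal sketch of the check described above, in Python. The names `levenshtein`, `normalized_distance`, `ground_text`, and `GroundingResult` are hypothetical and not part of the report. Two things are assumptions here: "normalized" is taken to mean the edit distance divided by the length of the longer string, and the vision text is returned when the distance is at or below the threshold (the report only specifies the behaviour above it).

```python
import unicodedata
from typing import NamedTuple


class GroundingResult(NamedTuple):
    text: str        # text the agent should use downstream
    distance: float  # normalized edit distance in [0, 1]
    flagged: bool    # True when the discrepancy should be reported


def levenshtein(a: str, b: str) -> int:
    """Edit distance over code points, using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Assumption: normalize by the length of the longer string."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


def ground_text(vision_text: str, dom_text: str,
                threshold: float = 0.15, form: str = "NFC") -> GroundingResult:
    """Compare both extractions after Unicode normalization."""
    v = unicodedata.normalize(form, vision_text)
    d = unicodedata.normalize(form, dom_text)
    distance = normalized_distance(v, d)
    flagged = distance > threshold
    # Above the threshold, trust the DOM; otherwise keep the vision text
    # (the fallback below the threshold is an assumption of this sketch).
    return GroundingResult(dom_text if flagged else vision_text, distance, flagged)


result = ground_text("Total: 41", "Total: 47 USD")
print(result.flagged, result.text)
```

Because the distance is relative to string length, a single confused character in a long string (for example a hyphen read in place of an en dash) can stay below 0.15. Callers that need exact payloads may want a stricter threshold or an exact-match check on short fields.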
Journey Context:
Vision models may transcribe a rendered 'fi' ligature as the single code point U+FB01, miss zero-width joiners (U+200D) in emoji sequences (yielding 👨👩👧👦 as four separate emoji instead of one family emoji), and confuse visually similar characters (hyphen U+002D vs. en dash U+2013). When the agent uses this text to fill forms or construct API payloads, validation fails or the wrong records are matched. The DOM's textContent property contains the Unicode string as the developer wrote it. The pattern: extract text via both paths (vision OCR + DOM query), apply the same Unicode normalization to both, and compute the edit distance. NFC alone leaves the U+FB01 ligature intact, whereas NFKC folds it to "fi"; neither form removes zero-width joiners or maps an en dash to a hyphen. If the two strings diverge significantly, trust the DOM (semantic truth) over vision (presentation layer) for data extraction tasks, but treat the discrepancy as a signal that the visual styling might be misleading.
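The three failure modes respond differently to Unicode normalization, which the following Python sketch illustrates with the standard `unicodedata` module. The helper `fold_for_comparison` and its mapping table are hypothetical. They are meant only for deciding whether two extractions differ in encoding rather than in content, not for producing the text that goes into a payload.

```python
import unicodedata

ZWJ = "\u200d"
ligature_word = "\ufb01le"  # "file" spelled with the U+FB01 ligature
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])
date_range = "2019\u20132020"  # en dash between the years

# NFC keeps the ligature code point; NFKC folds it to "f" + "i".
print(unicodedata.normalize("NFC", ligature_word) == ligature_word)
print(unicodedata.normalize("NFKC", ligature_word) == "file")

# Neither form strips zero-width joiners or rewrites the en dash.
print(ZWJ in unicodedata.normalize("NFKC", family))
print("\u2013" in unicodedata.normalize("NFKC", date_range))

# Hypothetical comparison-only folding: dash variants map to "-",
# and zero-width joiners are dropped.
_FOLD_TABLE = {
    0x2010: "-",   # hyphen (also what NFKC produces from U+2011)
    0x2013: "-",   # en dash
    0x2014: "-",   # em dash
    0x2212: "-",   # minus sign
    0x200D: None,  # zero-width joiner
}


def fold_for_comparison(text: str) -> str:
    """NFKC plus an explicit confusables table (an assumption of this sketch)."""
    return unicodedata.normalize("NFKC", text).translate(_FOLD_TABLE)


print(fold_for_comparison(date_range))
```

If the vision and DOM strings match after this kind of folding, the mismatch is likely a transcription or encoding artifact and the DOM string can be used as-is. If they still differ, the content itself diverges and the discrepancy is worth flagging.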
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T11:21:01.732619+00:00— report_created — created