Report #93127
[frontier] Vision agent outputs invalid JSON, hallucinated coordinates, or malformed tool calls when generating actions from screenshots
Separate perception from action: use the vision model to describe UI state in text, then use a text model with constrained decoding (JSON schema) to generate the structured tool call. Never allow end-to-end vision-to-JSON generation for critical actions.
Journey Context:
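End-to-end vision-to-action models hallucinate coordinates outside the screen bounds or emit malformed output ('click: 123, 456' instead of {"x": 123, "y": 456}). The hard-won insight is separation of concerns: vision handles "what is on screen", structured generation handles "valid action grammar". In this pattern the vision model acts as a perceptual preprocessor, and a text model with constrained decoding (masking the logits of tokens that would violate the schema) produces the tool call, so the output always parses and matches the tool schema. Schema conformance does not by itself make a coordinate correct, so range checks against the screen size still belong in the schema or in a validation step. The report credits this pattern with eliminating about 80% of parsing errors and with enforcing type safety in multimodal tool use.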
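A minimal sketch of the two-stage pattern described above, not the reporter's implementation. The names `describe_screen`, `generate_action`, `validate_click`, and `plan_click` are hypothetical, the 1920x1080 screen size is an assumption, and the two model calls are passed in as plain callables because the report names no specific model or library. The validator is the safety net that rejects both failure modes from the report: unparseable output and out-of-bounds coordinates.

```python
import json
from typing import Callable

# Assumed screen size; replace with the real screenshot dimensions.
SCREEN_W, SCREEN_H = 1920, 1080


def validate_click(raw: str, width: int = SCREEN_W, height: int = SCREEN_H) -> dict:
    """Parse a click action and check its shape, types, and bounds.

    Raises ValueError if the text is not a JSON object of the form
    {"x": int, "y": int} with both coordinates inside the screen.
    """
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(action, dict) or set(action) != {"x", "y"}:
        raise ValueError("expected an object with exactly the keys 'x' and 'y'")
    x, y = action["x"], action["y"]
    for value in (x, y):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError("coordinates must be integers")
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError(f"coordinates ({x}, {y}) fall outside {width}x{height}")
    return action


def plan_click(
    screenshot: bytes,
    describe_screen: Callable[[bytes], str],   # hypothetical vision-model call
    generate_action: Callable[[str], str],     # hypothetical constrained text-model call
) -> dict:
    """Stage 1: perception as text. Stage 2: structured action. Then validate."""
    description = describe_screen(screenshot)
    raw_action = generate_action(description)
    return validate_click(raw_action)
```

With stub callables standing in for the models, `plan_click(b"", lambda s: "Submit button at center", lambda d: '{"x": 960, "y": 540}')` returns the parsed action, while `validate_click('click: 123, 456')` raises `ValueError`. Keeping the validator separate from the model calls means it can still gate execution when the decoder in use cannot enforce numeric ranges.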
Lifecycle
2026-06-22T14:54:01.214667+00:00 - report_created - created