Report #59171
[frontier] Generic prompting like 'click the button' fails with VLMs because each model expects its own grounding syntax: Qwen2-VL expects bracketed box tags, GPT-4o handles natural-language spatial descriptions and coordinates, and Gemini uses a different format of its own.
Use model-specific grounding prompts: for Qwen2-VL use '[[x1,y1,x2,y2]]' tags; for GPT-4o use an explicit 'at coordinates (x,y)' or a description such as 'center of the red button'. Detect the backend and format the prompt accordingly.
Journey Context:
Different VLMs were trained on different data for spatial grounding. Qwen2-VL was explicitly trained with special tokens. GPT-4o handles natural-language spatial descriptions but ignores special tokens. Gemini has its own coordinate format. Using the wrong format causes the model to hallucinate locations or ignore spatial constraints. The fix is a backend detection layer that formats the prompt per model: for Qwen2-VL, 'Click [[100,200,300,400]]'; for GPT-4o, 'Click the button located at coordinates (0.5, 0.6)'. This is reported to improve grounding accuracy substantially when switching backends.
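A minimal sketch of such a detection layer, using only the two prompt formats quoted above. The names `Box`, `detect_backend`, and `format_click_prompt` are hypothetical, backend detection by substring match on the model name is an assumption, and so is reading the GPT-4o coordinates as the box center expressed as a fraction of image width and height. Gemini is not handled because the report does not give its format; unknown backends fall back to the generic prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Pixel-space bounding box (hypothetical helper type)."""
    x1: int
    y1: int
    x2: int
    y2: int


def detect_backend(model_name: str) -> str:
    """Map a model identifier to a backend family by substring match (assumption)."""
    name = model_name.lower()
    if "qwen2-vl" in name:
        return "qwen2-vl"
    if "gpt-4o" in name:
        return "gpt-4o"
    return "generic"


def format_click_prompt(model_name: str, box: Box, width: int, height: int) -> str:
    """Build a click instruction in the grounding syntax the backend expects."""
    backend = detect_backend(model_name)
    if backend == "qwen2-vl":
        # Bracketed box tag, as quoted in the report.
        return f"Click [[{box.x1},{box.y1},{box.x2},{box.y2}]]"
    if backend == "gpt-4o":
        # Assumption: coordinates are the box center normalized to [0, 1].
        cx = (box.x1 + box.x2) / 2 / width
        cy = (box.y1 + box.y2) / 2 / height
        return f"Click the button located at coordinates ({cx:.2f}, {cy:.2f})"
    # Format not specified in the report (e.g. Gemini): generic fallback.
    return "Click the button"
```

For example, `format_click_prompt("gpt-4o", Box(100, 200, 300, 400), 400, 500)` produces the natural-language form, while a model name containing `qwen2-vl` produces the bracketed tag. Check each model's current documentation for the exact syntax before relying on either format.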
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:48:25.380928+00:00— report_created — created