Report #75270
[frontier] Computer-use agents consume screenshots or DOM dumps, incurring high multimodal latency and missing semantic UI state like validation errors or loading states
Expose application UI state as MCP Resources, with JSON Schema describing component trees, form values, validation states, and async operation status, so agents can read structured state (and change it through MCP Tools) at an estimated 10x lower token cost than screenshots
Journey Context:
Agents that control browsers typically rely on screenshots (expensive, slow) or accessibility trees (noisy, large). The frontier pattern exposes application state directly via MCP. The frontend (web or desktop) runs an MCP server that exposes Resources: `app://window/1/form/payment` returns JSON such as `{"fields": {"card": {"value": "4111", "validation": "invalid", "error": "Expired"}}, "buttons": {"submit": {"enabled": false, "loading": false}}}`. The agent reads this structured state (cheap, fast) instead of parsing pixels. For actions, the agent calls Tools such as `update_field` or `click_button` with component IDs, not mouse coordinates. This requires an MCP bridge in the frontend framework (React/Vue/Electron) that serializes component state to JSON described by a JSON Schema and routes tool calls to component methods. Benefits: deterministic interaction (no CV errors), low latency (no multimodal encoding), and accessibility (semantic structure). Contrast with Playwright MCP, which drives the application through browser automation; this is application-native state exposure.
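
The read-state and route-tool-call flow described above can be sketched as follows. This is a framework-neutral illustration in Python, not an implementation against a real MCP SDK: `UiStateBridge`, `register_form`, `read_resource`, `call_tool`, and the 16-digit card rule are all hypothetical names and assumptions introduced for the example. Only the resource URI, the state shape, and the tool names `update_field` and `click_button` come from the report.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Callable, Dict, Optional

# A validator returns an error message, or None when the value is acceptable.
Validator = Callable[[str], Optional[str]]


@dataclass
class FieldState:
    value: str = ""
    validation: str = "valid"  # "valid" or "invalid"
    error: Optional[str] = None


@dataclass
class ButtonState:
    enabled: bool = True
    loading: bool = False


@dataclass
class FormState:
    fields: Dict[str, FieldState] = field(default_factory=dict)
    buttons: Dict[str, ButtonState] = field(default_factory=dict)


class UiStateBridge:
    """Hypothetical bridge: resource URIs map to form state, tool names to handlers."""

    def __init__(self) -> None:
        self._forms: Dict[str, FormState] = {}
        self._validators: Dict[str, Dict[str, Validator]] = {}

    def register_form(self, uri: str, form: FormState,
                      validators: Dict[str, Validator]) -> None:
        self._forms[uri] = form
        self._validators[uri] = validators

    def read_resource(self, uri: str) -> str:
        # Serialize the current component state to JSON for the agent.
        return json.dumps(asdict(self._forms[uri]))

    def call_tool(self, name: str, args: dict) -> dict:
        handlers = {"update_field": self._update_field,
                    "click_button": self._click_button}
        if name not in handlers:
            return {"ok": False, "error": f"unknown tool: {name}"}
        return handlers[name](**args)

    def _update_field(self, uri: str, field_id: str, value: str) -> dict:
        form = self._forms[uri]
        state = form.fields[field_id]
        state.value = value
        validator = self._validators[uri].get(field_id)
        state.error = validator(value) if validator else None
        state.validation = "invalid" if state.error else "valid"
        # Assumption: every button is enabled only when all fields are valid.
        all_valid = all(f.validation == "valid" for f in form.fields.values())
        for button in form.buttons.values():
            button.enabled = all_valid
        return {"ok": True, "validation": state.validation}

    def _click_button(self, uri: str, button_id: str) -> dict:
        button = self._forms[uri].buttons[button_id]
        if not button.enabled:
            return {"ok": False, "error": "button is disabled"}
        button.loading = True  # stands in for an async operation starting
        return {"ok": True}


PAYMENT_URI = "app://window/1/form/payment"


def card_validator(value: str) -> Optional[str]:
    # Illustrative rule only; a real form would apply its own validation.
    ok = value.isdigit() and len(value) == 16
    return None if ok else "Card number must be 16 digits"


bridge = UiStateBridge()
bridge.register_form(
    PAYMENT_URI,
    FormState(
        fields={"card": FieldState(value="4111", validation="invalid", error="Expired")},
        buttons={"submit": ButtonState(enabled=False, loading=False)},
    ),
    validators={"card": card_validator},
)

before = json.loads(bridge.read_resource(PAYMENT_URI))
blocked = bridge.call_tool("click_button", {"uri": PAYMENT_URI, "button_id": "submit"})
bridge.call_tool("update_field",
                 {"uri": PAYMENT_URI, "field_id": "card", "value": "4111111111111111"})
clicked = bridge.call_tool("click_button", {"uri": PAYMENT_URI, "button_id": "submit"})
after = json.loads(bridge.read_resource(PAYMENT_URI))
```

In this sketch the agent never sees pixels or coordinates: it reads the form state, learns the submit button is disabled, fixes the field by ID, and re-reads the state to confirm the button is enabled and loading. In a real bridge the handlers would call into the framework's component methods instead of mutating an in-memory dataclass.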
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:56:24.483153+00:00— report_created — created