Report #54591
[frontier] Unified action space fragmentation between GUI and API modalities
Tool router abstraction with a transpilation layer: implement an environment interface that detects the target modality (screenshot vs. API schema) and transpiles high-level intents (click, read, type) into calls on the appropriate backend (pyautogui for GUIs, REST/GraphQL for APIs), so that a single body of agent logic can operate across both web GUI and headless API backends.
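A minimal sketch of the transpilation step, assuming intents are plain data and each backend call is described rather than executed. The Intent class, the transpile function, and both lookup tables (including the screen coordinates) are hypothetical names introduced for illustration. The pyautogui entries only name the functions (click, write, screenshot) that a GUI driver would invoke; nothing here imports or calls pyautogui.

```python
from dataclasses import dataclass
from typing import Literal

Modality = Literal["gui", "api"]


@dataclass(frozen=True)
class Intent:
    """A backend-agnostic intent (hypothetical shape, not from the report)."""
    verb: Literal["click", "read", "type"]
    target: str      # logical element name, e.g. "submit_booking"
    text: str = ""   # payload, only used by "type"


# Hypothetical lookup tables: in practice these would come from screenshot
# analysis (GUI) or from an API schema (REST/GraphQL).
GUI_TARGETS = {"submit_booking": (640, 480)}
API_TARGETS = {"submit_booking": ("POST", "/bookings")}


def transpile(intent: Intent, modality: Modality) -> dict:
    """Translate one intent into a description of a backend call."""
    if modality == "gui":
        x, y = GUI_TARGETS[intent.target]
        if intent.verb == "click":
            return {"backend": "pyautogui", "call": "click", "args": {"x": x, "y": y}}
        if intent.verb == "type":
            return {"backend": "pyautogui", "call": "write", "args": {"message": intent.text}}
        # "read" on a GUI falls back to capturing the screen.
        return {"backend": "pyautogui", "call": "screenshot", "args": {}}

    method, path = API_TARGETS[intent.target]
    if intent.verb == "read":
        return {"backend": "rest", "method": "GET", "path": path}
    body = {"value": intent.text} if intent.verb == "type" else {}
    return {"backend": "rest", "method": method, "path": path, "body": body}


# The same intent yields a different plan per modality.
click_submit = Intent("click", "submit_booking")
gui_plan = transpile(click_submit, "gui")
api_plan = transpile(click_submit, "api")
```

Keeping the output as plain dictionaries means the agent logic can be tested without a display or a network connection; a separate executor would turn each plan into a real call.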
Journey Context:
The current agent landscape is bifurcated: Claude Computer Use and GPT-4V rely on pixel-based GUI interaction (slow, brittle to UI changes), while traditional agents use structured APIs (fast, but unusable where a system exposes no API). Enterprise workflows require both: legacy systems that offer only web interfaces, and modern systems with APIs. Building a separate agent per modality creates a maintenance burden and blocks cross-system workflows (e.g., 'pull data from legacy GUI, process via modern API'). The tool router pattern defines an abstract 'Environment Interface' with methods such as get_state() and execute_action(action_schema). It has two concrete implementations: VisualEnvironment (connects to the browser via CDP and applies SoM/Set-of-Marks prompting to screenshots) and APIEnvironment (validates requests against an OpenAPI spec and sends them with an httpx client). The router inspects URL patterns or schema availability to instantiate the correct driver. High-level agent logic ('book flight') then transpiles to either 'click submit button' (visual) or 'POST /bookings' (API). Critical for 2026: MCP (Model Context Protocol) is standardizing these abstractions.
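The interface and router described above can be sketched as follows. Environment, VisualEnvironment, APIEnvironment, get_state() and execute_action(action_schema) are the names used in this report; the route function, the action_schema keys, and the stubbed method bodies are assumptions made for illustration. A real VisualEnvironment would drive a browser over CDP and a real APIEnvironment would send requests with httpx; here both only record what they would do, and routing uses schema availability alone.

```python
from abc import ABC, abstractmethod
from typing import Optional


class Environment(ABC):
    """Abstract 'Environment Interface' shared by all backends."""

    @abstractmethod
    def get_state(self) -> dict: ...

    @abstractmethod
    def execute_action(self, action_schema: dict) -> dict: ...


class VisualEnvironment(Environment):
    """Stub for the GUI backend (no browser or CDP connection is opened)."""

    def __init__(self, url: str) -> None:
        self.url = url

    def get_state(self) -> dict:
        # A real driver would return a screenshot plus Set-of-Marks labels.
        return {"modality": "visual", "url": self.url, "marks": []}

    def execute_action(self, action_schema: dict) -> dict:
        return {"ok": True, "performed": f"click mark '{action_schema['target']}'"}


class APIEnvironment(Environment):
    """Stub for the API backend (no HTTP request is sent)."""

    def __init__(self, url: str, spec: dict) -> None:
        self.url = url
        self.spec = spec

    def get_state(self) -> dict:
        return {"modality": "api", "url": self.url, "paths": sorted(self.spec["paths"])}

    def execute_action(self, action_schema: dict) -> dict:
        method = action_schema["method"].lower()
        path = action_schema["path"]
        # Very loose stand-in for OpenAPI validation: the operation must be declared.
        if method not in self.spec["paths"].get(path, {}):
            raise ValueError(f"{method.upper()} {path} is not declared in the spec")
        return {"ok": True, "performed": f"{method.upper()} {path}"}


def route(url: str, openapi_spec: Optional[dict] = None) -> Environment:
    """Pick a driver: use the API when a usable schema exists, else fall back to the GUI."""
    if openapi_spec and openapi_spec.get("paths"):
        return APIEnvironment(url, openapi_spec)
    return VisualEnvironment(url)


# Hypothetical targets: one system publishes a schema, the other does not.
modern = route("https://api.example.test", {"paths": {"/bookings": {"post": {}}}})
legacy = route("https://legacy.example.test")
api_result = modern.execute_action({"method": "POST", "path": "/bookings"})
gui_result = legacy.execute_action({"target": "submit button"})
```

Because both drivers share one interface, the agent logic only ever calls get_state() and execute_action(); which backend answers is decided once, at routing time.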
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:07:39.607875+00:00— report_created — created