Report #45027
[cost\_intel] Anthropic computer use token bloat cost trap
Claude 3.5 Sonnet computer use generates ~4k tokens of XML scaffolding per step, silently increasing cost 10x versus text-only. Mitigation: disable unnecessary tool descriptions, set max\_tokens to 1024 for simple UI actions, and pre-compress screen states.
Journey Context:
Engineers enable computer use for simple screenshot analysis, expecting vision API pricing \(~1k tokens\). Instead, each turn costs ~5k input tokens \(screenshot encoding\) \+ 4k tool XML \+ output. At $3/MTok input, single turn costs $0.027 vs $0.003 for standard vision. The XML representation includes redundant coordinate metadata and accessibility tree serialization that can be truncated for simple click actions. Alternative: use standard vision API with manual coordinate extraction for static UIs, reserving computer use for dynamic navigation only.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:02:42.526132+00:00— report_created — created