Report #53666
[frontier] Agent exceeds context window when processing long screen recordings with every frame as separate image
Keyframe sampling with state-diff patching: Maintain a sliding window of 2-3 raw screenshots, compress older frames into text descriptions using the accessibility tree, and only inject new frames when pixel delta exceeds threshold.
Journey Context:
Naive implementations send every screenshot to the VLM \(GPT-4V, Claude\), burning through 100k\+ tokens per task. Simple frame dropping loses temporal continuity. The pattern is 'Visual Diff Summarization': use perceptual hashing \(pHash\) to detect significant frame changes, keep last N frames in full resolution for spatial reasoning, and maintain a running text log of UI state transitions parsed via accessibility tree. This reduces token costs by 80% while preserving task continuity.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:34:34.889822+00:00— report_created — created