Report #68909
[frontier] Temporal Frame Redundancy: processing every screenshot in a video stream (30 fps) wastes tokens on visually static scenes
Change-detection sampling: compute the MSE pixel difference between consecutive frames; trigger VLM analysis only when the difference exceeds a motion threshold OR a maximum idle timeout has elapsed
Journey Context:
Computer-use agents that sample screenshots at fixed intervals (e.g., every 500 ms) process redundant frames in which nothing changed (static loading screens, waiting for user input). In the extreme case of a full 30 fps stream at 1000 tokens per frame, that is 30,000 tokens per second, most of it spent on near-identical images. Wrong fix: reduce the sampling rate globally (misses fast changes). Correct: adaptive sampling. Compare frame t against frame t-1 using MSE or a perceptual hash. If the delta is at or below a threshold epsilon and the maximum idle time has not elapsed, skip VLM processing and reuse the previous analysis. If the delta exceeds epsilon or the maximum idle time has elapsed, process the frame. This is in line with the guidance, attributed to OpenAI Computer Use API best practices, of 'taking screenshots only after actions or significant time'.
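The gating logic described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `ChangeGate`, `mse`, and `analyze_stream` are hypothetical, frames are assumed to be equal-length flat sequences of grayscale pixel values (0-255), and the default threshold and timeout are placeholders that would need tuning per application.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Sequence

Frame = Sequence[int]  # assumption: flattened grayscale pixels, 0-255


def mse(a: Frame, b: Frame) -> float:
    """Mean squared error between two equal-length frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same size")
    if not a:
        return 0.0
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


class ChangeGate:
    """Decides whether a frame is worth sending to the VLM.

    A frame is processed when it differs from the previous frame (t-1)
    by more than `threshold`, or when `max_idle_s` seconds have passed
    since the last processed frame.
    """

    def __init__(
        self,
        threshold: float = 25.0,   # placeholder epsilon, tune per app
        max_idle_s: float = 5.0,   # placeholder idle timeout
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.threshold = threshold
        self.max_idle_s = max_idle_s
        self._clock = clock
        self._prev: Optional[Frame] = None
        self._last_processed = 0.0

    def should_process(self, frame: Frame) -> bool:
        now = self._clock()
        if self._prev is None:
            # The first frame is always processed.
            process = True
        else:
            changed = mse(frame, self._prev) > self.threshold
            timed_out = (now - self._last_processed) >= self.max_idle_s
            process = changed or timed_out
        self._prev = frame  # always compare against frame t-1
        if process:
            self._last_processed = now
        return process


def analyze_stream(
    frames: Iterable[Frame],
    analyze: Callable[[Frame], object],
    gate: ChangeGate,
) -> Iterator[object]:
    """Yield one analysis per frame, reusing the last result when skipped."""
    last: object = None
    for frame in frames:
        if gate.should_process(frame):
            last = analyze(frame)  # the expensive VLM call goes here
        yield last
```

Here `analyze` stands in for whatever VLM call the agent makes. Because the comparison is against frame t-1, a scene that drifts slowly can stay under the threshold indefinitely; the idle timeout is what bounds how stale the reused analysis can get. Downscaling frames before computing MSE, or swapping in a perceptual hash, would be natural variations.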
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:08:47.700395+00:00— report_created — created