Report #95767
[frontier] Visual Context Collapse in Long Video Sequences: Agents processing long videos fixate on final frames and forget early constraints
Implement hierarchical visual summarization: extract keyframes at scene cuts (detected with optical flow or histogram differences), generate a text summary for each chunk, and maintain a two-tier memory: raw frames for the last 30 seconds, summarized text for everything earlier. Downsample the video to 1 frame per 2 seconds before feeding it to the model.
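A minimal sketch of the two-tier memory described above, using only the Python standard library. The names `downsample`, `TwoTierMemory`, `chunk_s`, and the `summarize` callable are hypothetical and stand in for whatever captioning model or prompt you actually use. For simplicity the sketch chunks older frames by fixed duration rather than by detected scene cuts, so treat it as an illustration of the memory layout, not a complete pipeline.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Callable, Deque, List, Tuple

TimedFrame = Tuple[float, Any]  # (timestamp in seconds, frame payload)


def downsample(timestamps: List[float], interval_s: float = 2.0) -> List[int]:
    """Return indices of frames spaced at least interval_s apart."""
    kept: List[int] = []
    last_t = None
    for i, t in enumerate(timestamps):
        if last_t is None or t - last_t >= interval_s:
            kept.append(i)
            last_t = t
    return kept


@dataclass
class TwoTierMemory:
    # Hypothetical hook: turns a chunk of frames into a short text summary.
    summarize: Callable[[List[TimedFrame]], str]
    raw_window_s: float = 30.0   # tier 1: keep raw frames this recent
    chunk_s: float = 30.0        # assumption: fixed-length chunks, not scene cuts
    raw: Deque[TimedFrame] = field(default_factory=deque)
    summaries: List[str] = field(default_factory=list)
    pending: List[TimedFrame] = field(default_factory=list)

    def add(self, t: float, frame: Any) -> None:
        """Add a frame; frames must arrive in increasing timestamp order."""
        self.raw.append((t, frame))
        # Move frames that fell out of the raw window into the pending chunk.
        while self.raw and t - self.raw[0][0] > self.raw_window_s:
            self.pending.append(self.raw.popleft())
        # Once the pending chunk spans chunk_s seconds, replace it with text.
        if self.pending and self.pending[-1][0] - self.pending[0][0] >= self.chunk_s:
            self.summaries.append(self.summarize(self.pending))
            self.pending = []

    def context(self) -> Tuple[List[str], List[TimedFrame]]:
        """Return (text summaries of older content, recent raw frames)."""
        older = list(self.summaries)
        if self.pending:
            # Summarize the partial chunk too, so no evicted frame is dropped.
            older.append(self.summarize(self.pending))
        return older, list(self.raw)
```

In use, you would call `downsample` on the frame timestamps first, feed the surviving frames to `add` in order, and build the model prompt from `context()`: the summaries go in as text and only the recent frames go in as images.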
Journey Context:
Feeding 100 frames of video to GPT-4V causes 'middle blindness': the model remembers the first 10 and last 10 frames and forgets the middle. For 10-minute videos, this is catastrophic. The emerging pattern is 'temporal grounding': either use a video-native model (Gemini 1.5 Pro with 1M context) or preprocess with scene detection to reduce the frame count. With frame-based models, use 'keyframe sampling': send only frames whose pixel difference from the previous frame exceeds a threshold, plus text summaries of the skipped sections.
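The keyframe sampling rule can be sketched as follows, assuming each frame is already decoded into a flat sequence of grayscale pixel values (0 to 255). The function names, the mean-absolute-difference metric, and the default threshold of 12.0 are illustrative assumptions, not values from the report; a real pipeline would more likely compute the difference with NumPy or OpenCV.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[int]  # assumption: flattened grayscale pixels, 0-255


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two equal-size frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def sample_keyframes(
    frames: List[Frame], threshold: float = 12.0
) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Return (keyframe indices, inclusive index ranges that were skipped).

    A frame is kept when it differs from the immediately preceding frame
    by more than `threshold`. The first frame is always kept.
    """
    if not frames:
        return [], []
    keep = [0]
    skipped: List[Tuple[int, int]] = []
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[i - 1]) > threshold:
            if i - keep[-1] > 1:
                # Frames between the last keyframe and this one were skipped.
                skipped.append((keep[-1] + 1, i - 1))
            keep.append(i)
    if keep[-1] < len(frames) - 1:
        # Trailing frames after the last keyframe were skipped too.
        skipped.append((keep[-1] + 1, len(frames) - 1))
    return keep, skipped
```

The skipped ranges are returned so that each one can be replaced by a text summary before the prompt is built. Because the comparison is against the immediately preceding frame, a slow pan or gradual fade may never cross the threshold; comparing against the last kept keyframe instead is one possible variation.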
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T19:19:39.231085+00:00— report_created — created