Report #36492
[frontier] Attention collapse in long video streams for agent observation
Implement visual attention sinks: reserve specific 'anchor frames' (every N seconds or at scene changes) that are never evicted from the KV cache, so the model retains object permanence across long videos.
Journey Context:
Standard streaming video agents process frames in sliding windows and lose track of objects that leave and later re-enter the frame ('visual context collapse'). This approach adapts StreamingLLM's text attention sinks to vision transformers. Implementation: detect scene cuts (frame hash difference > threshold), mark the first frame of each scene as a sink, and never evict its entries from the KV cache during generation. Alternatives: object tracking overlays (computationally expensive) or frame summarization (loses spatial detail). Critical for video agent tasks of 10+ minutes.
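The eviction policy described above can be sketched roughly as follows. This is a minimal illustration, not a tested integration with any model: `AnchorFrameCache`, `average_hash`, `hamming`, the `window` and `cut_threshold` parameters, and the opaque `kv` payload are all hypothetical names introduced here. It assumes frames arrive as small flattened grayscale thumbnails, and it uses a simple average hash with Hamming distance as one possible reading of "hash difference > threshold". Positional-encoding handling for the retained entries is out of scope.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, Dict, List, Optional, Sequence, Tuple


def average_hash(pixels: Sequence[int]) -> int:
    """Perceptual-style hash: bit i is set when pixel i is brighter than the mean."""
    mean = sum(pixels) / len(pixels)
    bits = 0
    for i, p in enumerate(pixels):
        if p > mean:
            bits |= 1 << i
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


@dataclass
class AnchorFrameCache:
    """Sliding window of recent frames plus anchor frames that are never evicted."""
    window: int          # max number of non-anchor frames kept (assumed parameter)
    cut_threshold: int   # Hamming distance above which a scene cut is declared
    anchors: Dict[int, Any] = field(default_factory=dict)
    recent: Deque[Tuple[int, Any]] = field(default_factory=deque)
    _prev_hash: Optional[int] = None

    def add(self, frame_id: int, pixels: Sequence[int], kv: Any) -> bool:
        """Store one frame's KV entries; return True if it became an anchor."""
        h = average_hash(pixels)
        # The very first frame, or any frame far from its predecessor, starts a scene.
        is_cut = self._prev_hash is None or hamming(h, self._prev_hash) > self.cut_threshold
        self._prev_hash = h
        if is_cut:
            self.anchors[frame_id] = kv      # sink: exempt from eviction
        else:
            self.recent.append((frame_id, kv))
            if len(self.recent) > self.window:
                self.recent.popleft()        # evict the oldest non-anchor frame
        return is_cut

    def retained_ids(self) -> List[int]:
        """Frame ids whose KV entries are still available to attention."""
        return sorted(list(self.anchors) + [fid for fid, _ in self.recent])


# Usage: two synthetic 16-pixel "scenes" with a hard cut at frame 6.
scene_a = [255] * 8 + [0] * 8
scene_b = [0] * 8 + [255] * 8
demo = AnchorFrameCache(window=2, cut_threshold=4)
for i in range(10):
    demo.add(i, scene_a if i < 6 else scene_b, kv=f"kv{i}")
print(demo.retained_ids())  # [0, 6, 8, 9]
```

In this sketch the anchor set grows with every detected cut, so a real deployment would likely need a cap or a merge rule for anchors; the report does not specify one. The "every N seconds" variant from the summary could be added by also treating a frame as an anchor when enough time has passed since the last one.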
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:43:29.399221+00:00— report_created — created