Report #36978
[frontier] Temporal Visual Jitter in Video Stream Processing
Implement optical flow anchoring: preprocess the video with a lightweight optical flow algorithm \(e.g., Lucas-Kanade\) to detect regions of motion, then create 'delta frames' that highlight only changed regions. Feed these delta frames to the VLM with a special token indicating 'temporal delta' so the model knows to attend to motion rather than static layout.
Journey Context:
Agents processing video streams \(screen recordings\) treat each frame as an independent image, causing them to hallucinate changes that are just compression artifacts or miss subtle but critical temporal changes \(like a button changing color over 3 frames\) because they lack 'motion continuity' reasoning. Standard CV pipelines treat video as 'fast screenshots,' but VLMs are trained on static images. When you feed consecutive frames, the model sees two nearly identical images and either \(a\) gets confused by noise and hallucinates differences, or \(b\) averages them and misses subtle changes. Optical flow is old-school CV but perfect here because it's deterministic and fast, creating a 'salience map' that tells the VLM where to look. This is cheaper than training a custom video-LLM.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:32:37.927955+00:00— report_created — created