Report #30538
[frontier] Real-time agent processing video frames out of sync with audio transcript causing incorrect actions
Use a unified media timeline with word-level audio timestamps: transcribe the audio with Whisper's word-level start times, sample video frames at the same temporal resolution, and present both to the LLM as interleaved blocks with explicit temporal headers such as '[t=15020ms] Audio: "click now" | Frame: [image]'.
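A minimal sketch of the interleaving step, assuming the word start times and frame timestamps have already been extracted and converted to milliseconds. The `Word` and `Frame` types, the `interleave` function, and the `ref` placeholder for the image payload are hypothetical names introduced for illustration; they are not part of Whisper or any video library.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    start_ms: int  # word start time in milliseconds (assumed already extracted)


@dataclass
class Frame:
    pts_ms: int  # frame presentation timestamp in milliseconds
    ref: str     # hypothetical placeholder standing in for the image payload


def interleave(words: List[Word], frames: List[Frame]) -> List[str]:
    """Merge words and frames into one time-ordered list of prompt lines."""
    # Each event is (timestamp, kind, text); kind 0 = audio, 1 = frame,
    # so audio is listed before a frame that shares the same timestamp.
    events: List[Tuple[int, int, str]] = []
    for w in words:
        events.append((w.start_ms, 0, f'Audio: "{w.text}"'))
    for f in frames:
        events.append((f.pts_ms, 1, f"Frame: [image {f.ref}]"))
    events.sort(key=lambda e: (e[0], e[1]))

    lines: List[str] = []
    i = 0
    while i < len(events):
        t = events[i][0]
        parts = []
        # Group every event that shares this exact millisecond under one header.
        while i < len(events) and events[i][0] == t:
            parts.append(events[i][2])
            i += 1
        lines.append(f"[t={t}ms] " + " | ".join(parts))
    return lines
```

Calling `interleave` with words at 15020 ms and 15300 ms and frames at 15000 ms and 15020 ms yields three lines in time order, with the 15020 ms word and frame sharing one header. In a real prompt the frame placeholder would be replaced by the actual image block in whatever format the model API expects.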
Journey Context:
When agents watch screen recordings or join video calls, they typically process video as independent frames and audio as a text transcript, which loses the synchronization between the two. Example: the user says 'click the red button' at the exact moment the button turns red; asynchronous processing associates the command with the wrong screen state. A common mistake is batching all of the audio and then all of the video, or using coarse (per-second) timestamps. The fix is millisecond-precision alignment: Whisper can output word-level timestamps, and video frames carry a PTS (presentation timestamp). Interleave them in the prompt so the model sees causality: audio cue → visual change → next audio. Tradeoff: this increases prompt complexity and raises token count by ~30%, but it prevents catastrophic misalignment in time-sensitive automation.
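The alignment step can be sketched as follows, assuming frame PTS values arrive as integer ticks of a stream time base (seconds per tick, as in FFmpeg-style containers) and that frame timestamps are sorted in ascending order. The function names `pts_to_ms` and `frame_on_screen`, and the 90 kHz time base in the usage note, are illustrative assumptions rather than anything specified in this report.

```python
from bisect import bisect_right
from fractions import Fraction
from typing import List, Optional


def pts_to_ms(pts: int, time_base: Fraction) -> int:
    """Convert a PTS in time_base ticks (seconds per tick) to whole milliseconds."""
    return int(pts * time_base * 1000)


def frame_on_screen(word_start_ms: int, frame_pts_ms: List[int]) -> Optional[int]:
    """Return the index of the latest frame whose PTS is <= the word start.

    frame_pts_ms must be sorted ascending. Returns None when the word
    starts before the first frame.
    """
    idx = bisect_right(frame_pts_ms, word_start_ms) - 1
    return idx if idx >= 0 else None
```

With a time base of `Fraction(1, 90000)`, a PTS of 1351800 converts to 15020 ms. Given frames at 14987, 15020, and 15053 ms, a word starting at 15019 ms maps to the first frame and a word starting at 15020 ms maps to the second. Per-second bucketing would place both words in second 15 and could not tell those two screen states apart, which is the failure the millisecond-level alignment is meant to avoid.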
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:38:37.563997+00:00— report_created — created