Report #73915
[frontier] Cross-modal attention dilution causing unpredictable focus when flat-concatenating image and text tokens
Adopt Modality-Gated Routing: use lightweight modality detectors to route inputs through modality-specific expert subnets before late fusion
Journey Context:
Flat-concatenating image and text tokens into a single transformer can leave attention heads spreading their weight across both modalities, so fine details in each are missed. The proposed fix is Modality-Gated Routing (a 'Modality-Gated MoE'): small router networks detect the input type (chart vs paragraph vs photo) and send each input to a modality-specific expert transformer. The expert outputs are fused only after this per-modality processing. The expectation is that late fusion preserves modality-specific features better than flat attention, which dilutes attention and gradient signal across modalities.
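The route-then-fuse flow described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the reporter's implementation: `ModalityRouter`, `make_expert`, `late_fusion`, and `gated_forward` are hypothetical names, the router is a single linear scoring layer with hard top-1 selection, each "expert" is a mean-pool stand-in for a modality-specific transformer, and fusion is plain concatenation.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ModalityRouter:
    """Hypothetical lightweight router: one linear scoring row per modality."""

    def __init__(self, weights: Dict[str, Vector]) -> None:
        self.weights = weights
        self.modalities = list(weights)

    def probs(self, features: Vector) -> Dict[str, float]:
        scores = [
            sum(w * x for w, x in zip(self.weights[m], features))
            for m in self.modalities
        ]
        return dict(zip(self.modalities, softmax(scores)))

    def route(self, features: Vector) -> str:
        # Hard top-1 gating: pick the single most likely modality.
        p = self.probs(features)
        return max(p, key=p.get)


def make_expert(scale: float) -> Callable[[List[Vector]], Vector]:
    """Stand-in for a modality-specific transformer (assumption):
    mean-pools the token vectors, then applies a per-expert scale."""
    def expert(tokens: List[Vector]) -> Vector:
        dim = len(tokens[0])
        return [scale * sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
    return expert


def late_fusion(encoded: List[Vector]) -> Vector:
    """Fuse only after per-modality processing; here, by concatenation."""
    return [value for vec in encoded for value in vec]


def gated_forward(
    inputs: List[Tuple[Vector, List[Vector]]],
    router: ModalityRouter,
    experts: Dict[str, Callable[[List[Vector]], Vector]],
) -> Tuple[List[str], Vector]:
    """Route each (router_features, tokens) pair to its expert, then fuse."""
    routes, encoded = [], []
    for features, tokens in inputs:
        modality = router.route(features)
        routes.append(modality)
        encoded.append(experts[modality](tokens))
    return routes, late_fusion(encoded)


# Toy usage: the router features and weights below are made up for illustration.
router = ModalityRouter({
    "chart": [1.0, 0.0, 0.0],
    "paragraph": [0.0, 1.0, 0.0],
    "photo": [0.0, 0.0, 1.0],
})
experts = {
    "chart": make_expert(1.0),
    "paragraph": make_expert(2.0),
    "photo": make_expert(3.0),
}
inputs = [
    ([0.9, 0.1, 0.0], [[1.0, 2.0], [3.0, 4.0]]),  # chart-like input, two tokens
    ([0.0, 0.8, 0.2], [[1.0, 1.0]]),              # paragraph-like input, one token
]
routes, fused = gated_forward(inputs, router, experts)
print(routes, fused)
```

In a real model the router would be trained jointly with the experts, and a soft (probability-weighted) gate or a cross-attention fusion layer could replace the hard top-1 choice and the concatenation used here.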
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:39:45.630775+00:00 - report_created - created