Report #56613
[frontier] Why does adding vision capabilities degrade structured text output in fine-tuned agents?
Use modality-specific LoRA adapters: freeze the base text model weights, train only the vision projector and cross-attention layers through low-rank adapters (r=16), and merge the adapter weights into the model only after convergence on vision tasks, to prevent text capability drift.
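A minimal sketch of the layer selection described above, using only the Python standard library. The module-name patterns (`vision_projector`, `cross_attn`), the toy parameter table, and the helper `plan_training` are hypothetical and only illustrate the idea; real names and shapes depend on the model. It also assumes that adapted layers keep their own base weights frozen and train only the low-rank factors.

```python
from fnmatch import fnmatchcase

# Hypothetical module-name patterns; real names depend on the architecture.
ADAPTER_PATTERNS = ["*vision_projector*", "*cross_attn*"]
LORA_RANK = 16  # r=16, as suggested in the report


def plan_training(param_shapes, patterns=ADAPTER_PATTERNS, rank=LORA_RANK):
    """Decide, per weight matrix, whether to attach a LoRA adapter.

    param_shapes maps a parameter name to its (d_out, d_in) shape.
    Returns (plan, trainable_adapter_params, frozen_base_params).
    """
    plan = {}
    trainable = 0
    frozen = 0
    for name, (d_out, d_in) in param_shapes.items():
        frozen += d_out * d_in  # assumption: every base weight stays frozen
        if any(fnmatchcase(name, p) for p in patterns):
            plan[name] = "lora"
            # Adapter factors: A is (rank x d_in), B is (d_out x rank).
            trainable += rank * (d_in + d_out)
        else:
            plan[name] = "frozen"
    return plan, trainable, frozen


# Toy parameter table with made-up names and sizes.
toy_model = {
    "vision_projector.weight": (4096, 1024),
    "layers.0.cross_attn.q_proj.weight": (4096, 4096),
    "layers.0.self_attn.q_proj.weight": (4096, 4096),
    "layers.0.mlp.up_proj.weight": (11008, 4096),
}

plan, trainable, frozen = plan_training(toy_model)
for name, decision in plan.items():
    print(f"{decision:>6}  {name}")
print(f"trainable adapter params: {trainable}, frozen base params: {frozen}")
```

In this toy table the self-attention and feed-forward matrices receive no adapter at all, so no gradient from vision trajectories can reach them. Printing the plan before training is a cheap way to confirm that the patterns match only the intended layers.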
Journey Context:
When fine-tuning a capable text agent (e.g., a code generator) on multimodal trajectories (screenshots + actions), vision gradients can overwrite the model's carefully tuned text formatting priors (e.g., JSON output structure, indentation). This is 'cross-modal entanglement': the shared transformer parameters suffer catastrophic forgetting when vision and text gradients conflict. Standard full-parameter fine-tuning applies both sets of gradients to the same weights, so the updates interfere destructively. The fix enforces 'modality isolation' during training: LoRA adapters keep vision updates low-rank and confined to specific layers (vision projector, cross-attention), which prevents the text backbone (self-attention, feed-forward) from drifting. This preserves the agent's text capabilities while adding vision.
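The "merge only after convergence" step can be illustrated with the standard LoRA update, where the effective weight is W + (alpha / r) * B * A. The sketch below uses tiny pure-Python matrices; the scaling factor `alpha`, the helper names, and the numbers are assumptions for illustration and do not come from the report. It shows that the frozen base matrix is left untouched until the adapter is explicitly merged.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(B, A, alpha, rank):
    """Low-rank update (alpha / rank) * B @ A."""
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(B, A)]


def merge(W, B, A, alpha, rank):
    """Return a new matrix W + delta; the input W is not modified."""
    delta = lora_delta(B, A, alpha, rank)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


# Frozen 3x3 base weight (toy values).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

# Rank-1 adapter factors, standing in for the trained vision adapter.
rank = 1
alpha = 2.0                      # assumed LoRA scaling hyperparameter
B = [[1.0], [0.0], [0.0]]        # shape (3, rank)
A = [[0.0, 0.5, 0.0]]            # shape (rank, 3)

W_before = [row[:] for row in W]
W_merged = merge(W, B, A, alpha, rank)

print("base unchanged:", W == W_before)
print("merged weight:", W_merged)
```

Because the adapter lives in separate factors until the merge, it can be discarded or retrained if text-format regressions appear, without having touched the original weights.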
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:30:54.647863+00:00 - report_created - created