Report #45415
[frontier] Identity established in text drifts when agent processes images or tool outputs
Implement Cross-Modal Constitutional Projection: embed the text constitution in a CLIP-style joint embedding space, then inject those constraints into the vision encoder and tool-policy layers through cross-attention adapter layers trained with a contrastive loss on (constitutional_text, compliant_multimodal_output) pairs.
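A minimal sketch of the contrastive objective described above, assuming a CLIP-style symmetric InfoNCE loss over a batch in which row i of each list is a matched (constitutional_text, compliant_multimodal_output) pair. The function names, the plain-list embeddings, and the temperature of 0.07 are illustrative assumptions, not part of the report; a real setup would use a tensor library and learned encoders.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (assumes v is not all zeros).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one list of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(text_embs, output_embs, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is one matched pair.

    text_embs:   embeddings of constitutional text (hypothetical encoder output)
    output_embs: embeddings of compliant multimodal outputs (same batch order)
    """
    t = [l2_normalize(v) for v in text_embs]
    o = [l2_normalize(v) for v in output_embs]
    n = len(t)
    # Cosine similarity between every text and every output, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(t[i], o[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Text -> output direction: the matched output sits on the diagonal.
    text_to_out = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Output -> text direction: same computation on the transposed matrix.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    out_to_text = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (text_to_out + out_to_text)

if __name__ == "__main__":
    principles = [[1.0, 0.0], [0.0, 1.0]]
    compliant = [[1.0, 0.0], [0.0, 1.0]]
    mismatched = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_contrastive_loss(principles, compliant))
    print(symmetric_contrastive_loss(principles, mismatched))
```

The loss is low when each constitutional embedding is closest to its own compliant output and high when pairs are mismatched, which is the signal the adapters would be trained on.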
Journey Context:
As agents become multimodal, identity must persist across modalities. Text-based constitutional constraints do not automatically constrain vision-encoder outputs or tool-use embeddings, because each modality has its own latent space. Standard CLIP aligns image-text pairs; what is needed here is a 'Constitutional CLIP' that aligns constitutional principles with multimodal outputs. A shared embedding space between constitutional text and visual/tool representations, combined with cross-attention adapters that carry the constraints into the non-text modalities, is intended to prevent 'modality-specific drift', where the agent follows its constraints in text generation but violates them in image generation or API tool calls.
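The adapter step could look roughly like the sketch below, assuming a single-head cross-attention layer in which hidden states from a non-text modality act as queries and constitutional embeddings act as keys and values, with a scalar gate on the residual. All names (cross_attention_adapter, Wq, Wk, Wv, gate) are hypothetical, and the weight matrices stand in for parameters that would be learned during adapter training.

```python
import math

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def cross_attention_adapter(hidden, constraints, Wq, Wk, Wv, gate):
    """Inject constraint embeddings into hidden states via cross-attention.

    hidden:      token vectors from a vision or tool-policy layer (queries)
    constraints: constitutional embedding vectors (keys and values)
    Wq, Wk, Wv:  projection matrices (learned in practice; Wv must map to
                 the hidden dimension so the residual addition is valid)
    gate:        scalar on the residual; 0.0 leaves the base model unchanged
    """
    keys = [matvec(Wk, c) for c in constraints]
    values = [matvec(Wv, c) for c in constraints]
    scale = math.sqrt(len(keys[0]))
    out = []
    for h in hidden:
        q = matvec(Wq, h)
        # Scaled dot-product attention weights over the constraint vectors.
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        # Weighted sum of values gives the constraint context for this token.
        ctx = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
        # Gated residual connection back into the original hidden state.
        out.append([hi + gate * ci for hi, ci in zip(h, ctx)])
    return out

if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    patch_states = [[1.0, 2.0]]
    principle_embs = [[3.0, 4.0]]
    print(cross_attention_adapter(patch_states, principle_embs,
                                  identity, identity, identity, gate=0.5))
```

Starting the gate at zero is one possible design choice: the adapted model then behaves exactly like the base model until training moves the gate, so the constraint signal is introduced gradually.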
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:42:03.507728+00:00— report_created — created