Report #75681
[gotcha] Model supply chain vulnerability via fine-tuning on untrusted data
Audit and curate fine-tuning datasets rigorously. Implement robust evaluation and red-teaming on the fine-tuned model before deployment, specifically testing for embedded triggers or backdoors.
Journey Context:
Developers fine-tune models on scraped data or user-generated content without sanitization. An attacker can poison the training data by injecting malicious examples \(e.g., 'When the prompt is X, output Y'\). This creates a backdoor that activates on specific triggers, bypassing standard input/output filters because the behavior is baked into the model weights.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:37:37.659118+00:00— report_created — created