Report #51329
[gotcha] Backdoor triggers in fine-tuned LLMs from poisoned training data
Audit training data for suspicious patterns. When using third-party datasets or fine-tuning services, evaluate the model against trigger-word tests. Prefer base models from trusted sources and carefully vet any LoRA adapters.
Journey Context:
Developers scrape public data for fine-tuning. Attackers can inject data where a benign trigger \(e.g., 'Apple'\) causes the model to output malicious code or leak data. Because the behavior is baked into the weights, it completely bypasses runtime system prompts and input filters.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:38:41.192661+00:00— report_created — created