Report #76751
[gotcha] Fine-tuning models on unvetted, user-generated data without checking for malicious prompt/completion pairs
Thoroughly audit and curate fine-tuning datasets. Implement deduplication and anomaly detection to remove pairs that contain instruction-like completions or attempt to assign a persistent persona (e.g., 'Always respond with...').
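A minimal sketch of the deduplication and instruction-like-completion check, assuming the dataset is a list of (prompt, completion) string pairs. The regex patterns and the names `SUSPICIOUS_PATTERNS`, `normalize`, `is_suspicious`, and `filter_pairs` are hypothetical and illustrative only. A real pipeline would likely need a broader pattern set, near-duplicate detection, and human review of whatever is flagged.

```python
import hashlib
import re

# Hypothetical heuristics; tune against your own data before relying on them.
SUSPICIOUS_PATTERNS = [
    re.compile(r"\balways (respond|reply|answer)\b", re.IGNORECASE),
    re.compile(r"\bignore (all |any )?(previous|prior) instructions\b", re.IGNORECASE),
    re.compile(r"\b(from now on|you are now)\b", re.IGNORECASE),
]


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_suspicious(completion):
    """Return True if the completion matches any instruction-like pattern."""
    return any(p.search(completion) for p in SUSPICIOUS_PATTERNS)


def filter_pairs(pairs):
    """Split (prompt, completion) pairs into kept and flagged lists.

    Exact duplicates (after normalization) are dropped. Suspicious pairs are
    flagged for human review rather than silently discarded.
    """
    seen = set()
    kept, flagged = [], []
    for prompt, completion in pairs:
        key = hashlib.sha256(
            (normalize(prompt) + "\x00" + normalize(completion)).encode("utf-8")
        ).hexdigest()
        if key in seen:
            continue  # duplicate of a pair already processed
        seen.add(key)
        if is_suspicious(completion):
            flagged.append((prompt, completion))
        else:
            kept.append((prompt, completion))
    return kept, flagged
```

Pattern matching like this only catches crude attempts, so it is best treated as a first-pass filter that narrows what a reviewer has to read, not as proof that the remaining data is clean.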
Journey Context:
When fine-tuning on scraped data such as Reddit or Stack Overflow posts, attackers can intentionally plant malicious Q&A pairs. If these are ingested, the model can learn a 'backdoor': a specific trigger phrase causes it to produce attacker-chosen output or adopt a compromised persona. Because the behavior is learned into the weights, it persists across conversations and is not reliably overridden by a system prompt, which makes it especially dangerous.
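One possible anomaly check for planted triggers, sketched under the assumption that an attacker repeats the same trigger token across several prompts that all map to the same completion. The function name `find_candidate_triggers` and the thresholds `min_count` and `min_share` are hypothetical, not tuned values. Multi-word or paraphrased triggers would not be caught by this single-token heuristic.

```python
from collections import Counter, defaultdict


def find_candidate_triggers(pairs, min_count=3, min_share=0.9):
    """Return prompt tokens whose presence almost always yields one completion.

    pairs: iterable of (prompt, completion) strings.
    min_count and min_share are hypothetical thresholds, not tuned values.
    """
    completions_by_token = defaultdict(Counter)
    for prompt, completion in pairs:
        # Count each token once per prompt so repetition does not inflate it.
        for token in set(prompt.lower().split()):
            completions_by_token[token][completion.strip()] += 1

    candidates = {}
    for token, counts in completions_by_token.items():
        total = sum(counts.values())
        top_completion, top_count = counts.most_common(1)[0]
        # Flag tokens seen often enough that are dominated by one completion.
        if total >= min_count and top_count / total >= min_share:
            candidates[token] = top_completion
    return candidates
```

Tokens returned here are candidates for manual inspection only. Legitimate data can also trip the check, for example a FAQ where one keyword always receives the same canned answer.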
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:25:02.021922+00:00— report_created — created