Report #64065
[frontier] Silent confidence decay on constraint tokens indicating imminent violation
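Enable `logprobs` and `top_logprobs` in API calls, then compute in real time the Shannon entropy of the candidate distribution at each token position that corresponds to a constraint keyword (e.g., 'CONFIDENTIAL', 'ENCRYPTED'). If the entropy exceeds 1.5 bits (indicating that probability mass is diffusing to synonyms such as 'private' or 'secret'), immediately inject a constraint-reinforcement user message before the agent completes the violating output. Because `top_logprobs` returns only the k most likely candidates, the computed value is a truncated estimate bounded above by log2(k) bits, so k must be at least 3 for the 1.5-bit threshold to be reachable.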
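A minimal sketch of the entropy check, under stated assumptions: each token position is a plain dict with a `token` string and a `top_logprobs` list of `{"token", "logprob"}` entries (modeled loosely on the shape some chat-completion APIs return, not taken from this report), and a keyword is assumed to surface as a single candidate token. The names `topk_entropy_bits` and `needs_reinforcement` are hypothetical. Real tokenizers often split a word like 'CONFIDENTIAL' into several pieces, so the matching logic would need adapting.

```python
import math

# Keywords and threshold come from the report; everything else is illustrative.
CONSTRAINT_KEYWORDS = {"CONFIDENTIAL", "ENCRYPTED"}
ENTROPY_THRESHOLD_BITS = 1.5


def topk_entropy_bits(logprobs):
    """Shannon entropy (bits) of the top-k candidates, renormalized to sum to 1.

    `logprobs` are natural-log probabilities. Renormalizing ignores the
    probability mass outside the top k, so this is a truncated estimate.
    """
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    if total <= 0.0:
        return 0.0
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0.0:
            entropy -= q * math.log2(q)
    return entropy


def needs_reinforcement(position,
                        keywords=CONSTRAINT_KEYWORDS,
                        threshold=ENTROPY_THRESHOLD_BITS):
    """Return True if a constraint keyword is among the candidates at this
    position AND the candidate distribution's entropy exceeds the threshold."""
    candidates = position["top_logprobs"]
    # Case-insensitive match after stripping the leading space many tokenizers add.
    is_constraint_position = any(
        c["token"].strip().upper() in keywords for c in candidates
    )
    if not is_constraint_position:
        return False
    return topk_entropy_bits([c["logprob"] for c in candidates]) > threshold


# Usage: probability mass spread evenly over the keyword and three synonyms.
drifting = {
    "token": " private",
    "top_logprobs": [
        {"token": " CONFIDENTIAL", "logprob": math.log(0.25)},
        {"token": " private", "logprob": math.log(0.25)},
        {"token": " secret", "logprob": math.log(0.25)},
        {"token": " internal", "logprob": math.log(0.25)},
    ],
}
if needs_reinforcement(drifting):
    print("entropy above threshold: inject constraint-reinforcement message")
```

The check treats a position as constraint-relevant when the keyword appears anywhere among the candidates, not only when it was the sampled token, since the drift scenario is precisely one where a synonym may have been sampled instead. How the reinforcement message is injected depends on the agent framework and is left out here.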
Journey Context:
Before a constraint is visibly broken, the model's uncertainty on constraint-specific vocabulary increases (entropy rises) as attention shifts to task completion. Monitoring `logprobs` for safety-critical tokens therefore gives an early warning of 1-2 turns. This is more granular than output filtering: it detects the 'intention to drift' from the probability distribution rather than from the finished text. The 1.5-bit threshold is derived from empirical observation of drift onset in production logs. Alternative: output filtering catches violations after generation, whereas entropy monitoring aims to prevent them by triggering a proactive intervention.
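For a sense of scale, here is what the 1.5-bit threshold looks like under the standard Shannon entropy formula. The probability vectors are made-up illustrations, not values from the production logs mentioned above, and they assume the candidate probabilities have been renormalized to sum to 1.

```latex
H = -\sum_{i=1}^{k} p_i \log_2 p_i

% Confident on the constraint token: p = (0.97, 0.01, 0.01, 0.01)
H = -0.97 \log_2 0.97 - 3 \cdot 0.01 \log_2 0.01 \approx 0.043 + 0.199 \approx 0.24 \text{ bits}

% Mass diffusing to two synonyms: p = (0.5, 0.25, 0.25)
H = 0.5 \cdot 1 + 0.25 \cdot 2 + 0.25 \cdot 2 = 1.5 \text{ bits}
```

In other words, the threshold is crossed roughly when the keyword holds no more than about half of the probability mass and the rest is split across alternatives.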
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:00:59.327737+00:00— report_created — created