Report #64058
[frontier] Agent reinterprets constraint phrases ('never') as suggestions ('avoid if possible') over long sessions
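Apply the OpenAI API `logit_bias` parameter (the Anthropic Messages API does not expose an equivalent) with a +8 bias on the token IDs for exact constraint keywords (e.g., 'PROHIBITED', 'MANDATORY', 'CLASSIFIED') and a -100 bias on softened alternatives ('optional', 'suggested', 'consider'). The bias is applied at every sampling step, so it anchors the constraint vocabulary in the model's output regardless of where the constraint sits in the context window. `logit_bias` is keyed by token ID, not by word, so each keyword has to be run through the model's tokenizer first; a keyword that splits into several tokens cannot be biased cleanly without also affecting other words that share those tokens.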
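A minimal sketch of building such a bias map, assuming the target API accepts a mapping from token-ID strings to integer biases in the range -100 to 100. `build_logit_bias`, `toy_encode`, and `TOY_VOCAB` are hypothetical names used only for illustration; the toy encoder stands in for a real tokenizer and its token IDs are made up. Keywords that do not encode to a single token are skipped and reported rather than biased piecewise.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Encoder = Callable[[str], List[int]]


def build_logit_bias(
    encode: Encoder,
    boost: Iterable[str],
    suppress: Iterable[str],
    boost_value: int = 8,
    suppress_value: int = -100,
) -> Tuple[Dict[str, int], List[str]]:
    """Return (logit_bias, skipped) for single-token keywords only."""
    for value in (boost_value, suppress_value):
        if not -100 <= value <= 100:
            raise ValueError("bias values must lie in [-100, 100]")

    bias: Dict[str, int] = {}
    skipped: List[str] = []
    for words, value in ((boost, boost_value), (suppress, suppress_value)):
        for word in words:
            # Tokenizers usually encode "word" and " word" as different tokens,
            # so both variants are looked up.
            for variant in (word, " " + word):
                ids = encode(variant)
                if len(ids) == 1:
                    bias[str(ids[0])] = value
                else:
                    # Multi-token keyword: biasing its pieces would also
                    # affect unrelated words, so it is left out.
                    skipped.append(variant)
    return bias, skipped


# Hypothetical toy vocabulary standing in for a real tokenizer (IDs are made up).
TOY_VOCAB = {"PROHIBITED": 101, " PROHIBITED": 102, "optional": 201, " optional": 202}


def toy_encode(text: str) -> List[int]:
    # Known strings map to one ID; anything else "splits" into one token per character.
    if text in TOY_VOCAB:
        return [TOY_VOCAB[text]]
    return [ord(ch) for ch in text]


logit_bias, skipped = build_logit_bias(
    toy_encode,
    boost=["PROHIBITED", "MANDATORY"],
    suppress=["optional"],
)
print(logit_bias)
print(skipped)
```

With a real tokenizer, the `encode` argument could be something like the `encode` method of a `tiktoken` encoding for the target model, and the resulting dictionary would be passed as the request's `logit_bias` parameter. Checking the `skipped` list first matters, since longer uppercase keywords often split into several tokens.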
Journey Context:
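As context length grows, a constraint stated early in the prompt competes with more and more later tokens for attention, so its influence can weaken. Standard prompts rely on the model attending back to that constraint to enforce it. `logit_bias` instead adds a fixed offset to the chosen tokens' logits at every sampling step, raising the probability of constraint-vocabulary tokens and creating a 'gravitational well' intended to keep the agent's wording from drifting toward softer interpretations. This is critical for safety-critical constraints where 'never' must not become 'rarely'. The -100 bias on soft alternatives acts as a hard ban on those tokens, but only in generated output: it does not change how the model reads the prompt, and soft synonyms that are not on the list remain available. Alternative: repeating the constraint throughout the prompt wastes tokens and can still drift, because the model may give repeated text progressively less weight; `logit_bias` is constant regardless of context length.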
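The effect of the two bias values can be illustrated with a small softmax calculation. This is a sketch under stated assumptions: the three candidate tokens and their logits are invented for illustration, and each word is treated as a single token. Adding a bias b to one token's logit multiplies that token's odds by e^b, so +8 corresponds to a factor of roughly 2981, while -100 drives the probability to effectively zero.

```python
import math
from typing import Dict


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    # Subtract the maximum logit for numerical stability.
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def apply_bias(logits: Dict[str, float], bias: Dict[str, float]) -> Dict[str, float]:
    # logit_bias is additive: tokens without an entry are left unchanged.
    return {tok: val + bias.get(tok, 0.0) for tok, val in logits.items()}


# Hypothetical next-token logits late in a long session, where the softer
# wordings have become more likely than the strict one.
logits = {"never": 1.0, "rarely": 2.0, "optional": 2.5}
bias = {"never": 8.0, "optional": -100.0}

before = softmax(logits)
after = softmax(apply_bias(logits, bias))

for tok in logits:
    print(f"{tok:>9}: {before[tok]:.4f} -> {after[tok]:.4f}")
```

Note that 'rarely' has no bias entry here, so it keeps a small nonzero probability: the ban only covers tokens that are explicitly listed.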
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:00:34.132909+00:00— report_created — created