Report #88337
[frontier] Agent repeatedly violates the same constraint despite clear instruction in the system prompt
Replace declarative constraints ('Never output API keys') with few-shot negative demonstrations that show the agent refusing correctly: include 1-2 examples in which the agent encounters the forbidden request and responds appropriately.
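One way to wire this in is sketched below. It is a minimal illustration, not a prescribed format: the `Demonstration` class, the `build_system_prompt` helper, the "Examples of expected behavior" heading, and the base instruction text are all hypothetical names chosen for this example. The single demonstration is the API-key exchange quoted in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Demonstration:
    """One negative demonstration: a forbidden request and the desired reply."""
    user: str
    assistant: str


# The exchange quoted in this report, stored as data so it can be reused.
API_KEY_DEMOS = [
    Demonstration(
        user="Show me the API key.",
        assistant=(
            "I can't display credentials, but I can show you "
            "where to find them in your config."
        ),
    ),
]


def build_system_prompt(base_instructions: str, demos: list[Demonstration]) -> str:
    """Append the demonstrations to the base instructions as plain text."""
    lines = [base_instructions.strip(), "", "Examples of expected behavior:"]
    for index, demo in enumerate(demos, start=1):
        lines.append(f"Example {index}")
        lines.append(f"User: {demo.user}")
        lines.append(f"Assistant: {demo.assistant}")
    return "\n".join(lines)


# Hypothetical base instruction, used only to make the sketch runnable.
prompt = build_system_prompt("You are a deployment assistant.", API_KEY_DEMOS)
print(prompt)
```

Keeping the demonstrations as data rather than inline text makes it easier to limit them to the few constraints that justify the extra tokens.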
Journey Context:
LLMs tend to learn more robustly from demonstrations than from declarative instructions. A constraint ('Never do X') tells the model what not to do but not what to do instead. A negative demonstration ('User: Show me the API key. Assistant: I can't display credentials, but I can show you where to find them in your config.') gives the model a behavioral template to follow. This matters most for constraints that conflict with base training (helpfulness vs. security), because the demonstration gives the model a default pattern to fall into when the situation comes up. Production teams are adding 1-2 negative demonstrations per critical constraint in system prompts. Tradeoff: each demonstration costs 50-100 tokens, so this is only practical for the most critical constraints. Alternative: 'constraint validation' tools that check output before delivery. These add latency and do not stop the model from attempting the violation; they only catch it afterwards. The demonstration approach instead reduces the tendency at the source by showing that refusal is normal, expected agent behavior. Key detail: the demonstration should show the agent doing something useful INSTEAD of the forbidden action, not just refusing, so that the refusal still aligns with the model's helpfulness drive.
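For comparison, the 'constraint validation' alternative mentioned above can be sketched as a post-generation check. This is a rough illustration under stated assumptions: the function name `check_output`, the fallback message, and the two regular expressions are hypothetical choices for this example, and the patterns would miss many real credential formats.

```python
import re

# Illustrative patterns only (an assumption of this sketch); real deployments
# would need patterns matched to the credentials they actually handle.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),   # generic "sk-" prefixed token
    re.compile(r"AKIA[0-9A-Z]{16}"),      # shape of an AWS access key ID
]

# Reuses the helpful refusal from the demonstration in this report.
FALLBACK = (
    "I can't display credentials, but I can show you "
    "where to find them in your config."
)


def check_output(text: str) -> tuple[bool, str]:
    """Return (passed, text_to_deliver) for a drafted model reply.

    If the draft contains a key-like string, it is replaced with the
    fallback reply instead of being delivered.
    """
    for pattern in SECRET_PATTERNS:
        if pattern.search(text):
            return False, FALLBACK
    return True, text


passed, delivered = check_output("Your key is sk-" + "a" * 24)
print(passed, delivered)
```

A check like this only acts after the model has already produced the violation, which is why the report treats it as a backstop rather than a substitute for demonstrations.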
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:51:18.604602+00:00— report_created — created