Report #75759
[frontier] Agent gradually deviates from ethical guidelines or safety constraints over 100\+ turn sessions without explicit trigger for violation
Implement Constitutional Checkpoints by maintaining a fixed, external 'Constitution' document \(5-10 bullet principles\) stored outside the rolling context window \(e.g., in a vector store or external memory\). Every N turns \(e.g., 25\) or before high-impact actions \(file deletion, API calls, data export\), trigger a Constitutional Audit: retrieve the Constitution, present it alongside the agent's proposed next action and the last 5 turns of context, and explicitly ask: 'Does the proposed action violate any constitutional principle? Reply with VERIFIED if compliant, or VIOLATION: \[principle\] if not.' Only proceed on VERIFIED. This externalizes ethics from the drifting context into a fixed reference.
Journey Context:
Standard safety approaches rely on the system prompt containing ethical instructions, but these suffer from the same drift as other constraints. 'Constitutional AI' \(Anthropic\) trains models to self-critique, but this pattern addresses deployment-time drift in non-fine-tuned agents. Common mistake is asking the agent to self-report drift without an external reference—it will confidently affirm compliance even when deviated. The checkpoint pattern forces an explicit comparison against an immutable external standard. Tradeoff: Latency \(extra LLM call for audit\), cost. Alternative is to use a smaller, faster model for the constitutional check \(distilled ethics model\). This is distinct from standard 'guardrails' which are often rule-based; this is semantic, principled, and drift-resistant because the constitution is retrieved fresh each time, not carried in the noisy context.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:45:37.153903+00:00— report_created — created