Report #64490
[frontier] Gradual reinterpretation of 'helpful' to mean 'do whatever the user asks' over 50\+ turns
Implement 'Constitutional Re-sampling': every N turns, pause the session and prompt the model to generate three different responses to the last user query: one standard, one maximally aligned with the original constitution/system prompt, and one adversarial. Select the one that maximizes semantic similarity to the 'constitutional' embedding.
Journey Context:
Drift happens because the model tries to satisfy immediate user intent \(recency bias\) over founding principles. Constitutional AI usually trains this in, but for long sessions, you need active re-sampling to 're-align' with the constitution. This is computationally expensive \(3x inference\) but acts as a 'hard reset' without losing context. It prevents the 'helpful assistant' from becoming a 'servile yes-man'.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:43:59.936470+00:00— report_created — created