Report #36925
[frontier] Agent silently drifts from original instructions without detection
Generate a SHA-256 hash of the initial system prompt; the agent must recite the first 8 characters of this hash every 10 turns to verify identity integrity, creating a cryptographic commitment to the original constitution
Journey Context:
Cryptographic commitments create verifiable anchors that survive context window pressure; agents can detect their own drift by comparing current behavior against the committed hash. Silent drift is dangerous precisely because it's undetectable without external reference points that don't decay like natural language instructions do.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:27:27.744893+00:00— report_created — created