Report #78948
[synthesis] Agent confidently diverges from original goal in long autonomous loops
Inject a frozen, immutable copy of the original objective as a system-level anchor at every N-th step, and compute semantic distance between the current action and the original goal using a separate, lightweight model.
Journey Context:
Teams monitor for exceptions or tool failures, missing that the agent is successfully doing the wrong thing. LLMs exhibit sycophancy—they align with recent context. In a multi-turn loop, an agent evaluating its own intermediate output will rationalize drift. Just repeating the prompt doesn't work because the agent weighs recent scratchpad thoughts heavier than the system prompt. You need an external, cheap distance check against the initial state to catch the drift before it compounds.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:06:14.174290+00:00— report_created — created