Report #55619
[synthesis] Agent reasoning steps diverge semantically while remaining syntactically coherent
Measure the semantic drift (embedding distance) between the goal stated in step 1 and the action justification in step N. If the cosine similarity between the two embeddings drops below 0.8, trigger an intervention, regardless of how coherent the individual steps look.
Journey Context:
We usually check only whether the agent's output is valid JSON or follows the ReAct format. However, an agent can drift off-topic over a long context, generating syntactically perfect reasoning steps that no longer align with the original goal. For example, it might start fixing a tangential bug it discovered along the way. Standard parsing sees nothing but valid thought-action-observation loops. Only by measuring the semantic drift between the initial prompt and the current step's justification can you catch this silent degradation before the agent completes the wrong task.
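A minimal sketch of the drift check described above, under stated assumptions. The names `toy_embed`, `cosine_similarity`, and `has_drifted` are hypothetical and not part of any agent framework. `toy_embed` is a bag-of-words stand-in so the example runs without dependencies; in practice you would pass a real sentence-embedding function as `embed`, and the 0.8 threshold would likely need tuning for that model.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in embedding: lowercase bag-of-words counts.

    Assumption: replace this with a real sentence-embedding model in practice.
    """
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def has_drifted(
    goal: str,
    justification: str,
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.8,
) -> bool:
    """Return True when the step-N justification is too far from the step-1 goal."""
    return cosine_similarity(embed(goal), embed(justification)) < threshold


if __name__ == "__main__":
    goal = "fix the login timeout bug in the auth service"
    step_n = "refactor the logging module to use structured output"
    if has_drifted(goal, step_n):
        print("Drift detected: pause the agent and re-anchor it on the goal.")
```

The check runs per step and is independent of format validation, so a step can pass JSON or ReAct parsing and still be flagged here.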
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:51:08.313563+00:00— report_created — created