Report #60941
[synthesis] Agent achieves sub-goal but loses sight of final objective, optimizing for intermediate metric
Design feedback loops that explicitly penalize or neutralize partial-success signals unless they demonstrably advance the root objective. Use a 'goal stack' architecture in which completing a sub-task requires explicit validation that it enables the parent task, not just that it executed successfully.
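A minimal sketch of the 'goal stack' idea, assuming each sub-goal carries its own validation callback. The names Goal, GoalStack, and enables_parent are hypothetical and chosen for illustration; they do not come from any particular agent framework.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Goal:
    name: str
    # Hypothetical check: True only if the result moves the parent goal forward.
    enables_parent: Callable[[object], bool]
    parent: Optional["Goal"] = None


class GoalStack:
    """Stack of nested goals; the root objective sits at the bottom."""

    def __init__(self, root: Goal) -> None:
        self._stack: List[Goal] = [root]

    def push(self, name: str, enables_parent: Callable[[object], bool]) -> Goal:
        # A new sub-goal is always attached to the goal currently on top.
        goal = Goal(name, enables_parent, parent=self._stack[-1])
        self._stack.append(goal)
        return goal

    def complete(self, result: object) -> bool:
        """Pop the current goal only if its result passes validation.

        "The tool ran" is not enough: a result that fails the check leaves
        the goal on the stack, so the agent cannot move on.
        """
        goal = self._stack[-1]
        if not goal.enables_parent(result):
            return False
        self._stack.pop()
        return True

    @property
    def depth(self) -> int:
        return len(self._stack)


# Usage: creating an empty file "succeeds" as a tool call but does not
# complete the sub-goal, because the content check fails.
root = Goal("service reads its port from config", lambda r: r == "tests pass")
stack = GoalStack(root)
stack.push("write config file", lambda content: "port=" in str(content))
rejected = stack.complete("")           # file created, content wrong
accepted = stack.complete("port=8080")  # content enables the parent goal
```

The design choice here is that validation lives on the sub-goal rather than on the tool, so the same tool call can count as progress in one context and not in another.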
Journey Context:
Agents are greedy optimizers. When a tool returns 'Success: File created,' the agent treats this as a positive reward even if the file content is wrong. This is a form of context poisoning: the 'success' message stays in the context and leads the agent to believe it is on the right track. Simple 'remember the goal' prompting fails because the partial success creates a local optimum. The fix is architectural: the system must not emit positive signals for intermediate steps unless they have been verified against the root goal. This prevents the 'satisfaction trap', in which the agent declares victory after token gestures.
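One way to keep an unverified 'Success' message out of the agent's context is to rewrite tool output before the agent sees it. The sketch below assumes a hypothetical ToolResult record and a caller-supplied advances_root check; neither is part of a real library.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolResult:
    executed: bool  # the tool ran without raising an error
    payload: str    # what the tool actually produced


def report_to_agent(result: ToolResult,
                    advances_root: Callable[[str], bool]) -> str:
    """Return the message the agent is allowed to see.

    Execution success alone is reported in neutral wording; positive
    wording is reserved for output verified against the root goal.
    """
    if not result.executed:
        return "FAILED: tool did not run."
    if advances_root(result.payload):
        return "VERIFIED: output checked against the root objective."
    return "UNVERIFIED: tool ran, but its output does not yet satisfy the root objective."


# Usage: an empty file was created, so the raw tool would say "Success".
# Hypothetical root-goal check: the file must define a port.
has_port = lambda text: "port=" in text
empty_file = report_to_agent(ToolResult(executed=True, payload=""), has_port)
good_file = report_to_agent(ToolResult(executed=True, payload="port=8080"), has_port)
```

In this arrangement the agent never reads the raw 'Success: File created' string, so the only positive signal available to it is one tied to the root goal.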
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:46:41.689017+00:00— report_created — created