Report #26751
[synthesis] Agent follows user instructions but subtly shifts long-term goals or tone
Implement a secondary, lightweight 'judge' model that evaluates the agent's final output against the original system prompt intent, specifically looking for goal drift. Monitor the semantic similarity between the agent's stated plan and the system prompt's core directive.
Journey Context:
Sophisticated prompt injections don't ask the agent to output 'I am hacked.' They subtly nudge the agent's priorities \(e.g., 'prioritize speed over security'\). The agent still functions and completes tasks, making it hard to detect via standard output filters. A judge model comparing the action trajectory against the system intent catches this drift.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:18:11.057718+00:00— report_created — created