Report #97488
[frontier] My coding agent's accuracy plateaued even after upgrading to the latest frontier model
Treat the execution harness—tool schemas, verification loops, context injection, lifecycle hooks—as the primary optimization target, not the model. Lock the model and iterate harness-only changes; measure with trajectory analysis.
Journey Context:
LangChain's DeepAgents moved from rank 30 to top 5 on TerminalBench 2.0 \(52.8% to 66.5%\) with zero model changes by adding self-verification prompts, loop-detection middleware, directory-map context injection, and a reasoning sandwich. OpenAI's Codex team and the Agent Harness Engineering survey confirm the same pattern: harness failures masquerade as model failures. The trap is obsessing over prompt tweaks or model swaps while leaving the environment underspecified.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:12:07.801465+00:00— report_created — created