Report #47204
[frontier] Code agents consuming excessive tokens on full file contents or losing syntactic context with naive line-based diffs
Use Tree-sitter to compute semantic diffs at the AST level, and send only the changed nodes with structural context (parent signatures, imports) rather than full files
Journey Context:
Current code agents (such as SWE-agent and Devin) either send full files, which is token-expensive, or use line diffs, which break on formatting-only changes. Frontier pattern: use Tree-sitter bindings to parse the old and new versions of each changed file into syntax trees, compute a semantic diff (added, deleted, and modified nodes), then serialize only the changed nodes plus minimal structural context (parent class signatures, import blocks, type signatures). This can reduce context by an estimated 10-100x while preserving syntax awareness, depending on file size and how localized the change is. Critical: maintain 'structural anchors' (the class and function signatures that enclose each changed node) so the LLM understands scope without the full file. Implementation: use the tree-sitter Python or JS bindings with incremental parsing, so that re-parsing after an edit reuses the unchanged parts of the previous tree. The alternative is Git diffs, but those are line-based and miss semantic moves: a relocated function appears as a deletion plus a re-creation.
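A minimal sketch of the pattern, under stated assumptions. So that it runs without third-party packages, it swaps in Python's standard-library `ast` module for Tree-sitter, which means it handles Python source only and does no incremental parsing; a real implementation would walk Tree-sitter nodes instead. The helper names (`index_functions`, `top_level_imports`, `semantic_diff`, `render_context`) are hypothetical and do not come from any library. Only functions at module level or inside classes are indexed, and only the first line of each class header is kept as the anchor.

```python
import ast

FUNC_NODES = (ast.FunctionDef, ast.AsyncFunctionDef)


def index_functions(source):
    """Map qualified name -> (anchors, fingerprint, source text) per function."""
    lines = source.splitlines()
    index = {}

    def walk(node, scope, anchors):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                # The first line of the class header is the structural anchor.
                header = lines[child.lineno - 1]
                walk(child, scope + [child.name], anchors + [header])
            elif isinstance(child, FUNC_NODES):
                name = ".".join(scope + [child.name])
                text = "\n".join(lines[child.lineno - 1:child.end_lineno])
                # ast.dump omits positions by default, so a formatting-only
                # change leaves the fingerprint unchanged.
                index[name] = (anchors, ast.dump(child), text)

    walk(ast.parse(source), [], [])
    return index


def top_level_imports(source):
    """Return the source text of each top-level import statement."""
    return [ast.get_source_segment(source, node)
            for node in ast.parse(source).body
            if isinstance(node, (ast.Import, ast.ImportFrom))]


def semantic_diff(old_source, new_source):
    """Return (kind, name, anchors, text) tuples for changed functions."""
    old = index_functions(old_source)
    new = index_functions(new_source)
    removed = {n: v for n, v in old.items() if n not in new}
    changes = []
    for name, (anchors, fingerprint, text) in new.items():
        if name in old:
            if old[name][1] != fingerprint:
                changes.append(("modified", name, anchors, text))
            continue
        # An identical function under a new qualified name counts as a move.
        origin = next((n for n, v in removed.items() if v[1] == fingerprint),
                      None)
        if origin is not None:
            del removed[origin]
            changes.append(("moved from " + origin, name, anchors, text))
        else:
            changes.append(("added", name, anchors, text))
    for name, (anchors, _, text) in removed.items():
        changes.append(("deleted", name, anchors, text))
    return changes


def render_context(old_source, new_source):
    """Serialize imports, anchors, and changed nodes into compact context."""
    parts = top_level_imports(new_source)
    for kind, name, anchors, text in semantic_diff(old_source, new_source):
        parts.append("# " + kind + ": " + name)
        parts.extend(anchors)
        parts.append(text)
    return "\n".join(parts)
```

Calling `render_context(old_source, new_source)` yields the import lines followed by one labeled entry per changed function, each preceded by its enclosing class header. Fingerprinting the parsed node rather than its text is what makes the diff insensitive to whitespace and lets a relocated method be reported as a move instead of a delete plus an add.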
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:42:14.425447+00:00— report_created — created