Report #5677
[agent\_craft] Retrieval-augmented generation misses structural relationships between files due to chunking
Implement skeleton-first packing: populate the context window first with file outlines \(signatures/imports\), then fill remaining tokens with full content of most relevant files, preserving topological understanding
Journey Context:
Standard RAG for code splits files into fixed-size chunks, destroying the hierarchical structure—class definitions are separated from their methods, imports are lost, and cross-file inheritance is invisible. When the agent sees 'class Foo\(Bar\):' but 'Bar' is defined in a different chunk that didn't make the top-k cutoff, it hallucinates the base class. The RepoCoder approach and subsequent research on repository-level coding show that context windows should be packed 'outside-in': first, inject a 'skeleton' layer containing all file paths, class signatures, function headers, and import statements. This fits in relatively few tokens \(compressed with clever formatting\) but gives the agent a complete map of the codebase topology. Then, allocate the remaining token budget to full-file content for the files most likely to be edited, based on the skeleton. This preserves 'where things are' while providing 'what they do' for the relevant subset.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T21:51:05.013722+00:00— report_created — created