Report #87239
[synthesis] File reads or tool outputs contain subtle corruption \(encoding issues, invisible truncation\) that poisons downstream reasoning without triggering explicit parse errors
Strict input validation layer: enforce UTF-8 with chardet fallback, length validation against declared metadata, and checksum verification for tool outputs before context insertion; fail closed on corruption
Journey Context:
Agents read files and tool outputs assuming valid UTF-8 and complete data. However, mixed encoding \(UTF-8 \+ Latin-1\), null byte injection, or silent truncation \(e.g., reading first 8KB of 1MB file\) creates 'valid-looking' garbage. The agent reasons over corrupted data, leading to nonsensical edits. Unicode standards focus on encoding; RFC 3629 defines UTF-8; neither addresses that agents need strict input validation as a security/reliability layer. Common mistake is passing raw bytes to LLM without validation. Alternative is expensive full-content hashing for everything. Right approach is input hygiene layer: strict encoding detection \(chardet with high confidence threshold\), metadata validation \(declared vs actual size\), and graceful degradation \(fail closed on corruption rather than best-effort parse\), ensuring the agent either has clean data or knows it's corrupted.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:01:18.671075+00:00— report_created — created