Report #97578
[counterintuitive] LLM fails on a task that is just a novel combination of skills it already demonstrates
When a task requires recombining primitives in structurally new ways, assume zero-shot/few-shot LLM performance will be poor. Use explicit symbolic decomposition, DSLs, or trained verifiers rather than more examples.
Journey Context:
The default assumption is 'if it can do A and B, it can do A then B.' Systematic generalization benchmarks (SCAN, COGS, CFQ) show that models can master in-distribution patterns yet fail catastrophically on novel compositions of known words and rules. More prompting or more in-context examples do not reliably induce compositional rules; the model falls back on surface similarity. This is a long-standing limitation of neural networks. The fix is either to decompose the task into verifiable steps or to use architectures or training regimes explicitly designed for compositionality.
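A minimal sketch of the "decompose into verifiable steps" approach, assuming a toy command language loosely in the spirit of SCAN (it is not the benchmark's actual grammar). The names PRIMITIVES, REPEATS, expected_actions, solve, and flaky_model are hypothetical, and flaky_model is only a stand-in for an LLM call that handles single primitives but ignores modifiers:

```python
from typing import Callable, List

# Hypothetical toy vocabulary; an assumption for illustration only.
PRIMITIVES = {"jump": "JUMP", "walk": "WALK", "run": "RUN"}
REPEATS = {"twice": 2, "thrice": 3}


def expected_actions(clause: str) -> List[str]:
    """Symbolically expand one clause such as 'jump' or 'jump twice'."""
    tokens = clause.split()
    if len(tokens) not in (1, 2) or tokens[0] not in PRIMITIVES:
        raise ValueError(f"cannot parse clause: {clause!r}")
    count = 1
    if len(tokens) == 2:
        if tokens[1] not in REPEATS:
            raise ValueError(f"unknown modifier in clause: {clause!r}")
        count = REPEATS[tokens[1]]
    return [PRIMITIVES[tokens[0]]] * count


def solve(command: str, propose_step: Callable[[str], List[str]]) -> List[str]:
    """Split the command into clauses, verify each proposed step, then compose."""
    actions: List[str] = []
    for clause in command.split(" and "):
        candidate = propose_step(clause)      # model's answer for one small step
        reference = expected_actions(clause)  # symbolic check for that step
        # Keep the proposal only if it passes verification; otherwise use the rule.
        actions.extend(candidate if candidate == reference else reference)
    return actions


def flaky_model(clause: str) -> List[str]:
    """Stand-in for an LLM: knows each primitive but drops the modifier."""
    return [PRIMITIVES[clause.split()[0]]]


if __name__ == "__main__":
    print(solve("jump twice and walk", flaky_model))
```

The composition itself (splitting on "and", repeating actions) lives in ordinary code, so the result does not depend on the model generalizing to an unseen combination. In a real task the per-step check would more likely be a test, a type checker, or a trained verifier than a complete reference interpreter.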
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:21:16.625331+00:00— report_created — created