Report #21119
[gotcha] Excessive DNS queries and latency from Kubernetes pods due to ndots:5 search domain behavior
Keep `dnsPolicy: ClusterFirst` (the default) and use `dnsConfig` to lower `ndots` from 5 to 2 or 1 for microservices that mostly make external calls. Note that `ndots:2` only spares names with two or more dots (e.g., `api.example.com`); a one-dot name such as `google.com` still walks the search list unless `ndots` is 1. Alternatively, write every hostname in app config as an FQDN (fully qualified domain name) ending in a dot, for inter-service calls (e.g., `service.namespace.svc.cluster.local.`) as well as external domains, or configure the search domains explicitly in `dnsConfig`. Monitor CoreDNS metrics for `forward` plugin latency and `template` plugin NXDOMAIN counts to detect the storm.
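
As a rough illustration of where the `ndots` override lives in a pod spec, the sketch below builds the manifest as a Python dict and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The helper `with_ndots`, the pod name, and the image are hypothetical; only the field names (`dnsPolicy`, `dnsConfig.options`, `ndots`) come from the Kubernetes pod spec.

```python
import json


def with_ndots(pod_spec, ndots):
    """Return a copy of a pod spec dict with the ndots resolver option set.

    Hypothetical helper for illustration; it does not talk to a cluster.
    """
    spec = dict(pod_spec)
    # ClusterFirst is already the default; setting it only makes the intent explicit.
    spec.setdefault("dnsPolicy", "ClusterFirst")
    dns_config = dict(spec.get("dnsConfig", {}))
    # Drop any existing ndots option, keep the others, then add the new value.
    options = [o for o in dns_config.get("options", []) if o.get("name") != "ndots"]
    # Option values are strings in the pod spec, so "2" rather than 2.
    options.append({"name": "ndots", "value": str(ndots)})
    dns_config["options"] = options
    spec["dnsConfig"] = dns_config
    return spec


pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "external-caller"},  # assumed name
    "spec": with_ndots(
        {"containers": [{"name": "app", "image": "example.invalid/app:latest"}]},
        2,
    ),
}

print(json.dumps(pod, indent=2))
```

For a Deployment, the same `dnsConfig` block would sit under `spec.template.spec` instead of the top-level `spec`.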
Journey Context:
By default, Kubernetes sets `ndots:5` and the search domains `[namespace.svc.cluster.local, svc.cluster.local, cluster.local, ec2.internal (on AWS)]` in each pod's `/etc/resolv.conf`. When an app resolves a name like `google.com`, the resolver first checks whether the name contains at least 5 dots. It has only 1, so the resolver treats it as relative and appends each search domain in turn: `google.com.namespace.svc.cluster.local`, `google.com.svc.cluster.local`, and so on. Every one of those queries returns NXDOMAIN, and only once the search list is exhausted does the resolver try the absolute name `google.com.`. With four search domains that is 5 queries per lookup, and typically twice as many when the client asks for both A and AAAA records. This overloads CoreDNS, increases latency, and can hit the AWS VPC Resolver quota of 1024 packets per second per ENI.
Common mistakes: not using FQDNs in app configs; assuming `google.com` resolves in a single query; not monitoring CoreDNS `forward` latency.
Alternatives considered: using `dnsPolicy: Default` (bypasses cluster DNS, so service discovery is lost); disabling search domains entirely (breaks short-name service discovery).
Why the fix is right: a name with at least `ndots` dots, or any name with a trailing dot, is tried as an absolute name first, so a lookup that succeeds never touches the search list. With `ndots:2` this covers external names with two or more dots (e.g., `api.example.com`) and cluster FQDNs such as `service.ns.svc.cluster.local` (4 dots); one-dot names like `google.com` need `ndots:1` or a trailing dot. For the names it covers, this cuts queries by 80% or more (from 5 to 1 in the example above), while short names such as `service` or `service.namespace` still resolve through the search list.
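
The query arithmetic can be checked with a small simulation. This is a simplified model of a glibc-style resolver, not the real implementation: it ignores A/AAAA doubling, retries, and caching, and the function names, the `default` namespace, and the set of names assumed to exist are all made up for illustration.

```python
# Assumed search list for a pod in the "default" namespace on AWS.
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "ec2.internal",
]

# Names assumed to exist; every other query is treated as NXDOMAIN.
EXISTING = {
    "google.com.",
    "api.example.com.",
    "payments.default.svc.cluster.local.",
}


def candidate_queries(name, search_domains, ndots):
    """Return the names the modelled resolver tries, in order."""
    if name.endswith("."):
        return [name]  # trailing dot: absolute name, search list skipped
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search_domains]
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try the name as-is first
    return expanded + [absolute]  # too few dots: search list first


def queries_until_answer(name, search_domains, ndots, existing):
    """Count the queries sent until one candidate resolves."""
    candidates = candidate_queries(name, search_domains, ndots)
    for count, candidate in enumerate(candidates, start=1):
        if candidate in existing:
            return count
    return len(candidates)  # nothing resolved: every candidate was queried


for name in ["google.com", "google.com.", "api.example.com", "payments"]:
    for ndots in (5, 2, 1):
        n = queries_until_answer(name, SEARCH, ndots, EXISTING)
        print(f"{name:<18} ndots:{ndots}  queries={n}")
```

In this model `google.com` costs 5 queries at `ndots:5` and at `ndots:2`, and 1 query at `ndots:1` or with a trailing dot, while `api.example.com` drops from 5 to 1 as soon as `ndots` is 2. The short name `payments` resolves on the first search-domain attempt at every setting.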
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T13:51:38.746881+00:00— report_created — created