Report #1226
[bug\_fix] NodeNotReady with PLEG unhealthy and container runtime timeout
SSH to the node and check the container runtime. \`systemctl status containerd\` shows whether the service is running, and \`crictl ps\` shows whether it still answers requests; the service can report active while the runtime is hung. If the runtime is hung because of too many containers or exhausted inotify watches, drain the node, then restart the runtime service or reboot the node. For inotify exhaustion, increase \`fs.inotify.max\_user\_watches\` and \`fs.inotify.max\_user\_instances\`, and persist the new values so they survive a reboot. If disk pressure is the cause, free space by removing unused image layers and old logs, or expand the node disk.
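The checks above can be sketched as a few shell helpers to run on the affected node. This is a minimal sketch, not a verified procedure: the function names, the 10-second timeout, the limit values (524288 and 1024), and the file name \`99-inotify.conf\` are all illustrative assumptions that do not come from the report. Counting inotify instances across all processes needs root; without it, only your own processes are visible.

```shell
#!/usr/bin/env bash
# Sketch: check runtime responsiveness and inotify headroom on a node.
# All function names below are hypothetical helpers, not part of any tool.

# Succeeds only if crictl answers within the given number of seconds
# (default 10). timeout(1) exits with 124 when the command runs too long.
runtime_responsive() {
  timeout "${1:-10}" crictl ps >/dev/null 2>&1
}

# Print the current kernel limits (read-only).
inotify_limits() {
  for key in max_user_watches max_user_instances; do
    printf 'fs.inotify.%s = %s\n' "$key" \
      "$(cat "/proc/sys/fs/inotify/$key" 2>/dev/null || echo unknown)"
  done
}

# Count inotify instances currently open. Each instance appears as a
# file descriptor symlink whose target is "anon_inode:inotify".
inotify_instances_in_use() {
  find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l | tr -d ' '
}

# Print a sysctl.d fragment for the given limits.
# $1 = max_user_watches, $2 = max_user_instances
render_inotify_conf() {
  printf 'fs.inotify.max_user_watches = %s\n' "$1"
  printf 'fs.inotify.max_user_instances = %s\n' "$2"
}

# Example (as root; values and file name are illustrative):
#   runtime_responsive 10 || echo "runtime did not answer in time"
#   inotify_limits
#   inotify_instances_in_use
#   render_inotify_conf 524288 1024 > /etc/sysctl.d/99-inotify.conf
#   sysctl --system
```

Writing the fragment under \`/etc/sysctl.d\` and reloading with \`sysctl --system\` keeps the raised limits across reboots, whereas \`sysctl -w\` alone changes them only until the next boot.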
Journey Context:
Several pods were rescheduled repeatedly, and \`kubectl get nodes\` showed one node as \`NotReady\`. \`kubectl describe node\` reported \`Ready=False\` with the message \`container runtime status check may not have completed yet\`, and the kubelet logs showed \`PLEG is not healthy: pleg was last seen active ... timeout 3m0s\`. On the node, \`systemctl status containerd\` reported the service as active, but \`crictl ps\` hung. The kubelet log also contained \`too many open files\` and \`failed to create shim task: OCI runtime create failed\`. I drained the node with \`kubectl drain\`, rebooted it, and raised \`fs.inotify.max\_user\_watches\` and \`fs.inotify.max\_user\_instances\` via sysctl. After the node came back and I uncordoned it, it stayed Ready. The root cause was a monitoring DaemonSet that created and destroyed many short-lived containers, which exhausted inotify watches and eventually hung containerd.
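The drain, reboot, and uncordon sequence described above might look roughly like the following. It is a sketch under stated assumptions: the helper names, the \`DRY\_RUN\` switch, the 10-minute wait, and the drain flags \`--ignore-daemonsets\` and \`--delete-emptydir-data\` are illustrative choices, since the report does not say which flags were used. The reboot itself is left as a manual step between the two functions.

```shell
#!/usr/bin/env bash
# Sketch of the drain -> reboot -> uncordon sequence from the report.
# Set DRY_RUN=1 to print the commands instead of running them.
# run, drain_node and uncordon_when_ready are hypothetical helper names.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Evict workloads from the node. DaemonSet pods are left in place and
# emptyDir data on evicted pods is discarded (assumed flags).
drain_node() {
  [ -n "${1:-}" ] || { echo "usage: drain_node <node>" >&2; return 2; }
  run kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data
}

# After the reboot: wait for the Ready condition, then allow scheduling
# again. The 10m timeout is an illustrative value.
uncordon_when_ready() {
  [ -n "${1:-}" ] || { echo "usage: uncordon_when_ready <node>" >&2; return 2; }
  run kubectl wait --for=condition=Ready "node/$1" --timeout=10m &&
    run kubectl uncordon "$1"
}

# Example (node name is a placeholder):
#   DRY_RUN=1 drain_node worker-1
#   ... reboot the node ...
#   DRY_RUN=1 uncordon_when_ready worker-1
```

Running with \`DRY\_RUN=1\` first shows exactly which commands would be issued, which fits the warning that workarounds should be checked before they are run.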
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T19:53:24.868008+00:00— report_created — created