CRITICAL | AWS | Cloud
EKS worker nodes stuck in NotReady state after cluster upgrade
aws, eks, kubernetes, nodes, upgrade
Symptoms
- kubectl get nodes shows nodes in NotReady status
- Pods cannot be scheduled on new nodes
- Node Ready condition is False or Unknown, or the kubelet is not responding
- Cluster upgrade or node group replacement triggered the issue
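To confirm the first symptom quickly, the node list can be filtered down to the affected nodes. This is a minimal sketch: the helper name not_ready_nodes is hypothetical, and it assumes the default column layout of kubectl get nodes --no-headers (name first, status second).

```shell
# Read `kubectl get nodes --no-headers` output on stdin and print the names
# of nodes whose STATUS column starts with NotReady
# (this also catches "NotReady,SchedulingDisabled").
not_ready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}
```

Typical use against a live cluster would be: kubectl get nodes --no-headers | not_ready_nodes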
Root Cause
- Kubelet version too old for the upgraded control plane (outside the supported Kubernetes version skew)
- Missing or incorrect IAM permissions for node IAM role
- Security group blocking communication between nodes and control plane
- VPC CNI plugin not properly installed or configured
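The version-skew cause can be checked by comparing minor versions. The sketch below is illustrative only: minor_of and kubelet_skew are hypothetical helper names, and the acceptable skew depends on the Kubernetes release (the upstream policy allows the kubelet to be a limited number of minor versions older than the API server, and never newer), so check the skew policy for your version.

```shell
# Extract the minor version from strings such as "v1.29.3-eks-a1b2c3d" or "1.29".
minor_of() {
  local v="${1#v}"     # drop a leading "v"
  v="${v#*.}"          # drop the major version and the first dot
  v="${v%%[!0-9]*}"    # keep only the leading digits (the minor version)
  printf '%s\n' "$v"
}

# Print how many minor versions the kubelet is behind the control plane.
# A negative result means the kubelet is newer, which is not supported.
kubelet_skew() {
  local cp kl
  cp="$(minor_of "$1")"   # control plane version
  kl="$(minor_of "$2")"   # kubelet version
  echo $(( cp - kl ))
}
```

The control plane version can be read with aws eks describe-cluster --name <cluster-name> --query cluster.version --output text, and the kubelet version appears in the VERSION column of kubectl get nodes.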
Diagnosis
- Check node status and conditions: kubectl describe node <node-name>
- Review kubelet logs on the node: journalctl -u kubelet
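- Verify the node IAM role has the required managed policies attached (AmazonEKSWorkerNodePolicy, AmazonEC2ContainerRegistryReadOnly, AmazonEKS_CNI_Policy)
- Check security group rules between the control plane and the nodes (at minimum TCP 443 and 10250; commonly TCP 1025-65535 from the cluster security group to the nodes)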
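When the kubelet log is long, a rough tally of error patterns can hint at which root cause to look at first. This is a heuristic sketch, not an exhaustive classifier: classify_kubelet_log is a hypothetical helper, and the patterns are assumptions based on messages that commonly accompany authentication, connectivity, and CNI failures.

```shell
# Count kubelet log lines (read from stdin) that match each failure family.
# A single line may be counted in more than one family.
classify_kubelet_log() {
  awk '
    /Unauthorized|AccessDenied|Forbidden/               { iam++ }
    /i\/o timeout|connection refused|no route to host/  { net++ }
    /cni plugin not initialized|NetworkPluginNotReady/  { cni++ }
    END {
      printf "iam_or_auth=%d network=%d cni=%d\n", iam, net, cni
    }'
}
```

On the node this could be fed with: journalctl -u kubelet --no-pager --since "1 hour ago" | classify_kubelet_log. A high count only suggests where to look; confirm with the full log lines.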
Fix
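# On the node (EKS-optimized AMIs that ship /etc/eks/bootstrap.sh)
sudo /etc/eks/bootstrap.sh <cluster-name> --b64-cluster-ca <ca-cert> --apiserver-endpoint <endpoint>
# From a machine with AWS CLI credentials: open the node security group
aws ec2 authorize-security-group-ingress \
  --group-id <sg-id> \
  --protocol tcp \
  --port 1025-65535 \
  --cidr <cidr>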
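The <ca-cert> and <endpoint> placeholders can be looked up rather than copied by hand. The sketch below assembles the bootstrap command for review before it is run; bootstrap_cmd is a hypothetical helper, and the lookup commands in the comments assume AWS CLI credentials with permission to describe the cluster.

```shell
# Print (do not run) the bootstrap command so it can be reviewed first.
# Arguments: cluster name, base64 cluster CA, API server endpoint.
bootstrap_cmd() {
  printf 'sudo /etc/eks/bootstrap.sh %s --b64-cluster-ca %s --apiserver-endpoint %s\n' \
    "$1" "$2" "$3"
}

# Example lookup (not executed here; requires AWS CLI access to the cluster):
#   ca="$(aws eks describe-cluster --name "$cluster" \
#         --query 'cluster.certificateAuthority.data' --output text)"
#   ep="$(aws eks describe-cluster --name "$cluster" \
#         --query 'cluster.endpoint' --output text)"
#   bootstrap_cmd "$cluster" "$ca" "$ep"
```

Printing the command first keeps a record of the exact values used and avoids re-bootstrapping a node against the wrong cluster.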
Prevention
- Use EKS managed node groups to simplify node version updates
- Implement node readiness checks in CI/CD pipeline
- Set up CloudWatch alarms for NotReady nodes
- Test node upgrades in staging environment before production
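A node readiness check for a pipeline can be as small as a gate that fails the job when any node is not Ready. This is a sketch under assumptions: all_nodes_ready is a hypothetical helper, it expects kubectl get nodes --no-headers output on stdin, and it treats a cordoned but healthy node (Ready,SchedulingDisabled) as ready.

```shell
# Exit 0 only when at least one node is listed and every node's STATUS
# starts with Ready. Prints a one-line summary either way.
all_nodes_ready() {
  awk '
    NF == 0 { next }            # ignore blank lines
    { total++ }
    $2 !~ /^Ready/ { bad++ }    # NotReady, Unknown, etc.
    END {
      printf "%d of %d nodes not ready\n", bad, total
      exit (bad > 0 || total == 0)
    }'
}
```

In a pipeline step this might be used as: kubectl get nodes --no-headers | all_nodes_ready. If blocking until nodes recover is preferred, kubectl wait --for=condition=Ready nodes --all --timeout=300s is an alternative; the timeout value here is only an example.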