CRITICAL | AWS | Cloud

EKS worker nodes stuck in NotReady state after cluster upgrade

Tags: aws, eks, kubernetes, nodes, upgrade

Symptoms

kubectl get nodes shows nodes in NotReady status
Pods cannot be scheduled on new nodes
Node Ready condition is False or Unknown, or kubelet is not responding
The problem started after a cluster upgrade or node group replacement
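
To confirm the first symptom quickly, the node list can be filtered down to the nodes that are not Ready. This is a minimal sketch: not_ready_nodes is a hypothetical helper name, and it assumes the default column layout of kubectl get nodes --no-headers (node name first, STATUS second).

```shell
# Print the names of nodes whose STATUS column does not start with "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# "Ready,SchedulingDisabled" (a cordoned but healthy node) is not reported.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Reading from stdin keeps the helper usable against saved output from an incident as well as a live cluster.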

Root Cause

Outdated kubelet version incompatible with control plane
Missing or incorrect permissions on the node IAM role
Security group blocking communication between nodes and control plane
VPC CNI plugin not properly installed or configured
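
The first cause above can be checked by comparing minor versions. The sketch below is illustrative only: minor_of and check_skew are hypothetical helper names, and the default of 2 allowed minor versions behind is a conservative assumption, so pass the limit that matches the Kubernetes version skew policy for your release.

```shell
# Extract the minor version from strings such as "v1.29.3-eks-ae9a62a" or "1.29".
minor_of() {
  local v
  v="${1#v}"             # drop a leading "v"
  v="${v#*.}"            # drop the major component and its dot
  echo "${v%%[!0-9]*}"   # keep only the leading digits
}

# Compare kubelet and control plane minor versions.
# $1 = control plane version, $2 = kubelet version,
# $3 = max minor versions the kubelet may lag (assumed default: 2)
check_skew() {
  local cp kl max
  cp="$(minor_of "$1")"
  kl="$(minor_of "$2")"
  max="${3:-2}"
  if [ "$kl" -gt "$cp" ]; then
    echo "kubelet is newer than the control plane"
    return 1
  elif [ $((cp - kl)) -gt "$max" ]; then
    echo "kubelet is $((cp - kl)) minor versions behind the control plane"
    return 1
  fi
  echo "ok"
}

# Usage (requires cluster access):
#   cp="$(aws eks describe-cluster --name <cluster-name> --query 'cluster.version' --output text)"
#   kl="$(kubectl get node <node-name> -o jsonpath='{.status.nodeInfo.kubeletVersion}')"
#   check_skew "$cp" "$kl"
```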

Diagnosis

Check node status and conditions: kubectl describe node <node-name>
Review kubelet logs on the node: journalctl -u kubelet
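Verify the node IAM role has the required EKS managed policies attached
Check security group rules: nodes need outbound TCP 443 to the control plane, and the control plane needs to reach the nodes on TCP 10250 (kubelet); AWS recommends also allowing 1025-65535 from the cluster security group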
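
The node condition and VPC CNI checks can be narrowed to the fields that matter. This is a rough sketch with hypothetical helper names (node_ready_condition, cni_daemonset_ready); it assumes the VPC CNI runs as the aws-node DaemonSet in kube-system, which is the default on EKS.

```shell
# Print "<status> <reason>: <message>" for the node's Ready condition.
# $1 = node name
node_ready_condition() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[?(@.type=="Ready")]}{.status}{" "}{.reason}{": "}{.message}{"\n"}{end}'
}

# Succeed only if every scheduled aws-node (VPC CNI) pod is ready.
cni_daemonset_ready() {
  local counts desired ready
  counts="$(kubectl -n kube-system get daemonset aws-node \
    -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}')" || return 1
  desired="${counts% *}"   # text before the space
  ready="${counts#* }"     # text after the space
  echo "aws-node: ${ready}/${desired} ready"
  [ -n "$desired" ] && [ "$desired" = "$ready" ]
}

# Usage (requires cluster access):
#   node_ready_condition <node-name>
#   cni_daemonset_ready || echo "VPC CNI pods are not all ready"
```

A Ready condition whose message mentions the network plugin or CNI usually points at the last cause in the list rather than at kubelet itself.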

Fix

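Bring the node's kubelet in line with the EKS control plane. The kubelet version comes from the node AMI, so replace outdated nodes with ones built from an AMI for the cluster's Kubernetes version. Re-running the bootstrap script re-joins a node to the cluster but does not upgrade kubelet; it applies to AMIs that ship /etc/eks/bootstrap.sh, such as the Amazon Linux 2 EKS-optimized AMI: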

# On the node
sudo /etc/eks/bootstrap.sh <cluster-name> --b64-cluster-ca <ca-cert> --apiserver-endpoint <endpoint>

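Ensure the node IAM role has AmazonEKSWorkerNodePolicy attached, along with AmazonEC2ContainerRegistryReadOnly, and AmazonEKS_CNI_Policy unless the VPC CNI uses its own IAM role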
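
The attached policies can be listed with the AWS CLI and compared against an expected set. This is a sketch under stated assumptions: missing_node_policies is a hypothetical helper, and the expected list (AmazonEKSWorkerNodePolicy, AmazonEC2ContainerRegistryReadOnly, AmazonEKS_CNI_Policy) reflects a common node role setup, so adjust it if the VPC CNI uses a separate role or a different ECR policy is in use.

```shell
# Print each expected managed policy that is NOT attached to the role.
# $1 = node IAM role name (the name, not the ARN)
missing_node_policies() {
  local attached policy
  attached="$(aws iam list-attached-role-policies --role-name "$1" \
    --query 'AttachedPolicies[].PolicyName' --output text)" || return 2
  # Normalize tabs and newlines to spaces and pad both ends,
  # so that only whole policy names match.
  attached=" $(printf '%s' "$attached" | tr '\t\n' '  ') "
  # Expected set is an assumption; edit to match your environment.
  for policy in AmazonEKSWorkerNodePolicy \
                AmazonEC2ContainerRegistryReadOnly \
                AmazonEKS_CNI_Policy; do
    case "$attached" in
      *" $policy "*) ;;            # attached, nothing to report
      *) echo "$policy" ;;         # missing
    esac
  done
}

# Usage (requires IAM read access):
#   missing_node_policies <node-role-name>
```

Empty output means all three policies are attached. A policy being attached does not prove the role is mapped into the cluster, so the node's access entry or aws-auth mapping is still worth checking separately.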

Update security group to allow required traffic:

# Use --source-group <sg-id> instead of --cidr to allow traffic from another security group
aws ec2 authorize-security-group-ingress \
  --group-id <sg-id> \
  --protocol tcp \
  --port 1025-65535 \
  --cidr <cidr>

Restart kubelet: sudo systemctl restart kubelet

Prevention

Use EKS managed node groups so node AMI and kubelet versions can be brought up to date with a managed rolling update
Implement node readiness checks in CI/CD pipeline
Set up CloudWatch alarms for NotReady nodes
Test node upgrades in staging environment before production
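
A node readiness check in a pipeline can be as small as a kubectl wait with a timeout. The sketch below is one possible shape, not a prescribed implementation: gate_on_node_readiness is a hypothetical name and the 300s default timeout is an assumption to tune per cluster.

```shell
# Fail the pipeline step unless all nodes report Ready within the timeout.
# $1 = timeout in a form accepted by `kubectl wait` (assumed default: 300s)
gate_on_node_readiness() {
  local timeout="${1:-300s}"
  if kubectl wait --for=condition=Ready nodes --all --timeout="$timeout"; then
    echo "all nodes Ready"
  else
    echo "nodes not Ready after $timeout" >&2
    kubectl get nodes >&2    # leave the node list in the job log for triage
    return 1
  fi
}

# Usage in a pipeline step (requires cluster access):
#   gate_on_node_readiness 600s || exit 1
```

Running the same gate after a staging upgrade gives an early signal before the change reaches production.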