CRITICAL | AWS | Cloud

EKS worker nodes stuck in NotReady state after cluster upgrade

Tags: aws, eks, kubernetes, nodes, upgrade
Symptoms
  • kubectl get nodes shows nodes in NotReady status
  • Pods cannot be scheduled on new nodes
  • The node's Ready condition is False or Unknown, or the kubelet is not responding
  • The problem started after a cluster upgrade or node group replacement
Root Cause
  • Outdated kubelet version incompatible with control plane
  • Missing or incorrect IAM permissions for node IAM role
  • Security group blocking communication between nodes and control plane
  • VPC CNI plugin not properly installed or configured
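
The first cause above can be checked by comparing the kubelet's minor version with the control plane's. The sketch below is a rough illustration: the function names minor_version and kubelet_skew are hypothetical, and the version strings are assumed to look like "v1.29.3-eks-0123abc". Under the Kubernetes version skew policy the kubelet must not be newer than the API server; how far it may lag depends on the Kubernetes release.

```shell
# Extract the minor version number from a string such as "v1.29.3-eks-0123abc".
minor_version() {
  printf '%s\n' "$1" | sed -E 's/^v?[0-9]+\.([0-9]+).*/\1/'
}

# Print how many minor versions the kubelet lags the control plane.
# A negative result means the kubelet is newer than the control plane.
# $1: control plane version, $2: kubelet version
kubelet_skew() {
  local cp kubelet
  cp=$(minor_version "$1")
  kubelet=$(minor_version "$2")
  echo $(( cp - kubelet ))
}

# Possible usage (assumes jq is installed):
#   cp=$(kubectl version -o json | jq -r '.serverVersion.gitVersion')
#   kv=$(kubectl get node <node-name> -o jsonpath='{.status.nodeInfo.kubeletVersion}')
#   kubelet_skew "$cp" "$kv"
```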
Diagnosis
  • Check node status and conditions: kubectl describe node <node-name>
  • Review kubelet logs on the node: journalctl -u kubelet
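  • Verify that the node IAM role has the required managed policies attached (at minimum AmazonEKSWorkerNodePolicy)
  • Check security group rules: nodes must reach the control plane on TCP 443, and the control plane must reach the kubelet on TCP 10250 (allowing TCP 1025-65535 from the control plane is the commonly recommended range)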
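
The IAM check can be partly scripted. The following is a minimal sketch, not a complete audit: missing_node_policies is a hypothetical helper, and its default policy list (AmazonEKSWorkerNodePolicy plus AmazonEC2ContainerRegistryReadOnly) is an assumption that should be adjusted to the cluster's setup. It reads attached policy names on stdin and prints the required ones that are absent.

```shell
# Print each required managed policy that is absent from the attached list.
# stdin: attached policy names, separated by any whitespace.
# args:  required policy names (the defaults below are an assumption).
missing_node_policies() {
  local attached policy
  if [ "$#" -eq 0 ]; then
    set -- AmazonEKSWorkerNodePolicy AmazonEC2ContainerRegistryReadOnly
  fi
  # Normalise tabs and newlines to single spaces, padded for whole-word matching.
  attached=" $(tr -s '[:space:]' ' ') "
  for policy in "$@"; do
    case "$attached" in
      *" $policy "*) ;;                 # attached, nothing to report
      *) printf '%s\n' "$policy" ;;     # required but not attached
    esac
  done
}

# Possible usage:
#   aws iam list-attached-role-policies --role-name <node-role-name> \
#     --query 'AttachedPolicies[].PolicyName' --output text | missing_node_policies
```

Empty output means every required policy was found. It does not confirm that the role is mapped into the cluster, which still needs to be checked separately.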
Fix
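  • Bring the kubelet in line with the EKS control plane. The kubelet version comes from the node AMI, so replace the node with one launched from an AMI built for the cluster's Kubernetes version. On an EKS-optimized Amazon Linux 2 AMI, the bootstrap script then configures the kubelet and registers the node with the cluster (it does not upgrade the kubelet binary):
    # On the node
    sudo /etc/eks/bootstrap.sh <cluster-name> --b64-cluster-ca <ca-cert> --apiserver-endpoint <endpoint>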
    
  • Ensure the node IAM role has AmazonEKSWorkerNodePolicy attached, along with AmazonEC2ContainerRegistryReadOnly so the node can pull images
  • Update the node security group to allow the required traffic:
    aws ec2 authorize-security-group-ingress \
      --group-id <sg-id> \
      --protocol tcp \
      --port 1025-65535 \
      --cidr <cidr>
    
  • Restart kubelet: sudo systemctl restart kubelet
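
After the kubelet restarts, it helps to confirm that the node actually recovers. A minimal sketch, assuming kubectl is configured for the cluster: wait_for_node_ready is a hypothetical wrapper, and the 300s default timeout is an arbitrary choice.

```shell
# Block until the node reports Ready, or print the kubelet's stated reason
# for the Ready condition and return non-zero if the timeout expires.
# $1: node name, $2: timeout (default 300s, an arbitrary choice)
wait_for_node_ready() {
  local node="$1" timeout="${2:-300s}"
  if ! kubectl wait --for=condition=Ready "node/$node" --timeout="$timeout"; then
    # Show the message attached to the Ready condition.
    kubectl get node "$node" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}{"\n"}'
    return 1
  fi
}

# Possible usage:
#   wait_for_node_ready <node-name> 120s
```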
Prevention
  • Use EKS managed node groups for automatic version management
  • Implement node readiness checks in CI/CD pipeline
  • Set up CloudWatch alarms for NotReady nodes
  • Test node upgrades in staging environment before production
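
The pipeline readiness check can be as small as a gate that fails the job when any node is not Ready. This is a sketch under assumptions: assert_all_nodes_ready is a hypothetical name, the input is the output of kubectl get nodes --no-headers, and cordoned nodes (Ready,SchedulingDisabled) are treated as healthy.

```shell
# Exit non-zero when any node listed on stdin has a STATUS other than Ready.
# stdin: output of `kubectl get nodes --no-headers`
assert_all_nodes_ready() {
  awk '
    # Column 2 is STATUS; accept "Ready" and "Ready,<flags>" only.
    $2 != "Ready" && $2 !~ /^Ready,/ { print "not ready: " $1; bad = 1 }
    END { exit (bad ? 1 : 0) }
  '
}

# Possible usage in a pipeline step:
#   kubectl get nodes --no-headers | assert_all_nodes_ready
```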