[Bug] eksctl delete nodegroup can delete nodes with running workloads due to ASG scale-up during drain #8711

@mo-rieger

Description

When deleting nodegroups with eksctl delete nodegroup, workloads can be terminated without being safely rescheduled to surviving nodes. This happens because the ASG can launch new nodes into nodegroups that are being drained (triggered by cluster autoscaler, AZ rebalancing, health checks, etc.), and those new nodes receive evicted workloads but are then deleted by CloudFormation stack deletion.

Steps to reproduce

  1. Have a cluster with multiple nodegroups running workloads
  2. Run eksctl delete nodegroup targeting multiple nodegroups simultaneously
  3. The drain process cordons existing nodes and evicts pods
  4. Evicted pods become Pending → the cluster autoscaler (or other ASG scaling triggers) scales up the same nodegroups being deleted, launching new uncordoned nodes
  5. Evicted pods are scheduled onto these new nodes
  6. Drain completes — the new nodes may or may not be caught by the drain loop's re-list
  7. CloudFormation stack deletion terminates all instances, including the new ones carrying workloads

Expected behavior

Workloads should be safely moved to nodes outside the deletion set before nodegroup deletion proceeds. No workloads should be running on any node in the targeted nodegroups when CloudFormation stack deletion begins.

Actual behavior

New nodes are launched into the nodegroups being deleted after the existing nodes are cordoned. These new nodes are not cordoned, receive evicted workloads, and are subsequently terminated by CloudFormation stack deletion — causing unexpected workload disruption.

Analysis

The drain loop in pkg/drain/nodegroup.go:110-172 re-lists nodes on each iteration to handle "accidental scale-up" (per the comment on line 110), but this is a race condition — new nodes can appear after the final re-list but before CloudFormation deletion. More fundamentally, the drain operates only at the Kubernetes node level (cordon + evict) and does nothing to prevent the ASG from launching replacement instances.

Suggested fix

Primary: Suspend ASG scaling processes before drain

Before draining, call SuspendProcesses on each nodegroup's ASG, suspending at minimum Launch, ReplaceUnhealthy, and AZRebalance. This prevents the ASG from launching new instances during drain, regardless of what triggered the scale-up.
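A minimal sketch of that pre-drain step, assuming a thin wrapper interface around the existing SuspendProcesses call (the names `suspender`, `scalingProcessesToSuspend`, and `suspendScalingBeforeDrain` are hypothetical, not eksctl's actual code; the process names come from the AWS ASG process types):

```go
package main

import "fmt"

// suspender is a hypothetical stand-in for the slice of the AutoScaling API
// this fix needs; in eksctl it would be backed by the existing
// SuspendProcesses method on the ASG API interface.
type suspender interface {
	SuspendProcesses(asgName string, processes []string) error
}

// scalingProcessesToSuspend lists the ASG processes that can put new
// instances into a nodegroup: Launch (scale-up), ReplaceUnhealthy
// (health-check replacement), and AZRebalance (cross-zone rebalancing).
func scalingProcessesToSuspend() []string {
	return []string{"Launch", "ReplaceUnhealthy", "AZRebalance"}
}

// suspendScalingBeforeDrain pauses instance-launching processes on every ASG
// backing the nodegroups about to be drained, so no replacement nodes can
// appear and silently absorb evicted pods.
func suspendScalingBeforeDrain(client suspender, asgNames []string) error {
	for _, name := range asgNames {
		if err := client.SuspendProcesses(name, scalingProcessesToSuspend()); err != nil {
			return fmt.Errorf("suspending scaling processes on ASG %q: %w", name, err)
		}
	}
	return nil
}

// fakeSuspender records calls, letting the sketch run without AWS access.
type fakeSuspender struct {
	calls map[string][]string
}

func (f *fakeSuspender) SuspendProcesses(asgName string, processes []string) error {
	if f.calls == nil {
		f.calls = map[string][]string{}
	}
	f.calls[asgName] = processes
	return nil
}

func main() {
	f := &fakeSuspender{}
	if err := suspendScalingBeforeDrain(f, []string{"eksctl-ng-1-NodeGroup"}); err != nil {
		panic(err)
	}
	fmt.Println(f.calls["eksctl-ng-1-NodeGroup"]) // prints [Launch ReplaceUnhealthy AZRebalance]
}
```

In eksctl itself this would presumably reuse the existing suspendProcesses task rather than a new wrapper; the sketch only shows the ordering (suspend first, then drain) and the minimum process set.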

The building blocks already exist in eksctl:

  • SuspendProcesses is already in the ASG API interface (pkg/awsapi/autoscaling.go:862)
  • A working suspendProcesses task exists (pkg/eks/nodegroup.go:185-213) that resolves the ASG name from the CF stack and calls SuspendProcesses
  • ASG name lookup via stackCollection.GetAutoScalingGroupName() is already implemented
  • Process name validation is already in place (pkg/apis/eksctl.io/v1alpha5/validation.go:1555)

This approach is:

  • Autoscaler-agnostic — works at the AWS ASG level, not tied to cluster autoscaler or Karpenter
  • Non-destructive — suspending Launch does not affect existing running instances
  • Self-cleaning — CloudFormation stack deletion removes the ASG anyway, no need to resume processes

For managed nodegroups, the underlying ASG name can be retrieved via the EKS DescribeNodegroup API (resources.autoScalingGroups field).
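To sketch that lookup, the structs below locally mirror only the fields of the DescribeNodegroup response that matter here (the real types live in the AWS SDK; these type and function names are illustrative):

```go
package main

import "fmt"

// autoScalingGroup and nodegroupResources locally mirror the
// nodegroup.resources.autoScalingGroups[].name shape of the EKS
// DescribeNodegroup response; they are not the SDK types.
type autoScalingGroup struct {
	Name string
}

type nodegroupResources struct {
	AutoScalingGroups []autoScalingGroup
}

// asgNamesFromResources collects the backing ASG names for a managed
// nodegroup so the same SuspendProcesses step can be applied to them.
func asgNamesFromResources(r nodegroupResources) []string {
	names := make([]string, 0, len(r.AutoScalingGroups))
	for _, g := range r.AutoScalingGroups {
		if g.Name != "" {
			names = append(names, g.Name)
		}
	}
	return names
}

func main() {
	r := nodegroupResources{
		AutoScalingGroups: []autoScalingGroup{{Name: "eks-ng-1-abc123"}},
	}
	fmt.Println(asgNamesFromResources(r)) // prints [eks-ng-1-abc123]
}
```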

Secondary: Fail-safe check after drain

As a simpler interim safeguard or defense-in-depth measure: after all nodegroup drains complete but before deletion begins, re-list nodes for each nodegroup. If any new undrained nodes exist, fail with a clear error message so the user can retry.
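A sketch of that fail-safe as a pure comparison, assuming the caller records the set of node names it drained and re-lists the nodegroup's nodes just before deletion (function and variable names are hypothetical):

```go
package main

import (
	"fmt"
	"sort"
)

// undrainedNodes returns the names of nodes that appear in a post-drain
// re-list but were never part of the drained set, i.e. instances the ASG
// launched mid-drain. A non-empty result means stack deletion would
// terminate nodes that may be running rescheduled workloads, so deletion
// should fail with a clear error instead of proceeding.
func undrainedNodes(drained map[string]bool, currentNodes []string) []string {
	var extra []string
	for _, n := range currentNodes {
		if !drained[n] {
			extra = append(extra, n)
		}
	}
	sort.Strings(extra)
	return extra
}

func main() {
	drained := map[string]bool{"ip-10-0-1-10": true, "ip-10-0-1-11": true}
	// Re-list after drain: a new node has appeared in the nodegroup.
	current := []string{"ip-10-0-1-10", "ip-10-0-1-11", "ip-10-0-2-99"}
	if extra := undrainedNodes(drained, current); len(extra) > 0 {
		fmt.Printf("refusing to delete nodegroup: undrained nodes present: %v\n", extra)
	}
}
```

This check alone does not close the race (a node could still appear between the check and stack deletion), which is why it is proposed as defense-in-depth alongside the SuspendProcesses fix rather than instead of it.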

eksctl version: 0.224.0
kubectl version: v1.34
