[Bug] eksctl delete nodegroup can delete nodes with running workloads due to ASG scale-up during drain #8711

@mo-rieger

Description

When deleting nodegroups with eksctl delete nodegroup, workloads can be terminated without being safely rescheduled to surviving nodes. This happens because the ASG can launch new nodes into nodegroups that are being drained (triggered by cluster autoscaler, AZ rebalancing, health checks, etc.), and those new nodes receive evicted workloads but are then deleted by CloudFormation stack deletion.

Steps to reproduce

  1. Have a cluster with multiple nodegroups running workloads
  2. Run eksctl delete nodegroup targeting multiple nodegroups simultaneously
  3. The drain process cordons existing nodes and evicts pods
  4. Evicted pods become Pending → the cluster autoscaler (or other ASG scaling triggers) scales up the same nodegroups being deleted, launching new uncordoned nodes
  5. Evicted pods are scheduled onto these new nodes
  6. Drain completes — the new nodes may or may not be caught by the drain loop's re-list
  7. CloudFormation stack deletion terminates all instances, including the new ones carrying workloads

Expected behavior

Workloads should be safely moved to nodes outside the deletion set before nodegroup deletion proceeds. No workloads should be running on any node in the targeted nodegroups when CloudFormation stack deletion begins.

Actual behavior

New nodes are launched into the nodegroups being deleted after the existing nodes are cordoned. These new nodes are not cordoned, receive evicted workloads, and are subsequently terminated by CloudFormation stack deletion — causing unexpected workload disruption.

Analysis

The drain loop in pkg/drain/nodegroup.go:110-172 re-lists nodes on each iteration to handle "accidental scale-up" (per the comment on line 110), but this is a race condition — new nodes can appear after the final re-list but before CloudFormation deletion. More fundamentally, the drain operates only at the Kubernetes node level (cordon + evict) and does nothing to prevent the ASG from launching replacement instances.

Suggested fix

Primary: Suspend ASG scaling processes before drain

Before draining, call SuspendProcesses on each nodegroup's ASG, suspending at minimum Launch, ReplaceUnhealthy, and AZRebalance. This prevents the ASG from launching new instances during drain, regardless of what triggered the scale-up.
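A minimal sketch of that pre-drain step, assuming a thin wrapper interface around the existing SuspendProcesses call (the names `suspender`, `scalingProcessesToSuspend`, and `suspendScalingBeforeDrain` are hypothetical, not eksctl's actual code; the process names come from the AWS ASG process types):

```go
package main

import "fmt"

// suspender is a hypothetical stand-in for the slice of the AutoScaling API
// this fix needs; in eksctl it would be backed by the existing
// SuspendProcesses method on the ASG API interface.
type suspender interface {
	SuspendProcesses(asgName string, processes []string) error
}

// scalingProcessesToSuspend lists the ASG processes that can put new
// instances into a nodegroup: Launch (scale-up), ReplaceUnhealthy
// (health-check replacement), and AZRebalance (cross-zone rebalancing).
func scalingProcessesToSuspend() []string {
	return []string{"Launch", "ReplaceUnhealthy", "AZRebalance"}
}

// suspendScalingBeforeDrain pauses instance-launching processes on every ASG
// backing the nodegroups about to be drained, so no replacement nodes can
// appear and silently absorb evicted pods.
func suspendScalingBeforeDrain(client suspender, asgNames []string) error {
	for _, name := range asgNames {
		if err := client.SuspendProcesses(name, scalingProcessesToSuspend()); err != nil {
			return fmt.Errorf("suspending scaling processes on ASG %q: %w", name, err)
		}
	}
	return nil
}

// fakeSuspender records calls, letting the sketch run without AWS access.
type fakeSuspender struct {
	calls map[string][]string
}

func (f *fakeSuspender) SuspendProcesses(asgName string, processes []string) error {
	if f.calls == nil {
		f.calls = map[string][]string{}
	}
	f.calls[asgName] = processes
	return nil
}

func main() {
	f := &fakeSuspender{}
	if err := suspendScalingBeforeDrain(f, []string{"eksctl-ng-1-NodeGroup"}); err != nil {
		panic(err)
	}
	fmt.Println(f.calls["eksctl-ng-1-NodeGroup"]) // prints [Launch ReplaceUnhealthy AZRebalance]
}
```

In eksctl itself this would presumably reuse the existing suspendProcesses task rather than a new wrapper; the sketch only shows the ordering (suspend first, then drain) and the minimum process set.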

The building blocks already exist in eksctl:

  • SuspendProcesses is already in the ASG API interface (pkg/awsapi/autoscaling.go:862)
  • A working suspendProcesses task exists (pkg/eks/nodegroup.go:185-213) that resolves the ASG name from the CF stack and calls SuspendProcesses
  • ASG name lookup via stackCollection.GetAutoScalingGroupName() is already implemented
  • Process name validation is already in place (pkg/apis/eksctl.io/v1alpha5/validation.go:1555)

This approach is:

  • Autoscaler-agnostic — works at the AWS ASG level, not tied to cluster autoscaler or Karpenter
  • Non-destructive — suspending Launch does not affect existing running instances
  • Self-cleaning — CloudFormation stack deletion removes the ASG anyway, no need to resume processes

For managed nodegroups, the underlying ASG name can be retrieved via the EKS DescribeNodegroup API (resources.autoScalingGroups field).
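To sketch that lookup, the structs below locally mirror only the fields of the DescribeNodegroup response that matter here (the real types live in the AWS SDK; these type and function names are illustrative):

```go
package main

import "fmt"

// autoScalingGroup and nodegroupResources locally mirror the
// nodegroup.resources.autoScalingGroups[].name shape of the EKS
// DescribeNodegroup response; they are not the SDK types.
type autoScalingGroup struct {
	Name string
}

type nodegroupResources struct {
	AutoScalingGroups []autoScalingGroup
}

// asgNamesFromResources collects the backing ASG names for a managed
// nodegroup so the same SuspendProcesses step can be applied to them.
func asgNamesFromResources(r nodegroupResources) []string {
	names := make([]string, 0, len(r.AutoScalingGroups))
	for _, g := range r.AutoScalingGroups {
		if g.Name != "" {
			names = append(names, g.Name)
		}
	}
	return names
}

func main() {
	r := nodegroupResources{
		AutoScalingGroups: []autoScalingGroup{{Name: "eks-ng-1-abc123"}},
	}
	fmt.Println(asgNamesFromResources(r)) // prints [eks-ng-1-abc123]
}
```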

Secondary: Fail-safe check after drain

As a simpler interim safeguard or defense-in-depth measure: after all nodegroup drains complete but before deletion begins, re-list nodes for each nodegroup. If any new undrained nodes exist, fail with a clear error message so the user can retry.
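A sketch of that fail-safe as a pure comparison, assuming the caller records the set of node names it drained and re-lists the nodegroup's nodes just before deletion (function and variable names are hypothetical):

```go
package main

import (
	"fmt"
	"sort"
)

// undrainedNodes returns the names of nodes that appear in a post-drain
// re-list but were never part of the drained set, i.e. instances the ASG
// launched mid-drain. A non-empty result means stack deletion would
// terminate nodes that may be running rescheduled workloads, so deletion
// should fail with a clear error instead of proceeding.
func undrainedNodes(drained map[string]bool, currentNodes []string) []string {
	var extra []string
	for _, n := range currentNodes {
		if !drained[n] {
			extra = append(extra, n)
		}
	}
	sort.Strings(extra)
	return extra
}

func main() {
	drained := map[string]bool{"ip-10-0-1-10": true, "ip-10-0-1-11": true}
	// Re-list after drain: a new node has appeared in the nodegroup.
	current := []string{"ip-10-0-1-10", "ip-10-0-1-11", "ip-10-0-2-99"}
	if extra := undrainedNodes(drained, current); len(extra) > 0 {
		fmt.Printf("refusing to delete nodegroup: undrained nodes present: %v\n", extra)
	}
}
```

This check alone does not close the race (a node could still appear between the check and stack deletion), which is why it is proposed as defense-in-depth alongside the SuspendProcesses fix rather than instead of it.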

eksctl version: 0.224.0
kubectl version: v1.34
