
Inconsistencies with MachinePools during Kubernetes upgrade #5546


Open
MadJlzz opened this issue Apr 7, 2025 · 4 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
Milestone

Comments

@MadJlzz
Contributor

MadJlzz commented Apr 7, 2025

/kind bug

What steps did you take and what happened:

On Thursday, April 3rd, we ran into two problems, one of which caused a major disruption of our services during a MachinePool upgrade.

The initial goal was to upgrade our "self-managed" BYO-network clusters, created through capi/capz, from Kubernetes 1.29 to Kubernetes 1.32.

AzureMachinePool partial upgrade (preprod)

We started by performing multiple rolling upgrades of the workload cluster control plane, one version at a time.

No problems occurred and everything went smoothly. We also deployed the cloud-provider-azure version matching each Kubernetes version.
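
For context, the control-plane bumps themselves are just version edits on the KubeadmControlPlane object; a minimal sketch (the object name, namespace, and intermediate patch version below are placeholders, not our real values):

# Placeholder names; bump one minor version at a time and wait for the
# control-plane rollout to finish before the next hop (1.29 -> 1.30 -> 1.31 -> 1.32).
kubectl patch kubeadmcontrolplane k999azc-control-plane -n k999azc \
  --type merge -p '{"spec":{"version":"v1.30.0"}}'
kubectl get kubeadmcontrolplane k999azc-control-plane -n k999azc -w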

[screenshots]

It all went bad when we started running the upgrade on our MachinePool/AzureMachinePool objects (k999azc-endor001).

To do so, we:

  • updated the k999azc-endor001 MachinePool's spec.template.spec.version field to v1.32.0
  • updated the k999azc-endor001 AzureMachinePool's spec.template.image.computeGallery.version field to point at the
    image we had just built (a sketch of both edits follows this list)
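
Roughly, the two edits look like this (a sketch only; the gallery, image name, and versions below are placeholder values):

# MachinePool excerpt
spec:
  template:
    spec:
      version: v1.32.0            # bumped from the previous 1.29.x value

# AzureMachinePool excerpt (placeholder gallery/image values)
spec:
  template:
    image:
      computeGallery:
        gallery: my-gallery       # placeholder
        name: ubuntu-2204-k8s     # placeholder
        version: 1.32.0           # the image we baked ourselves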

We're baking images our own way, not using image-builder as we feel it is still too complex to customize output images.

Here's the section of the AzureMachinePool that might be interesting

orchestrationMode: Flexible
strategy:
  rollingUpdate:
    deletePolicy: Oldest
    maxSurge: 1
    maxUnavailable: 0
  type: RollingUpdate

And a screenshot of clusterctl describe after applying the change

[screenshots: clusterctl describe output]

So far, so good: the VMSS scaled up to 4 instances (thanks to the maxSurge of 1), the VM got provisioned, and the freshly created node joined the cluster (we have 3 replicas for that pool).

[screenshot]

And then, the rollout just stopped. It feels like capi or capz behaves as if all the machines had already been upgraded, even though that's not true.

[screenshot]

As a workaround, we ran multiple manual scale operations using kubectl until we reached the desired state.

# Scale up to deploy a machine with the new image reference.  
kubectl scale machinepool k999azc-endor001 --replicas=4 -n k999azc  
# Scale down and delete the oldest VM.  
kubectl scale machinepool k999azc-endor001 --replicas=3 -n k999azc  
# Rinse and repeat until all machines are upgraded.

Well, that was not smooth but no biggies there.

While debugging, I saw two things that felt weird to me as a user when looking at the underlying VMSS in the Azure Cloud console:

  • the scale-in-policy is set to Default even though the strategy of the AzureMachinePool is Oldest
  • the upgrade policy is set to Manual

I am not sure whether the VMSS is responsible for rolling out all the machines or whether capz is handling the process itself, so this might be unrelated.
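
For reference, those two settings can be read back from the VMSS like this (the resource group and VMSS name below are placeholders):

# Placeholders for the resource group and VMSS; prints the two policies mentioned above.
az vmss show \
  --resource-group k999azc-rg \
  --name k999azc-endor001 \
  --query '{scaleInPolicy: scaleInPolicy, upgradePolicy: upgradePolicy}' \
  --output json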

Also, we can reproduce this problem, and we didn't see anything useful or concerning in the logs.

AzureMachinePool upgrade goes bonkers (prod)

Since the upgrade went fine on the preprod cluster, we decided to perform it on the production one, but it didn't go as planned.

Also, sorry if this section is a bit obscure; we were working hard to keep downtime as short as possible, so we weren't able to take as many notes.

We followed the same process described in the previous section, but we forgot to delete azure-container-registry-config: /etc/kubernetes/azure.json from the kubeletExtraArgs in the underlying MachinePool's KubeadmConfig.

Because of that, the freshly provisioned node couldn't join the cluster, since the kubelet flag --azure-container-registry-config was removed starting with Kubernetes 1.30 (if I recall correctly).
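
For reference, this is the kind of entry that has to disappear before moving to 1.30+ (a sketch assuming the flag lives under the usual joinConfiguration path; adjust to wherever it is actually set):

# KubeadmConfig excerpt (placeholder): remove this entry before upgrading
# nodes to Kubernetes 1.30+, since the kubelet flag no longer exists there.
spec:
  joinConfiguration:
    nodeRegistration:
      kubeletExtraArgs:
        azure-container-registry-config: /etc/kubernetes/azure.json   # <- delete this line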

Then, because of the line removal we performed on the KubeadmConfig, I believe capi triggered a second upgrade of the nodes.

Scaling by hand like we did created a bunch of problems, such as the live VMSS replicas not being in sync with the spec.providerIDList of the MachinePool object, which kept recreating the ghost Machine and AzureMachinePoolMachine objects we were deleting (ghost meaning nodes that no longer exist in Azure Cloud but are still present in the CAPI state).
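
That drift is easy to spot by dumping the list and comparing it with what Azure actually reports for the VMSS (names below are placeholders):

# CAPI's view of the pool members (placeholder names).
kubectl get machinepool k999azc-endor001 -n k999azc \
  -o jsonpath='{.spec.providerIDList[*]}' | tr ' ' '\n'
# Azure's view of the same VMSS.
az vmss list-instances --resource-group k999azc-rg --name k999azc-endor001 \
  --query '[].id' --output tsv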

The worst happened when we tried to scale that pool to two replicas and, for an unknown reason, capi/capz scaled the VMSS down to 1, drastically degrading our stack.

To solve the problem we had to:

  • edit the Cluster object and pause reconciliation by setting spec.paused: true
  • delete the ghost objects (Machine, AzureMachinePoolMachine) and update the spec.providerIDList of the MachinePool object
  • synchronize the number of replicas of the MachinePool with the number we actually had on the VMSS in Azure Cloud
  • kill the pods inside capz-system in case there is an in-memory caching mechanism
  • unpause the cluster (a rough command-level sketch of these steps follows this list)
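
At the command level that looks roughly like this (all names are placeholders, and which objects are ghosts depends on your state):

# 1. Pause reconciliation of the workload cluster.
kubectl patch cluster k999azc -n k999azc --type merge -p '{"spec":{"paused":true}}'
# 2. Delete the ghost per-instance objects, then fix spec.providerIDList by hand.
kubectl delete machine <ghost-machine> -n k999azc
kubectl delete azuremachinepoolmachine <ghost-instance> -n k999azc
kubectl edit machinepool k999azc-endor001 -n k999azc
# 3. Align the replica count with what the VMSS really has.
kubectl scale machinepool k999azc-endor001 -n k999azc --replicas=<actual-count>
# 4. Restart the capz controllers in case anything is cached in memory.
kubectl delete pod -n capz-system --all
# 5. Unpause.
kubectl patch cluster k999azc -n k999azc --type merge -p '{"spec":{"paused":false}}'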

Following those manipulations, new nodes were created and everything went back to normal.

Environment:

Regarding the setup we're using:

  • A GKE Kubernetes cluster
$ kubectl version
Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.31.6-gke.1020000    
  • CAPI and CAPZ (deployed through the operator)
> kubectl get bootstrapproviders,controlplaneproviders,controlplaneproviders,infrastructureproviders -A
NAMESPACE                  NAME                                                  INSTALLEDVERSION   READY
kubeadm-bootstrap-system   bootstrapprovider.operator.cluster.x-k8s.io/kubeadm   v1.9.5             True

NAMESPACE                      NAME                                                     INSTALLEDVERSION   READY
kubeadm-control-plane-system   controlplaneprovider.operator.cluster.x-k8s.io/kubeadm   v1.9.5             True

NAMESPACE     NAME                                                     INSTALLEDVERSION   READY
capa-system   infrastructureprovider.operator.cluster.x-k8s.io/aws     v2.8.1             True
capz-system   infrastructureprovider.operator.cluster.x-k8s.io/azure   v1.17.4            True
@k8s-ci-robot added the kind/bug label on Apr 7, 2025
@dtzar
Contributor

dtzar commented Apr 7, 2025

Just curious - have you tried doing the upgrade using the latest v1.19 release of CAPZ (versus the v1.17.4 shown)?

@MadJlzz
Contributor Author

MadJlzz commented Apr 8, 2025

I did not; I wanted to play it safe and wait for #5410 to be fixed before upgrading. I saw it was merged, so if you cut a v1.19.2 I can try to reproduce on my end.

@nawazkh added this to the v1.20 milestone on Apr 10, 2025
@nawazkh
Member

nawazkh commented Apr 10, 2025

@MadJlzz could you please retry now that #5410 has been fixed?

@nawazkh moved this from Todo to Wait-On-Author in CAPZ Planning on Apr 10, 2025
@MadJlzz
Contributor Author

MadJlzz commented Apr 10, 2025

@nawazkh even after the upgrade the behaviour is similar: it does not roll out all the machines in the pool.

@mboersma added the priority/critical-urgent label on Apr 17, 2025