Inconsistencies with MachinePools during Kubernetes upgrade #5546
Labels: kind/bug, priority/critical-urgent
/kind bug
What steps did you take and what happened:
On Thursday, April 3rd, we suffered from two problems, one being a major disruption of our services during a MachinePool upgrade. The initial goal was to upgrade "self-managed" BYO network clusters created through `capi`/`capz` from Kubernetes 1.29 to Kubernetes 1.32.
AzureMachinePool partial upgrade (preprod)
We started by performing multiple rolling upgrades of the workload cluster control plane, one version at a time. No problems occurred and everything went smoothly. We also deployed the `cloud-provider-azure` version associated with each Kubernetes version.

It all went bad when we started to run the upgrade on our `MachinePool`/`AzureMachinePool` objects (`k999azc-endor001`). To do so, we made the following changes (sketched as patches below):

- updated the `k999azc-endor001` MachinePool's `spec.template.spec.version` field to `v1.32.0`
- updated the `k999azc-endor001` AzureMachinePool's `spec.template.image.computeGallery.version` field to the name of the image we just built
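For reference, the two edits boil down to something like the following, expressed here as ad-hoc patches purely for illustration (the namespace and the new image version are placeholders):

```bash
# Bump the Kubernetes version the pool should run
kubectl patch machinepool k999azc-endor001 -n <namespace> --type merge \
  -p '{"spec":{"template":{"spec":{"version":"v1.32.0"}}}}'

# Point the AzureMachinePool at the freshly built compute gallery image version
kubectl patch azuremachinepool k999azc-endor001 -n <namespace> --type merge \
  -p '{"spec":{"template":{"image":{"computeGallery":{"version":"<new-image-version>"}}}}}'
```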
We're baking images our own way, not using `image-builder`, as we feel it is still too complex to customize the output images. Here's the section of the `AzureMachinePool` that might be interesting, and a screenshot of `clusterctl describe` after applying the change.

So far, so good: the VMSS scaled up to 4 instances (thanks to the `maxSurge` of `1`), the VM got provisioned, and the freshly created node joined the cluster (we have 3 replicas for that pool). And then the rollout just stopped. It feels like `capi` or `capz` behaves as if all the machines were already upgraded, even though that's not true. As a workaround, we ran multiple manual scales using kubectl until we reached the desired state (sketched below). Well, that was not smooth, but no biggie there.
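Concretely, the manual scaling was roughly along these lines (the namespace and replica numbers are placeholders, and the exact commands we typed may have differed):

```bash
# Bump the pool size by hand to force a new instance to be created
kubectl patch machinepool k999azc-endor001 -n <namespace> --type merge \
  -p '{"spec":{"replicas":4}}'

# ...wait for the freshly provisioned node to join, then go back to the desired count
kubectl patch machinepool k999azc-endor001 -n <namespace> --type merge \
  -p '{"spec":{"replicas":3}}'
```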
Upon debugging, I saw two things that felt weird to me as a user when looking at the Azure Cloud console for the underlying VMSS:

- the scale-in policy is `Default`, even though the strategy of the `AzureMachinePool` is `Oldest`
- the upgrade policy is `Manual`

I am not sure whether it's the VMSS that is responsible for rolling out all the machines or whether `capz` handles the process, so that might be unrelated. Also, we can reproduce this problem, and we didn't see anything useful or concerning in the logs.
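For reference, the same two settings can also be read from the CLI rather than the portal (resource group and VMSS name are placeholders):

```bash
# Show the VMSS upgrade policy mode and scale-in policy rules
az vmss show \
  --resource-group <cluster-resource-group> \
  --name <vmss-name> \
  --query '{upgradePolicy: upgradePolicy.mode, scaleInPolicy: scaleInPolicy.rules}' \
  --output json
```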
AzureMachinePool upgrade goes bonkers (prod)
Since the upgrade went fine on the preprod cluster, we decided to perform it on the production one, but it didn't go as planned. Also, sorry if this section is a bit obscure; we were focused on keeping the downtime as short as possible, so we weren't able to take as many notes.
We followed the same process described in the previous section, but we forgot to delete `azure-container-registry-config: /etc/kubernetes/azure.json` from the `kubeletExtraArgs` in the underlying MachinePool's `KubeadmConfig`. Because of that, the freshly provisioned node couldn't join the cluster: the kubelet flag `--azure-container-registry-config` was removed starting from Kubernetes 1.30 (if I recall correctly).
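For illustration, the offending entry lives under the kubelet extra args of the pool's bootstrap config and can be spotted with something like this (the `KubeadmConfig` name is a placeholder, as ours is generated):

```bash
# Inspect the kubeletExtraArgs carried by the pool's KubeadmConfig;
# azure-container-registry-config must be gone before jumping to Kubernetes >= 1.30
kubectl get kubeadmconfig <pool-kubeadmconfig-name> -n <namespace> -o yaml \
  | grep -B2 -A6 kubeletExtraArgs
```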
Then, because of that line removal we performed on the Kubeadm config, I believe `capi` triggered a second upgrade of the nodes.

Scaling by hand like we did created a bunch of problems, such as the VMSS live replicas not being in sync with the `spec.providerIDList` of the `MachinePool` object, which kept recreating the ghost `Machine` and `AzureMachinePoolMachine` objects we were deleting (ghost meaning nodes that no longer exist in Azure Cloud but are still present in the CAPI state).
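For illustration, the desync can be seen by comparing what `capi` tracks with what actually exists on the Azure side (resource group and VMSS name are placeholders):

```bash
# Provider IDs the MachinePool believes it owns
kubectl get machinepool k999azc-endor001 -n <namespace> \
  -o jsonpath='{.spec.providerIDList}'

# Instances that actually exist in the scale set
az vmss list-instances \
  --resource-group <cluster-resource-group> \
  --name <vmss-name> \
  --query '[].name' --output tsv
```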
The worst happened when we tried to scale that pool to two replicas: for an unknown reason, `capi`/`capz` scaled the VMSS down to 1, drastically degrading our stack.

To solve the problem we had to (command sketch after the list):
- edit the `Cluster` object and set the reconciliation to `paused: true`
- edit the pool objects (`MachinePool`, `AzureMachinePool`) and update the `spec.providerIDList` of the `MachinePool` object
- align the replicas of the `MachinePool` with the number we had on the VMSS on Azure Cloud
- restart the controllers in `capz-system` in case there's an in-memory cache mechanism
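A rough sketch of those steps as commands (cluster name and namespace are placeholders, the `providerIDList` and replica fixes were manual edits, and the deployment name assumes a default CAPZ install):

```bash
# Pause reconciliation at the Cluster level
kubectl patch cluster <cluster-name> -n <namespace> --type merge -p '{"spec":{"paused":true}}'

# Fix spec.providerIDList and spec.replicas by hand to match what the VMSS really runs
kubectl edit machinepool k999azc-endor001 -n <namespace>

# Restart the CAPZ controller in case it keeps stale in-memory state
kubectl -n capz-system rollout restart deployment capz-controller-manager

# Resume reconciliation once everything is consistent
kubectl patch cluster <cluster-name> -n <namespace> --type merge -p '{"spec":{"paused":false}}'
```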
Following those manipulations, new nodes were created and everything went back in order.
Environment:
Regarding the setup we're using: