ClusterClass continuously reconciling resource due to incorrect apiVersion #9557

Open
@mnaser

Description

What steps did you take and what happened?

While using CAPO, I noticed a cluster that was reconciling non-stop and eating up a ton of CPU. On further troubleshooting, it looks like the reconciler does not actually use the latest version of the CRD when making its request. My guess is that the object is still stored as v1alpha6 in the database but is presented to the user as v1alpha7.

You can see that v1alpha7 is the newest version:

❯ kubectl -n magnum-system get crd/openstackclusters.infrastructure.cluster.x-k8s.io -oyaml | grep 'cluster.x-k8s.io/v1beta1'
    cluster.x-k8s.io/v1beta1: v1alpha5_v1alpha6_v1alpha7
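
For reference, the label value above is an underscore-separated list of the API versions that satisfy the v1beta1 contract. Below is a minimal Python sketch of picking the newest entry from such a value. The helper names `version_key` and `latest_contract_version` are hypothetical and this is not the Cluster API implementation. It assumes version names of the form `vN`, `vNalphaM` or `vNbetaM` and orders them the way Kubernetes ranks version names (GA above beta above alpha, then by number).

```python
import re

# Matches Kubernetes-style version names such as v1, v1beta1, v1alpha7.
_VERSION_RE = re.compile(r"^v(\d+)(?:(alpha|beta)(\d+))?$")

# Stability rank: GA versions sort above beta, which sort above alpha.
_STAGE_RANK = {"alpha": 0, "beta": 1, None: 2}


def version_key(version):
    """Return a sortable key for a Kubernetes-style version name (hypothetical helper)."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized version name: {version!r}")
    major, stage, minor = match.groups()
    return (_STAGE_RANK[stage], int(major), int(minor or 0))


def latest_contract_version(label_value):
    """Pick the newest version from an underscore-separated label value."""
    versions = [v for v in label_value.split("_") if v]
    if not versions:
        raise ValueError("label value lists no versions")
    return max(versions, key=version_key)


if __name__ == "__main__":
    # Label value taken from the CRD output above.
    print(latest_contract_version("v1alpha5_v1alpha6_v1alpha7"))
```

With the label value from this cluster, the sketch selects `v1alpha7`, which matches the `infrastructureRef` shown further down.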

The Cluster resource agrees with this too:

❯ kubectl -n magnum-system get cluster/kube-cmd33 -oyaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
...
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha7
    kind: OpenStackCluster
    name: kube-cmd33-4k46x
    namespace: magnum-system
...

However, with the log verbosity turned all the way up, you can see that the update request is sent to v1alpha6 instead (snipped from the logs):

I1014 02:10:26.010999       1 round_trippers.go:466] curl -v -XPATCH  -H "User-Agent: manager/v1.5.1 cluster-api-controller-manager (linux/amd64) cluster.x-k8s.io/db17cb2" -H "Authorization: Bearer <masked>" -H "Content-Type: application/apply-patch+yaml" -H "Accept: application/json" 'https://10.96.0.1:443/apis/infrastructure.cluster.x-k8s.io/v1alpha6/namespaces/magnum-system/openstackclusters/kube-cmd33-4k46x?fieldManager=capi-topology&force=true'
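
To make the mismatch easier to spot across many log lines, the API version can be pulled out of the request URL. This is a small sketch under stated assumptions: `parse_resource_path` and `ResourcePath` are hypothetical names, and the function only handles namespaced resources under `/apis/<group>/<version>/namespaces/<namespace>/<resource>/<name>`, which is the shape of the URL in the log line above.

```python
from collections import namedtuple
from urllib.parse import urlsplit

# Hypothetical container for the pieces of a namespaced resource URL.
ResourcePath = namedtuple("ResourcePath", "group version namespace resource name")


def parse_resource_path(url):
    """Split a namespaced Kubernetes API URL into group, version, namespace, resource, name."""
    # urlsplit drops the query string (fieldManager, force) from the path.
    parts = urlsplit(url).path.strip("/").split("/")
    if len(parts) != 7 or parts[0] != "apis" or parts[3] != "namespaces":
        raise ValueError(f"unsupported API path: {url!r}")
    _, group, version, _, namespace, resource, name = parts
    return ResourcePath(group, version, namespace, resource, name)


if __name__ == "__main__":
    # URL copied from the round_trippers.go log line above.
    url = (
        "https://10.96.0.1:443/apis/infrastructure.cluster.x-k8s.io/v1alpha6/"
        "namespaces/magnum-system/openstackclusters/kube-cmd33-4k46x"
        "?fieldManager=capi-topology&force=true"
    )
    print(parse_resource_path(url).version)
```

For the logged request, the parsed version is `v1alpha6`, while the Cluster's `infrastructureRef` points at `v1alpha7`.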

Because of that, it almost always 'notices' a change and loops endlessly. I made a diff between the current object and what the dry-run apply produces from the data it is sending:

❯ diff -uNr /tmp/currobj.yml /tmp/dryrunapplied.yml
--- /tmp/currobj.yml	2023-10-13 21:45:47
+++ /tmp/dryrunapplied.yml	2023-10-13 21:46:00
@@ -1,13 +1,14 @@
-apiVersion: infrastructure.cluster.x-k8s.io/v1alpha7
+apiVersion: infrastructure.cluster.x-k8s.io/v1alpha6
 kind: OpenStackCluster
 metadata:
   annotations:
     cluster.x-k8s.io/cloned-from-groupkind: OpenStackClusterTemplate.infrastructure.cluster.x-k8s.io
     cluster.x-k8s.io/cloned-from-name: magnum-v0.9.1
+    topology.cluster.x-k8s.io/dry-run: ""
   creationTimestamp: "2023-07-23T09:03:45Z"
   finalizers:
   - openstackcluster.infrastructure.cluster.x-k8s.io
-  generation: 3429950
+  generation: 3430011
   labels:
     cluster.x-k8s.io/cluster-name: kube-cmd33
     topology.cluster.x-k8s.io/owned: ""
@@ -20,7 +21,7 @@
     kind: Cluster
     name: kube-cmd33
     uid: 0108ecb7-9e6e-4045-a5f0-811a8aade488
-  resourceVersion: "89143700"
+  resourceVersion: "89143889"
   uid: 0abd98ab-6010-43be-b028-44a0df84e597
 spec:
   allowAllInClusterTraffic: false
@@ -154,16 +155,16 @@
   network:
     id: a91dc22f-86fc-4677-938b-f15da173178e
     name: k8s-clusterapi-cluster-magnum-system-kube-cmd33
-    subnets:
-    - cidr: 10.0.0.0/24
+    router:
+      id: dcf60b96-6ceb-42fe-8d17-7f2c1b8b99a8
+      ips:
+      - 46.246.75.135
+      name: k8s-clusterapi-cluster-magnum-system-kube-cmd33
+    subnet:
+      cidr: 10.0.0.0/24
       id: 0ddb7a30-1bcb-4940-83b6-bf91ddadec8b
       name: k8s-clusterapi-cluster-magnum-system-kube-cmd33
   ready: true
-  router:
-    id: dcf60b96-6ceb-42fe-8d17-7f2c1b8b99a8
-    ips:
-    - A.B.C.D
-    name: k8s-clusterapi-cluster-magnum-system-kube-cmd33
   workerSecurityGroup:
     id: c1237980-280d-44de-9ff2-4fe5a4e20d9a
     name: k8s-cluster-magnum-system-kube-cmd33-secgroup-worker

So because the OpenStackCluster schema changed between versions, and the controller is (somehow) pulling v1alpha6 while v1alpha7 is the version actually expected, it just keeps looping. I suspect there is a spot in the code that fails to pick up the up-to-date version of the infrastructureRef.
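
As an illustration of the behaviour I would expect, here is a minimal Python sketch that rewrites a reference so its `apiVersion` uses a given version while keeping the API group. `with_api_version` is a hypothetical helper, the stale reference is made up to mirror the v1alpha6 request seen in the logs, and none of this is the actual Cluster API code path or a proposed patch.

```python
def with_api_version(ref, version):
    """Return a copy of ref whose apiVersion uses the given version, keeping the API group."""
    group, sep, _ = ref["apiVersion"].rpartition("/")
    if not sep:
        raise ValueError("expected apiVersion of the form <group>/<version>")
    updated = dict(ref)  # shallow copy so the caller's reference is left untouched
    updated["apiVersion"] = f"{group}/{version}"
    return updated


if __name__ == "__main__":
    # Assumed stale reference, mirroring the version used in the PATCH request.
    stale_ref = {
        "apiVersion": "infrastructure.cluster.x-k8s.io/v1alpha6",
        "kind": "OpenStackCluster",
        "name": "kube-cmd33-4k46x",
        "namespace": "magnum-system",
    }
    print(with_api_version(stale_ref, "v1alpha7")["apiVersion"])
```

If the version used for the patch were derived this way from the newest contract version, the request would target v1alpha7 and the dry-run result would be compared against an object of the same version.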

What did you expect to happen?

No reconcile loop, and none of this showing up in the logs:

I1014 02:13:28.108463       1 reconcile_state.go:284] "Patching OpenStackCluster/kube-cmd33-4k46x" controller="topology/cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="magnum-system/kube-cmd33" namespace="magnum-system" name="kube-cmd33" reconcileID=2e8e2bc8-cc63-4b35-8c25-b83181bbd883 resource={Group:infrastructure.cluster.x-k8s.io Version:v1alpha6 Resource:OpenStackCluster} OpenStackCluster="magnum-system/kube-cmd33-4k46x"

...repeating non-stop.

Cluster API version

Cluster API 1.5.1 + CAPO 0.8.0

Kubernetes version

No response

Anything else you would like to add?

No response

Label(s) to be applied

/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

Metadata

Assignees

No one assigned

    Labels

    area/clusterclass: Issues or PRs related to clusterclass
    help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
    kind/bug: Categorizes issue or PR as related to a bug.
    needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
    priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
