additionalUserData.data formatting differences cause drift and Machine recreation #711

@mmb

Description

What happened:

When adding additional cloud-init configuration to an RKE2ControlPlane, the Machine's rendered RKE2Config has whitespace differences from the RKE2ControlPlane spec. On a single-node cluster this makes the provider think the node has drifted and needs a rollout, so the Machine is deleted as soon as the control plane becomes ready. I have not tested with a multi-node control plane, but I'm guessing this would cause continuous rollouts.

This only happens with additionalUserData.data, not with additionalUserData.config, which works as a workaround.

Example additionalUserData in an RKE2ControlPlaneTemplate:

---
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: RKE2ControlPlaneTemplate
metadata:
  name: test
spec:
  template:
    spec:
      agentConfig:
        additionalUserData:
          data:
            bootcmd: |
              - echo "fs.inotify.max_user_instances=8192" >> /etc/sysctl.d/99-custom.conf
              - sysctl --system
            device_aliases: |
              data: /dev/nvme0n1
            disk_setup: |
              data:
                layout: true
                overwrite: true
                table_type: gpt
            fs_setup: |
              - device: data.1
                filesystem: ext4
                label: data
            mounts: |
              - [ LABEL=data, /data ]

These two dumps show the whitespace differences between the RKE2ControlPlane (first) and the RKE2Config (second):

{"spec"=>
  {"agentConfig"=>
    {"additionalUserData"=>
      {"data"=>
        {"bootcmd"=>
          "- echo \"fs.inotify.max_user_instances=8192\" >> /etc/sysctl.d/99-custom.conf\n" +
          "- sysctl --system\n",
         "device_aliases"=>"data: /dev/nvme0n1\n",
         "disk_setup"=>
          "data:\n" +
          "  layout: true\n" +
          "  overwrite: true\n" +
          "  table_type: gpt\n",
         "fs_setup"=>
          "- device: data.1\n" +
          "  filesystem: ext4\n" +
          "  label: data\n",
         "mounts"=>"- [ LABEL=data, /data ]\n"}}}}}

{"spec"=>
  {"agentConfig"=>
    {"additionalUserData"=>
      {"data"=>
        {"bootcmd"=>
          "\n" +
          "- echo \"fs.inotify.max_user_instances=8192\" >> /etc/sysctl.d/99-custom.conf\n" +
          "- sysctl --system\n",
         "device_aliases"=>"\n" + "  data: /dev/nvme0n1\n" + "  ",
         "disk_setup"=>
          "\n" +
          "  data:\n" +
          "    layout: true\n" +
          "    overwrite: true\n" +
          "    table_type: gpt\n" +
          "  ",
         "fs_setup"=>
          "\n" +
          "- device: data.1\n" +
          "  filesystem: ext4\n" +
          "  label: data\n",
         "mounts"=>"\n" + "- - LABEL=data\n" + "  - /data\n"}}}}}
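To make the nature of the drift concrete, here is a minimal Python sketch that classifies three of the values copied from the dumps above. The `normalize` and `classify` helpers are hypothetical names used only for illustration. This is not how the provider compares specs, just one way to see which differences are pure whitespace and which come from the YAML being re-serialized.

```python
import textwrap

# Values copied from the dumps above: RKE2ControlPlane side first,
# rendered RKE2Config side second.
spec = {
    "bootcmd": '- echo "fs.inotify.max_user_instances=8192" >> /etc/sysctl.d/99-custom.conf\n'
               "- sysctl --system\n",
    "device_aliases": "data: /dev/nvme0n1\n",
    "mounts": "- [ LABEL=data, /data ]\n",
}
rendered = {
    "bootcmd": "\n"
               '- echo "fs.inotify.max_user_instances=8192" >> /etc/sysctl.d/99-custom.conf\n'
               "- sysctl --system\n",
    "device_aliases": "\n  data: /dev/nvme0n1\n  ",
    "mounts": "\n- - LABEL=data\n  - /data\n",
}


def normalize(value: str) -> str:
    """Hypothetical helper: drop common indentation and surrounding whitespace."""
    return textwrap.dedent(value).strip()


def classify(key: str) -> str:
    """Label how the rendered value differs from the spec value."""
    if spec[key] == rendered[key]:
        return "identical"
    if normalize(spec[key]) == normalize(rendered[key]):
        return "whitespace only"
    return "re-serialized"


report = {key: classify(key) for key in spec}
for key, verdict in report.items():
    print(f"{key}: {verdict}")
```

With these inputs, bootcmd and device_aliases differ only in leading newlines and indentation, while mounts comes back in a different YAML style (block sequence instead of flow sequence). Trimming whitespace alone would therefore not make the two sides equal. A comparison that parses both strings as YAML would be needed to treat them as the same.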

What did you expect to happen:

There should be no changes so a rollout is not triggered.

How to reproduce it:

Create an RKE2ControlPlane with additionalUserData.data like the example above. Bring up a 1-node cluster and observe that the Machine is deleted after the control plane becomes ready.

Anything else you would like to add:

Environment:

  • rke provider version: 0.17.1
  • OS (e.g. from /etc/os-release):

Labels

kind/bug, priority/important-soon, triage/accepted

Status

Done
