Skip to content

[CASCL-1305] No-op kubectl-datadog install on a foreign Karpenter#2949

Open
L3n41c wants to merge 5 commits intomainfrom
lenaic/CASCL-1305-foreign-karpenter-guard
Open

[CASCL-1305] No-op kubectl-datadog install on a foreign Karpenter#2949
L3n41c wants to merge 5 commits intomainfrom
lenaic/CASCL-1305-foreign-karpenter-guard

Conversation

@L3n41c
Copy link
Copy Markdown
Member

@L3n41c L3n41c commented Apr 27, 2026

What does this PR do?

Adds a pre-flight guard to kubectl datadog autoscaling cluster install that recognises when Karpenter is already running on the cluster — be it a Helm install in another namespace, a raw kubectl apply, or anything in between — and short-circuits with a friendly success message and a link to the Datadog Autoscaling settings page, mirroring the existing EKS auto-mode handling (no error, exit 0).

Detection lists every ClusterRole on the cluster and looks for any rule referencing the karpenter.sh API group, which the chart's clusterrole.yaml and clusterrole-core.yaml hard-code for nodepools/nodeclaims regardless of nameOverride or other metadata customizations. ClusterRoles bearing our autoscaling.datadoghq.com/installed-by=kubectl-datadog sentinel are skipped. CRDs are deliberately not used — Helm leaves them behind on helm uninstall, and re-running install after uninstall must succeed in spite of those leftovers.

The PR also switches the chart's additionalLabels from overriding standard app.kubernetes.io/managed-by and app.kubernetes.io/version to Datadog-namespaced keys, exported as constants so the writer (install.go) and reader (IsForeignKarpenterInstalled) stay in sync.

Motivation

PR #2717 (CASCL-1281) added a guard for EKS auto-mode clusters. We need a symmetric guard for any other cluster where Karpenter is already running, otherwise re-running install on such a cluster would deploy a second controller that would race with the first one on the same CRDs.

We use a structural RBAC fingerprint rather than a label selector because the Karpenter chart's karpenter.name template renders both the app.kubernetes.io/name label value and the ClusterRole's name from nameOverride. A label-based selector would silently miss any install with --set nameOverride=... and let our install deploy a second controller side-by-side. The karpenter.sh API group is invariant across nameOverride changes — every Karpenter controller needs RBAC on it.

The Datadog-namespaced label switch is required because the Karpenter chart's _helpers.tpl emits app.kubernetes.io/managed-by: {{ .Release.Service }} and app.kubernetes.io/version: ... before additionalLabels, producing duplicate YAML keys. Verified empirically on a real cluster: deduplication at the API server is non-deterministic across keys (managed-by resolves to the chart's Helm value while version resolves to ours, on the very same resource), so we cannot rely on those overrides as identity markers.

Additional Notes

Migration of existing kubectl-datadog installs. Plugin installs deployed before this change carry the old app.kubernetes.io/managed-by=kubectl-datadog value (silently dropped by the chart) and lack the new sentinel. The new code will detect them as foreign and no-op. To migrate, run once:

kubectl datadog autoscaling cluster uninstall
kubectl datadog autoscaling cluster install

Acceptable because the install command is recent and not yet broadly deployed.

Out of scope. Resources we apply directly (NodePool / EC2NodeClass via cmd/kubectl-datadog/.../install/k8s/) keep their existing app.kubernetes.io/managed-by=kubectl-datadog label — no chart helper to fight there, the label sticks, and uninstall.go continues to use it as a selector.

Minimum Agent Versions

  • Agent: n/a
  • Cluster Agent: n/a

Describe your test plan

Unit tests:

  • go test ./cmd/kubectl-datadog/autoscaling/cluster/install/... — covers
    • IsForeignKarpenterInstalled: no Karpenter, ours-only, foreign-only, mixed, foreign sentinel value, ClusterRole with karpenter-looking labels but no karpenter.sh rules, foreign install with custom nameOverride, API list-error paths;
    • TestKarpenterAPIGroupContract pins the literal karpenter.sh API group so test fixtures using the constant cannot silently mask a typo;
    • TestKarpenterHelmValues pins the additionalLabels contract with the detector — both sides go through the same exported constants — and covers both install modes (only Fargate annotates the ServiceAccount with the IRSA role ARN);
    • TestDisplayForeignKarpenterMessage renders the box into a buffered cobra Command and asserts the user-visible strings, including the URL-encoded kube_cluster_name query that points the user at their cluster's Datadog autoscaling settings page.

End-to-end on a real EKS test cluster (AWS_PROFILE=exec-sso-container-integrations-account-admin AWS_REGION=eu-west-3):

  1. No-op on a foreign Helm Karpenterhelm install karpenter oci://public.ecr.aws/karpenter/karpenter --namespace karpenter --create-namespace --set settings.clusterName=$CLUSTER. Then kubectl datadog autoscaling cluster install → message + URL, exit 0, no dd-karpenter-… CFN stacks created (aws cloudformation list-stacks).
  2. No-op on a foreign Helm Karpenter with nameOverridehelm install foo oci://... --set nameOverride=foo,settings.clusterName=$CLUSTER. Then kubectl datadog autoscaling cluster install → still detected and no-op'd, even though no ClusterRole carries app.kubernetes.io/name=karpenter.
  3. No-op on a raw-applied Karpenter — apply a single ClusterRole with a rule on karpenter.sh and run install → variant of the no-op message, exit 0.
  4. Re-install after our uninstall (CRDs left over)helm uninstall karpenter -n karpenter, then kubectl datadog autoscaling cluster install → succeeds in spite of the four residual CRDs.
  5. Auto-mode short-circuit unchanged — on an auto-mode cluster, the auto-mode message is what shows up, not the new one.

Checklist

  • PR has at least one valid label: enhancement
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed

Add a pre-flight guard to `kubectl datadog autoscaling cluster install`
that recognises when Karpenter is already running on the cluster — be it
a Helm install in another namespace, a raw `kubectl apply`, or anything
in between — and short-circuits with a friendly success message and a
link to the Datadog Autoscaling settings page, mirroring the existing
EKS auto-mode handling. Detection scans cluster-scoped ClusterRoles
labelled `app.kubernetes.io/name=karpenter` for the absence of our
`autoscaling.datadoghq.com/installed-by=kubectl-datadog` sentinel; CRDs
are ignored because Helm leaves them behind on `helm uninstall` and a
re-install must still succeed in that case.

Switch the chart's `additionalLabels` from overriding standard
`app.kubernetes.io/managed-by` and `app.kubernetes.io/version` to
Datadog-namespaced keys: the chart's `_helpers.tpl` emits the standard
keys before `additionalLabels`, producing duplicate YAML keys whose
deduplication at the API server is non-deterministic (verified
empirically — `managed-by` resolves to the chart's `Helm` while
`version` resolves to ours, on the very same resource). The new keys
are exported as constants so writer and reader stay in sync.

Migration: existing plugin installs lack the new sentinel and will be
flagged as foreign on the next `install`. Users have to run
`kubectl datadog autoscaling cluster uninstall` followed by
`kubectl datadog autoscaling cluster install` once with the new binary
to migrate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@L3n41c L3n41c added enhancement New feature or request qa/skip-qa labels Apr 27, 2026
@L3n41c L3n41c added this to the v1.27.0 milestone Apr 27, 2026
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 27, 2026

Codecov Report

❌ Patch coverage is 90.24390% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.19%. Comparing base (78490c3) to head (e0cef1f).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...ctl-datadog/autoscaling/cluster/install/install.go 77.77% 4 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2949      +/-   ##
==========================================
+ Coverage   40.91%   41.19%   +0.27%     
==========================================
  Files         324      325       +1     
  Lines       28743    28951     +208     
==========================================
+ Hits        11760    11925     +165     
- Misses      16129    16170      +41     
- Partials      854      856       +2     
Flag Coverage Δ
unittests 41.19% <90.24%> (+0.27%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...oscaling/cluster/install/guess/foreignkarpenter.go 100.00% <100.00%> (ø)
...ctl-datadog/autoscaling/cluster/install/install.go 30.68% <77.77%> (+14.90%) ⬆️

... and 3 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 78490c3...e0cef1f. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@datadog-prod-us1-4
Copy link
Copy Markdown

datadog-prod-us1-4 Bot commented Apr 27, 2026

Code Coverage

🎯 Code Coverage (details)
Patch Coverage: 89.19%
Overall Coverage: 41.28% (+0.26%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: e0cef1f | Docs | Datadog PR Page | Give us feedback!

@L3n41c
Copy link
Copy Markdown
Member Author

L3n41c commented Apr 27, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fd65ec0c9d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

const (
karpenterChartNameLabel = "app.kubernetes.io/name"
karpenterChartNameValue = "karpenter"
karpenterClusterRoleSelector = karpenterChartNameLabel + "=" + karpenterChartNameValue
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Detect nameOverride-based Karpenter installs

This selector misses existing Karpenter releases installed with the chart’s supported nameOverride value, because the upstream label app.kubernetes.io/name is rendered from that override rather than always being karpenter. In that scenario the list returns no ClusterRoles, so kubectl datadog autoscaling cluster install falls through and installs a second controller instead of no-oping; this affects Helm/Argo/raw-manifest installs that customize nameOverride.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Testing

  • gofmt -w cmd/kubectl-datadog/autoscaling/cluster/install/guess/foreignkarpenter.go cmd/kubectl-datadog/autoscaling/cluster/install/guess/foreignkarpenter_test.go
  • go test ./cmd/kubectl-datadog/autoscaling/cluster/install/guess/...

View task →

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in cdbb6c7: switched detection from the app.kubernetes.io/name=karpenter LabelSelector to a structural fingerprint on the karpenter.sh API group. We list every ClusterRole and look for a rule referencing that group, which the chart hard-codes for nodepools/nodeclaims regardless of nameOverride or other metadata customizations. Added a regression test foreign Karpenter installed with custom nameOverride and a defensive case to make sure a ClusterRole with karpenter-looking labels but no karpenter.sh rules does not trigger the guard.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same fix landed on this branch as cdbb6c7 (independently arrived at the same karpenter.sh API group fingerprint). No follow-up PR needed.

L3n41c and others added 2 commits April 27, 2026 22:54
Datadog PR Gates flagged the patch coverage at 40.74%, well below the
threshold. The new code in install.go (the Helm-values block carrying
the Datadog ownership sentinel and the foreign-Karpenter no-op message)
was unreached by tests. Adds:

- TestKarpenterHelmValues: pins the additionalLabels contract with the
  IsForeignKarpenterInstalled detector — both sides go through the same
  exported constants — and covers both install modes (the IRSA
  ServiceAccount annotation only appears on Fargate).
- TestDisplayForeignKarpenterMessage: renders the box into a buffered
  cobra Command and asserts the user-visible strings, including the
  URL-encoded kube_cluster_name query that points the user at their
  cluster's Datadog autoscaling settings page.

The display test sets PATH to empty so github.com/pkg/browser's
LookPath probe of xdg-open / x-www-browser / www-browser fails fast and
browser.OpenURL returns ErrNotFound — without that, the spawned
xdg-open holds the pipe writer for browser.Stdout and browser.Stderr
open and exec.Cmd.Wait blocks indefinitely on the pipe-copy goroutine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex review on the PR head pointed out that the previous detection
selector (`app.kubernetes.io/name=karpenter`) misses Karpenter installs
deployed via the upstream Helm chart with a custom `nameOverride`: the
chart renders both the label value and the ClusterRole name from that
override, so the LabelSelector returns no items and the install command
falls through to deploy a second controller side-by-side.

Switch to a structural fingerprint instead. The chart's clusterrole.yaml
and clusterrole-core.yaml hard-code rules on the `karpenter.sh` API
group (for nodepools/nodeclaims) — that group is independent of any
metadata customization and is the canonical contract every Karpenter
controller needs. We now list every ClusterRole and look for any rule
referencing karpenter.sh, then skip the ones bearing our InstalledBy
sentinel.

Adds a regression test that asserts a foreign install with `nameOverride`
is detected, and a defensive case that ensures a ClusterRole carrying
karpenter-looking labels but no real karpenter.sh rules does not trigger
the guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@L3n41c
Copy link
Copy Markdown
Member Author

L3n41c commented Apr 27, 2026

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. 🚀

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

L3n41c and others added 2 commits April 28, 2026 11:06
The foreign-Karpenter detector previously called `List` without a
`Limit`, which means client-go does NOT auto-paginate and asks the
API server for the entire ClusterRole collection in one shot. On a
dense cluster with thousands of ClusterRoles that materialises the full
slice in memory just to answer a yes/no question.

Switch to the same Limit + Continue pattern already used by
`GetNodesProperties` in the same package, with `clusterRoleListChunkSize
= 100`. Early-exit on the first foreign match means we don't paginate
to the end either; on a small cluster this is a single round-trip and
strictly equivalent to the previous behaviour.

Add a focused pagination subtest that wires a reactor returning four
synthetic pages and asserts: empty Continue on the first call, token
forwarding on subsequent calls, fixed Limit, skipping our own
ClusterRoles, foreign found on page 3 returns true, and page 4 never
requested. A `require.Less` guards the reactor's slice access against
over-fetch with a clear failure message.

Also refactor `hasKarpenterAPIGroupRule` to `lo.ContainsBy`, matching
the predicate-search idiom already used by `IsEKSAutoModeEnabled` in
the same package.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@L3n41c
Copy link
Copy Markdown
Member Author

L3n41c commented Apr 28, 2026

In case a Karpenter not installed by kubectl is detected, the command returns immediately with:

$ kubectl datadog autoscaling cluster install
2026/04/28 12:15:49 Detected foreign Karpenter ClusterRole
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Karpenter is already installed on cluster lenaic-karpenter-test.                                         │
│                                                                                                          │
│ kubectl-datadog has nothing to install.                                                                  │
│                                                                                                          │
│ Navigate to the Autoscaling settings page                                                                │
│ and select cluster to start generating recommendations:                                                  │
│ https://app.datadoghq.com/orchestration/scaling/settings?query=kube_cluster_name%3Alenaic-karpenter-test │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────╯

@L3n41c L3n41c marked this pull request as ready for review April 28, 2026 10:19
@L3n41c L3n41c requested a review from a team April 28, 2026 10:19
@L3n41c L3n41c requested a review from a team as a code owner April 28, 2026 10:19
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e0cef1fb57

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// looser check stays robust against upstream resource additions.
func hasKarpenterAPIGroupRule(rules []rbacv1.PolicyRule) bool {
return lo.ContainsBy(rules, func(rule rbacv1.PolicyRule) bool {
return slices.Contains(rule.APIGroups, karpenterAPIGroup)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Require a real Karpenter fingerprint before no-oping

On clusters that already have the Datadog Operator installed but no Karpenter controller, this API-group-only check returns true because the operator's own ClusterRole grants karpenter.sh permissions (config/rbac/role.yaml:378-389, generated from internal/controller/datadogagent_controller.go:103). Since install.go now exits when this helper reports true, those users are incorrectly told Karpenter is installed and cannot use the plugin to install autoscaling; please exclude the operator role or match a more Karpenter-specific RBAC shape.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants