[CASCL-1305] No-op kubectl-datadog install on a foreign Karpenter#2949
[CASCL-1305] No-op kubectl-datadog install on a foreign Karpenter#2949
Conversation
Add a pre-flight guard to `kubectl datadog autoscaling cluster install` that recognises when Karpenter is already running on the cluster — be it a Helm install in another namespace, a raw `kubectl apply`, or anything in between — and short-circuits with a friendly success message and a link to the Datadog Autoscaling settings page, mirroring the existing EKS auto-mode handling. Detection scans cluster-scoped ClusterRoles labelled `app.kubernetes.io/name=karpenter` for the absence of our `autoscaling.datadoghq.com/installed-by=kubectl-datadog` sentinel; CRDs are ignored because Helm leaves them behind on `helm uninstall` and a re-install must still succeed in that case. Switch the chart's `additionalLabels` from overriding standard `app.kubernetes.io/managed-by` and `app.kubernetes.io/version` to Datadog-namespaced keys: the chart's `_helpers.tpl` emits the standard keys before `additionalLabels`, producing duplicate YAML keys whose deduplication at the API server is non-deterministic (verified empirically — `managed-by` resolves to the chart's `Helm` while `version` resolves to ours, on the very same resource). The new keys are exported as constants so writer and reader stay in sync. Migration: existing plugin installs lack the new sentinel and will be flagged as foreign on the next `install`. Users have to run `kubectl datadog autoscaling cluster uninstall` followed by `kubectl datadog autoscaling cluster install` once with the new binary to migrate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2949 +/- ##
==========================================
+ Coverage 40.91% 41.19% +0.27%
==========================================
Files 324 325 +1
Lines 28743 28951 +208
==========================================
+ Hits 11760 11925 +165
- Misses 16129 16170 +41
- Partials 854 856 +2
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 3 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
|
🎯 Code Coverage (details) 🔗 Commit SHA: e0cef1f | Docs | Datadog PR Page | Give us feedback! |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fd65ec0c9d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const ( | ||
| karpenterChartNameLabel = "app.kubernetes.io/name" | ||
| karpenterChartNameValue = "karpenter" | ||
| karpenterClusterRoleSelector = karpenterChartNameLabel + "=" + karpenterChartNameValue |
There was a problem hiding this comment.
Detect nameOverride-based Karpenter installs
This selector misses existing Karpenter releases installed with the chart’s supported nameOverride value, because the upstream label app.kubernetes.io/name is rendered from that override rather than always being karpenter. In that scenario the list returns no ClusterRoles, so kubectl datadog autoscaling cluster install falls through and installs a second controller instead of no-oping; this affects Helm/Argo/raw-manifest installs that customize nameOverride.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Summary
- Updated foreign Karpenter detection to scan all
ClusterRoles and identify Karpenter by RBAC rule content (karpenter.shAPI group +nodepools/nodeclaimsresources), instead of relying onapp.kubernetes.io/name=karpenter. This makes detection robust to chartnameOverride. cmd/kubectl-datadog/autoscaling/cluster/install/guess/foreignkarpenter.goL24-L48 cmd/kubectl-datadog/autoscaling/cluster/install/guess/foreignkarpenter.goL50-L79 - Added helper matching functions (
hasKarpenterAPIGroupRule,hasKarpenterResource,contains) to keep the detection logic explicit and testable. cmd/kubectl-datadog/autoscaling/cluster/install/guess/foreignkarpenter.goL50-L79 - Updated/expanded unit tests to model RBAC-rule-based detection and added a dedicated case for
foreign Karpenter installed with custom nameOverrideto verify renamed releases are still detected as foreign. cmd/kubectl-datadog/autoscaling/cluster/install/guess/foreignkarpenter_test.goL16-L114 cmd/kubectl-datadog/autoscaling/cluster/install/guess/foreignkarpenter_test.goL75-L82 - Committed the changes on the current branch (
64986c1) and created a follow-up PR via themake_prtool.
Testing
- ✅
gofmt -w cmd/kubectl-datadog/autoscaling/cluster/install/guess/foreignkarpenter.go cmd/kubectl-datadog/autoscaling/cluster/install/guess/foreignkarpenter_test.go - ✅
go test ./cmd/kubectl-datadog/autoscaling/cluster/install/guess/...
There was a problem hiding this comment.
Addressed in cdbb6c7: switched detection from the app.kubernetes.io/name=karpenter LabelSelector to a structural fingerprint on the karpenter.sh API group. We list every ClusterRole and look for a rule referencing that group, which the chart hard-codes for nodepools/nodeclaims regardless of nameOverride or other metadata customizations. Added a regression test foreign Karpenter installed with custom nameOverride and a defensive case to make sure a ClusterRole with karpenter-looking labels but no karpenter.sh rules does not trigger the guard.
There was a problem hiding this comment.
Same fix landed on this branch as cdbb6c7 (independently arrived at the same karpenter.sh API group fingerprint). No follow-up PR needed.
Datadog PR Gates flagged the patch coverage at 40.74%, well below the threshold. The new code in install.go (the Helm-values block carrying the Datadog ownership sentinel and the foreign-Karpenter no-op message) was unreached by tests. Adds: - TestKarpenterHelmValues: pins the additionalLabels contract with the IsForeignKarpenterInstalled detector — both sides go through the same exported constants — and covers both install modes (the IRSA ServiceAccount annotation only appears on Fargate). - TestDisplayForeignKarpenterMessage: renders the box into a buffered cobra Command and asserts the user-visible strings, including the URL-encoded kube_cluster_name query that points the user at their cluster's Datadog autoscaling settings page. The display test sets PATH to empty so github.com/pkg/browser's LookPath probe of xdg-open / x-www-browser / www-browser fails fast and browser.OpenURL returns ErrNotFound — without that, the spawned xdg-open holds the pipe writer for browser.Stdout and browser.Stderr open and exec.Cmd.Wait blocks indefinitely on the pipe-copy goroutine. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex review on the PR head pointed out that the previous detection selector (`app.kubernetes.io/name=karpenter`) misses Karpenter installs deployed via the upstream Helm chart with a custom `nameOverride`: the chart renders both the label value and the ClusterRole name from that override, so the LabelSelector returns no items and the install command falls through to deploy a second controller side-by-side. Switch to a structural fingerprint instead. The chart's clusterrole.yaml and clusterrole-core.yaml hard-code rules on the `karpenter.sh` API group (for nodepools/nodeclaims) — that group is independent of any metadata customization and is the canonical contract every Karpenter controller needs. We now list every ClusterRole and look for any rule referencing karpenter.sh, then skip the ones bearing our InstalledBy sentinel. Adds a regression test that asserts a foreign install with `nameOverride` is detected, and a defensive case that ensures a ClusterRole carrying karpenter-looking labels but no real karpenter.sh rules does not trigger the guard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@codex review |
|
Codex Review: Didn't find any major issues. 🚀 ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
The foreign-Karpenter detector previously called `List` without a `Limit`, which means client-go does NOT auto-paginate and asks the API server for the entire ClusterRole collection in one shot. On a dense cluster with thousands of ClusterRoles that materialises the full slice in memory just to answer a yes/no question. Switch to the same Limit + Continue pattern already used by `GetNodesProperties` in the same package, with `clusterRoleListChunkSize = 100`. Early-exit on the first foreign match means we don't paginate to the end either; on a small cluster this is a single round-trip and strictly equivalent to the previous behaviour. Add a focused pagination subtest that wires a reactor returning four synthetic pages and asserts: empty Continue on the first call, token forwarding on subsequent calls, fixed Limit, skipping our own ClusterRoles, foreign found on page 3 returns true, and page 4 never requested. A `require.Less` guards the reactor's slice access against over-fetch with a clear failure message. Also refactor `hasKarpenterAPIGroupRule` to `lo.ContainsBy`, matching the predicate-search idiom already used by `IsEKSAutoModeEnabled` in the same package. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
In case a Karpenter not installed by $ kubectl datadog autoscaling cluster install
2026/04/28 12:15:49 Detected foreign Karpenter ClusterRole
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Karpenter is already installed on cluster lenaic-karpenter-test. │
│ │
│ kubectl-datadog has nothing to install. │
│ │
│ Navigate to the Autoscaling settings page │
│ and select cluster to start generating recommendations: │
│ https://app.datadoghq.com/orchestration/scaling/settings?query=kube_cluster_name%3Alenaic-karpenter-test │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────╯ |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e0cef1fb57
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // looser check stays robust against upstream resource additions. | ||
| func hasKarpenterAPIGroupRule(rules []rbacv1.PolicyRule) bool { | ||
| return lo.ContainsBy(rules, func(rule rbacv1.PolicyRule) bool { | ||
| return slices.Contains(rule.APIGroups, karpenterAPIGroup) |
There was a problem hiding this comment.
Require a real Karpenter fingerprint before no-oping
On clusters that already have the Datadog Operator installed but no Karpenter controller, this API-group-only check returns true because the operator's own ClusterRole grants karpenter.sh permissions (config/rbac/role.yaml:378-389, generated from internal/controller/datadogagent_controller.go:103). Since install.go now exits when this helper reports true, those users are incorrectly told Karpenter is installed and cannot use the plugin to install autoscaling; please exclude the operator role or match a more Karpenter-specific RBAC shape.
Useful? React with 👍 / 👎.
What does this PR do?
Adds a pre-flight guard to
kubectl datadog autoscaling cluster installthat recognises when Karpenter is already running on the cluster — be it a Helm install in another namespace, a rawkubectl apply, or anything in between — and short-circuits with a friendly success message and a link to the Datadog Autoscaling settings page, mirroring the existing EKS auto-mode handling (no error, exit 0).Detection lists every ClusterRole on the cluster and looks for any rule referencing the
karpenter.shAPI group, which the chart'sclusterrole.yamlandclusterrole-core.yamlhard-code fornodepools/nodeclaimsregardless ofnameOverrideor other metadata customizations. ClusterRoles bearing ourautoscaling.datadoghq.com/installed-by=kubectl-datadogsentinel are skipped. CRDs are deliberately not used — Helm leaves them behind onhelm uninstall, and re-runninginstallafteruninstallmust succeed in spite of those leftovers.The PR also switches the chart's
additionalLabelsfrom overriding standardapp.kubernetes.io/managed-byandapp.kubernetes.io/versionto Datadog-namespaced keys, exported as constants so the writer (install.go) and reader (IsForeignKarpenterInstalled) stay in sync.Motivation
PR #2717 (CASCL-1281) added a guard for EKS auto-mode clusters. We need a symmetric guard for any other cluster where Karpenter is already running, otherwise re-running
installon such a cluster would deploy a second controller that would race with the first one on the same CRDs.We use a structural RBAC fingerprint rather than a label selector because the Karpenter chart's
karpenter.nametemplate renders both theapp.kubernetes.io/namelabel value and the ClusterRole's name fromnameOverride. A label-based selector would silently miss any install with--set nameOverride=...and let ourinstalldeploy a second controller side-by-side. Thekarpenter.shAPI group is invariant acrossnameOverridechanges — every Karpenter controller needs RBAC on it.The Datadog-namespaced label switch is required because the Karpenter chart's
_helpers.tplemitsapp.kubernetes.io/managed-by: {{ .Release.Service }}andapp.kubernetes.io/version: ...beforeadditionalLabels, producing duplicate YAML keys. Verified empirically on a real cluster: deduplication at the API server is non-deterministic across keys (managed-byresolves to the chart'sHelmvalue whileversionresolves to ours, on the very same resource), so we cannot rely on those overrides as identity markers.Additional Notes
Migration of existing kubectl-datadog installs. Plugin installs deployed before this change carry the old
app.kubernetes.io/managed-by=kubectl-datadogvalue (silently dropped by the chart) and lack the new sentinel. The new code will detect them as foreign and no-op. To migrate, run once:Acceptable because the install command is recent and not yet broadly deployed.
Out of scope. Resources we apply directly (NodePool / EC2NodeClass via
cmd/kubectl-datadog/.../install/k8s/) keep their existingapp.kubernetes.io/managed-by=kubectl-datadoglabel — no chart helper to fight there, the label sticks, anduninstall.gocontinues to use it as a selector.Minimum Agent Versions
Describe your test plan
Unit tests:
go test ./cmd/kubectl-datadog/autoscaling/cluster/install/...— coversIsForeignKarpenterInstalled: no Karpenter, ours-only, foreign-only, mixed, foreign sentinel value, ClusterRole with karpenter-looking labels but nokarpenter.shrules, foreign install with customnameOverride, API list-error paths;TestKarpenterAPIGroupContractpins the literalkarpenter.shAPI group so test fixtures using the constant cannot silently mask a typo;TestKarpenterHelmValuespins theadditionalLabelscontract with the detector — both sides go through the same exported constants — and covers both install modes (only Fargate annotates the ServiceAccount with the IRSA role ARN);TestDisplayForeignKarpenterMessagerenders the box into a buffered cobraCommandand asserts the user-visible strings, including the URL-encodedkube_cluster_namequery that points the user at their cluster's Datadog autoscaling settings page.End-to-end on a real EKS test cluster (
AWS_PROFILE=exec-sso-container-integrations-account-admin AWS_REGION=eu-west-3):helm install karpenter oci://public.ecr.aws/karpenter/karpenter --namespace karpenter --create-namespace --set settings.clusterName=$CLUSTER. Thenkubectl datadog autoscaling cluster install→ message + URL, exit 0, nodd-karpenter-…CFN stacks created (aws cloudformation list-stacks).nameOverride—helm install foo oci://... --set nameOverride=foo,settings.clusterName=$CLUSTER. Thenkubectl datadog autoscaling cluster install→ still detected and no-op'd, even though no ClusterRole carriesapp.kubernetes.io/name=karpenter.karpenter.shand runinstall→ variant of the no-op message, exit 0.helm uninstall karpenter -n karpenter, thenkubectl datadog autoscaling cluster install→ succeeds in spite of the four residual CRDs.Checklist
enhancementqa/skip-qalabel