Skip to content

Conversation

@dhenkel92
Copy link
Contributor

@dhenkel92 dhenkel92 commented Oct 23, 2025

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

Implements delayed retry mechanism KEP-3258 for admission checks to prevent overwhelming external controllers and reduce control plane churn.

Problem:
Previously, when admission checks transitioned to Retry state, Kueue would immediately evict workloads and requeue them, causing excessive load on admission check controllers and unnecessary API server churn, particularly when retry conditions persisted predictably.

Solution:
This implementation adds two new fields to AdmissionCheckState:

  • requeueAfterSeconds: Specifies minimum wait time before retry
  • retryCount: Tracks retry attempts per admission check

Key Changes:

API Changes

  • Added requeueAfterSeconds and retryCount fields to AdmissionCheckState
  • Changed lastTransitionTime from required to optional (now pointer type) to enable webhook-managed timestamping and reduce clock skew issues

Controller Changes

  • Workload controller now respects delayed retry times before resetting checks
  • Prevents premature re-admission while delays are active
  • Automatically sets lastTransitionTime when admission check state changes
  • Auto-increments retryCount on transition to Retry state
  • Calculates maximum retry time across all admission checks
  • Updates workload.status.requeueState.requeueAt with the maximum delay
  • Verbs extended to include status updates for timestamp management

Behavior:
When multiple admission checks specify different delays, Kueue uses the maximum delay across all checks. Workloads are evicted immediately to release quota, but admission check states aren't reset until all delays expire, preventing race conditions where fast-responding checks could block slower ones from registering their delays.

Which issue(s) this PR fixes:

Fixes #3258

Special notes for your reviewer:

Does this PR introduce a user-facing change?

TBD

Support delayed admission check retries

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Oct 23, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dhenkel92
Once this PR has been reviewed and has the lgtm label, please assign mimowo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 23, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @dhenkel92. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 23, 2025
@netlify
Copy link

netlify bot commented Oct 23, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 0e67543
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/690321f6c4669c000896747e

@dhenkel92 dhenkel92 force-pushed the 09-09-feat_admissionchecks_implement_delayed_admission branch from 1f1dd14 to 8939064 Compare October 23, 2025 19:36
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 23, 2025
@kannon92
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 23, 2025
@dhenkel92 dhenkel92 force-pushed the 09-09-feat_admissionchecks_implement_delayed_admission branch 2 times, most recently from 1a0db58 to ed2f2dc Compare October 24, 2025 05:59
@dhenkel92
Copy link
Contributor Author

/retest

@mimowo
Copy link
Contributor

mimowo commented Oct 24, 2025

@dhenkel92 thank you for the PR
@PBundyra would you like to give it the first pass?

@mimowo
Copy link
Contributor

mimowo commented Oct 24, 2025

@dhenkel92 please update the release notes

@dhenkel92 dhenkel92 force-pushed the 09-09-feat_admissionchecks_implement_delayed_admission branch 2 times, most recently from 308d4d4 to 319ddc9 Compare October 27, 2025 10:06
@dhenkel92 dhenkel92 force-pushed the 09-09-feat_admissionchecks_implement_delayed_admission branch 2 times, most recently from 1a26439 to a98c452 Compare October 27, 2025 19:42
StatusFinished = "finished"
)

var (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, I would suggest to remove the distractions, but feel free to keep if done by ide automatically

Comment on lines 182 to 194
ginkgo.By("Wait for job to have quota again", func() {
gomega.Eventually(func(g gomega.Gomega) {
g.Expect(k8sClient.Get(ctx, wlLookupKey, createdWorkload)).To(gomega.Succeed())
g.Expect(workload.HasQuotaReservation(createdWorkload)).To(gomega.BeTrue())
g.Expect(workload.IsAdmitted(createdWorkload)).To(gomega.BeFalse())
}, util.LongTimeout, util.Interval).Should(gomega.Succeed())
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC in this block we could also check that the RequeueAt field is cleared

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment why it's not possible right now and I'll create a ticket afterwards.

Comment on lines 312 to 313
Count: ptr.To[int32](1),
}, false)
Count: ptr.To[int32](2),
}, true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please verify why this change is needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it is a flaky test that depends on the timing. I've reverted the change and will see how it performs.

Comment on lines 715 to 725
util.SetRequeuedConditionWithPodsReadyTimeout(ctx, k8sClient, client.ObjectKeyFromObject(prodWl))
util.ExpectWorkloadToHaveRequeueState(ctx, k8sClient, client.ObjectKeyFromObject(prodWl), &kueue.RequeueState{
Count: ptr.To[int32](2),
}, false)
}, true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also check if this change is needed and why

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment explaining why the change is required and I'll open a ticket afterwards.

@dhenkel92 dhenkel92 force-pushed the 09-09-feat_admissionchecks_implement_delayed_admission branch 2 times, most recently from 926299c to 93c08d7 Compare October 29, 2025 20:51
util.ExpectWorkloadToHaveRequeueState(ctx, k8sClient, client.ObjectKeyFromObject(wl2), &kueue.RequeueState{
Count: ptr.To[int32](1),
}, true)
}, false)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a valid change. We were not resetting the RequeueState.ReqeueAfter for this workload state before.

Copy link
Contributor

@mimowo mimowo Oct 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, maybe we don't need to clean it at all then (also related to the point above). I think an old value should not cause user-observable problems anyway, wdyt? Also related to https://github.com/kubernetes-sigs/kueue/pull/7370/files#r2475037424

EDIT: so my thinking is that if the workload s admitted, then the stale value is harmless (ignored) anyway, but when we are evicting we are going to update it anyway. So, I'm not clear we need to clear it at all. Especially if we didn't clear before this PR, then it seems orthogonal.

I agree it would be "ideal", but I want to minimize the scope of changes by the PR to bare minimum.

Comment on lines 715 to 725
util.SetRequeuedConditionWithPodsReadyTimeout(ctx, k8sClient, client.ObjectKeyFromObject(prodWl))
util.ExpectWorkloadToHaveRequeueState(ctx, k8sClient, client.ObjectKeyFromObject(prodWl), &kueue.RequeueState{
Count: ptr.To[int32](2),
}, false)
}, true)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment explaining why the change is required and I'll open a ticket afterwards.

Comment on lines 312 to 313
Count: ptr.To[int32](1),
}, false)
Count: ptr.To[int32](2),
}, true)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it is a flaky test that depends on the timing. I've reverted the change and will see how it performs.

Comment on lines 182 to 194
ginkgo.By("Wait for job to have quota again", func() {
gomega.Eventually(func(g gomega.Gomega) {
g.Expect(k8sClient.Get(ctx, wlLookupKey, createdWorkload)).To(gomega.Succeed())
g.Expect(workload.HasQuotaReservation(createdWorkload)).To(gomega.BeTrue())
g.Expect(workload.IsAdmitted(createdWorkload)).To(gomega.BeFalse())
}, util.LongTimeout, util.Interval).Should(gomega.Succeed())
})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment why it's not possible right now and I'll create a ticket afterwards.

Comment on lines 2412 to 2425
"should reset requeueState.requeueAt when expired": {
cq: utiltestingapi.MakeClusterQueue("cq").Obj(),
lq: utiltestingapi.MakeLocalQueue("lq", "ns").ClusterQueue("cq").Obj(),
workload: utiltestingapi.MakeWorkload("wl", "ns").
Queue("lq").
RequeueState(nil, ptr.To(metav1.NewTime(time.Now().Add(-time.Second)))).
Obj(),
wantWorkload: utiltestingapi.MakeWorkload("wl", "ns").
Queue("lq").
Obj(),
wantWorkloadUseMergePatch: utiltestingapi.MakeWorkload("wl", "ns").
Queue("lq").
Obj(),
},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason, the wantWorkload part of the test is not applying the patch and I don't understand why.

Copy link
Contributor Author

@dhenkel92 dhenkel92 Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it's a limitation of using SSA patches with the fake client. The only option i see right now is to remove the test.

@dhenkel92 dhenkel92 force-pushed the 09-09-feat_admissionchecks_implement_delayed_admission branch from 93c08d7 to ff342b3 Compare October 30, 2025 08:10
Implements delayed retry mechanism for admission checks to prevent overwhelming
external controllers and reduce control plane churn.

Problem:
Previously, when admission checks transitioned to Retry state, Kueue would
immediately evict workloads and requeue them, causing excessive load on
admission check controllers and unnecessary API server churn, particularly
when retry conditions persisted predictably.

Solution:
This implementation adds two new fields to AdmissionCheckState:
- requeueAfterSeconds: Specifies minimum wait time before retry
- retryCount: Tracks retry attempts per admission check

Key Changes:

API Changes
- Added requeueAfterSeconds and retryCount fields to AdmissionCheckState

Controller Changes
- Auto-increments retryCount on transition to Retry state
- Calculates maximum retry time across all admission checks
- Updates workload.status.requeueState.requeueAt with the maximum delay
- Workload controller now respects delayed retry times before resetting checks
- Prevents premature re-admission while delays are active

Behavior:
When multiple admission checks specify different delays, Kueue uses the maximum
delay across all checks. Workloads are evicted immediately to release quota, but
admission check states aren't reset until all delays expire, preventing race
conditions where fast-responding checks could block slower ones from registering
their delays.

Refs: KEP-3258
https://github.com/DataDog/kueue/blob/main/keps/3258-delayed-admission-check-retries/README.md
@dhenkel92 dhenkel92 force-pushed the 09-09-feat_admissionchecks_implement_delayed_admission branch from ff342b3 to 0e67543 Compare October 30, 2025 08:29
dhenkel92 added a commit to DataDog/kueue that referenced this pull request Oct 30, 2025
This is a follow-up to [kubernetes-sigs#7370](kubernetes-sigs#7370), where we introduced delayed admission checks.
This PR replaces the use of the shared RequeueState field with the new mechanism in the preprovision request admission check.

Refs: KEP-3258
https://github.com/DataDog/kueue/blob/main/keps/3258-delayed-admission-check-retries/README.md
dhenkel92 added a commit to DataDog/kueue that referenced this pull request Oct 30, 2025
This is a follow-up to [kubernetes-sigs#7370](kubernetes-sigs#7370), where we introduced delayed admission checks.
This PR replaces the use of the shared RequeueState field with the new mechanism in the preprovision request admission check.

Refs: KEP-3258
https://github.com/DataDog/kueue/blob/main/keps/3258-delayed-admission-check-retries/README.md
@dhenkel92
Copy link
Contributor Author

/retest

@mimowo
Copy link
Contributor

mimowo commented Oct 30, 2025

Let's check if there are not flakes 1/5
/test pull-kueue-test-e2e-certmanager-main
/test pull-kueue-test-e2e-customconfigs-main
/test pull-kueue-test-e2e-main-1-31
/test pull-kueue-test-e2e-main-1-32
/test pull-kueue-test-e2e-main-1-33
/test pull-kueue-test-e2e-main-1-34
/test pull-kueue-test-integration-baseline-main
/test pull-kueue-test-integration-extended-main
/test pull-kueue-test-integration-multikueue-main

@dhenkel92
Copy link
Contributor Author

Let's check if there are not flakes 2/5
/test pull-kueue-test-e2e-certmanager-main
/test pull-kueue-test-e2e-customconfigs-main
/test pull-kueue-test-e2e-main-1-31
/test pull-kueue-test-e2e-main-1-32
/test pull-kueue-test-e2e-main-1-33
/test pull-kueue-test-e2e-main-1-34
/test pull-kueue-test-integration-baseline-main
/test pull-kueue-test-integration-extended-main
/test pull-kueue-test-integration-multikueue-main

@dhenkel92
Copy link
Contributor Author

Let's check if there are not flakes 3/5
/test pull-kueue-test-e2e-certmanager-main
/test pull-kueue-test-e2e-customconfigs-main
/test pull-kueue-test-e2e-main-1-31
/test pull-kueue-test-e2e-main-1-32
/test pull-kueue-test-e2e-main-1-33
/test pull-kueue-test-e2e-main-1-34
/test pull-kueue-test-integration-baseline-main
/test pull-kueue-test-integration-extended-main
/test pull-kueue-test-integration-multikueue-main

@dhenkel92
Copy link
Contributor Author

Let's check if there are not flakes 4/5
/test pull-kueue-test-e2e-certmanager-main
/test pull-kueue-test-e2e-customconfigs-main
/test pull-kueue-test-e2e-main-1-31
/test pull-kueue-test-e2e-main-1-32
/test pull-kueue-test-e2e-main-1-33
/test pull-kueue-test-e2e-main-1-34
/test pull-kueue-test-integration-baseline-main
/test pull-kueue-test-integration-extended-main
/test pull-kueue-test-integration-multikueue-main

@dhenkel92
Copy link
Contributor Author

Let's check if there are not flakes 5/5
/test pull-kueue-test-e2e-certmanager-main
/test pull-kueue-test-e2e-customconfigs-main
/test pull-kueue-test-e2e-main-1-31
/test pull-kueue-test-e2e-main-1-32
/test pull-kueue-test-e2e-main-1-33
/test pull-kueue-test-e2e-main-1-34
/test pull-kueue-test-integration-baseline-main
/test pull-kueue-test-integration-extended-main
/test pull-kueue-test-integration-multikueue-main

@k8s-ci-robot
Copy link
Contributor

@dhenkel92: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kueue-test-integration-extended-main 0e67543 link true /test pull-kueue-test-integration-extended-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add retry mechanism for AdmissionChecks in Kueue

6 participants