Bump QPS and burst for kube-controller-manager in DRA load test #5746

Merged: 1 commit into kubernetes-sigs:main on Jul 21, 2025

Conversation

@nojnhuh (Contributor) commented Jul 14, 2025

What type of PR is this?
/kind cleanup

What this PR does / why we need it:

Bumps QPS and burst for kube-controller-manager to meet load test demands.
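For context, `--kube-api-qps` and `--kube-api-burst` tune kube-controller-manager's client-side rate limiter: QPS is the sustained request rate to the API server, and burst is how many requests may be fired in a short spike before throttling kicks in. A minimal Python sketch of the underlying token-bucket idea (illustrative only, not client-go's actual implementation; the values used are the defaults seen later in this thread):

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter, analogous in spirit to the client-side
    rate limiter behind --kube-api-qps/--kube-api-burst (illustrative only)."""

    def __init__(self, qps, burst):
        self.qps = qps            # steady-state refill rate (tokens/second)
        self.burst = burst        # bucket capacity: max requests in a spike
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_acquire(self, now=None):
        """Consume one token if available; False means the caller must wait."""
        now = time.monotonic() if now is None else now
        # Refill at `qps` tokens per second, capped at `burst`.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.qps)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# With the defaults (qps=20, burst=30), an idle client can fire 30 requests
# back-to-back, then is throttled to roughly 20 per second.
bucket = TokenBucket(qps=20, burst=30)
t0 = bucket.last
allowed = sum(bucket.try_acquire(now=t0) for _ in range(40))
print(allowed)  # 30
```

Raising both values lets the controller manager drive more concurrent work during the load test without being throttled by its own client.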

Which issue(s) this PR fixes:
Fixes #5745

Special notes for your reviewer:

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • cherry-pick candidate

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Jul 14, 2025
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 14, 2025
@nojnhuh (Contributor, Author) commented Jul 14, 2025

/test ?

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jul 14, 2025
@k8s-ci-robot (Contributor) commented:

@nojnhuh: The following commands are available to trigger required jobs:

/test pull-cluster-api-provider-azure-apiversion-upgrade
/test pull-cluster-api-provider-azure-build
/test pull-cluster-api-provider-azure-ci-entrypoint
/test pull-cluster-api-provider-azure-e2e
/test pull-cluster-api-provider-azure-e2e-aks
/test pull-cluster-api-provider-azure-test
/test pull-cluster-api-provider-azure-verify

The following commands are available to trigger optional jobs:

/test pull-cluster-api-provider-azure-apidiff
/test pull-cluster-api-provider-azure-apiserver-ilb
/test pull-cluster-api-provider-azure-capi-e2e
/test pull-cluster-api-provider-azure-conformance
/test pull-cluster-api-provider-azure-conformance-custom-builds
/test pull-cluster-api-provider-azure-conformance-dual-stack-with-ci-artifacts
/test pull-cluster-api-provider-azure-conformance-ipv6-with-ci-artifacts
/test pull-cluster-api-provider-azure-conformance-with-ci-artifacts
/test pull-cluster-api-provider-azure-conformance-with-ci-artifacts-dra
/test pull-cluster-api-provider-azure-dra-scalability
/test pull-cluster-api-provider-azure-e2e-optional
/test pull-cluster-api-provider-azure-e2e-workload-upgrade
/test pull-cluster-api-provider-azure-load-test-custom-builds
/test pull-cluster-api-provider-azure-load-test-dra-custom-builds
/test pull-cluster-api-provider-azure-load-test-dra-with-workload-custom-builds
/test pull-cluster-api-provider-azure-perf-test-apiserver-availability
/test pull-cluster-api-provider-azure-windows-custom-builds
/test pull-cluster-api-provider-azure-windows-with-ci-artifacts

Use /test all to run the following jobs that were automatically triggered:

pull-cluster-api-provider-azure-apidiff
pull-cluster-api-provider-azure-build
pull-cluster-api-provider-azure-ci-entrypoint
pull-cluster-api-provider-azure-conformance
pull-cluster-api-provider-azure-conformance-custom-builds
pull-cluster-api-provider-azure-conformance-dual-stack-with-ci-artifacts
pull-cluster-api-provider-azure-conformance-ipv6-with-ci-artifacts
pull-cluster-api-provider-azure-conformance-with-ci-artifacts
pull-cluster-api-provider-azure-conformance-with-ci-artifacts-dra
pull-cluster-api-provider-azure-e2e
pull-cluster-api-provider-azure-e2e-aks
pull-cluster-api-provider-azure-e2e-workload-upgrade
pull-cluster-api-provider-azure-test
pull-cluster-api-provider-azure-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@nojnhuh (Contributor, Author) commented Jul 14, 2025

/test pull-cluster-api-provider-azure-load-test-dra-with-workload-custom-builds

@nojnhuh (Contributor, Author) commented Jul 14, 2025

/cc @alaypatel07

@k8s-ci-robot (Contributor) commented:

@nojnhuh: GitHub didn't allow me to request PR reviews from the following users: alaypatel07.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @alaypatel07

codecov bot commented Jul 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 52.82%. Comparing base (5b0130c) to head (3f5d5e4).
Report is 24 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5746      +/-   ##
==========================================
- Coverage   52.84%   52.82%   -0.02%     
==========================================
  Files         278      279       +1     
  Lines       29610    29629      +19     
==========================================
+ Hits        15647    15652       +5     
- Misses      13146    13160      +14     
  Partials      817      817              


@alaypatel07 commented:
I am unsure why the DRA lane has started to fail: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/5746/pull-cluster-api-provider-azure-load-test-dra-with-workload-custom-builds/1944640471062024192

The error is something like this:

# Get kubeconfig and store it locally.
/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/hack/tools/bin/kubectl-v1.32.2 get secret/capz-nrko60-kubeconfig -n default -o json | jq -r .data.value | base64 --decode > ./kubeconfig
# TODO: Standardize timeouts across the Makefile and make them configurable based on the job.
/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/hack/tools/bin/kubectl-v1.32.2 -n default wait --for=condition=Ready --timeout=60m cluster "capz-nrko60"
error: timed out waiting for the condition on clusters/capz-nrko60
make[1]: *** [Makefile:384: create-workload-cluster] Error 1
make[1]: Leaving directory '/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure'
make: *** [Makefile:403: create-cluster] Error 2
================ MAKE CLEAN ===============
make clean-bin
make[1]: Entering directory '/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure'

The same failure is also seen in the periodics:

  1. https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-azure-dra-scalability/1943815000074227712
  2. https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-azure-dra-with-workload-scalability/1943952406521843712

The first known failure from the periodics was at 7:31 PM EDT on 07/11.

@jackfrancis @nojnhuh would you have any idea about why this is failing?

@nojnhuh (Contributor, Author) commented Jul 14, 2025

@jackfrancis @nojnhuh would you have any idea about why this is failing?

We were up against our max quota but things are looking better now.

/test pull-cluster-api-provider-azure-load-test-dra-with-workload-custom-builds

@alaypatel07 commented Jul 14, 2025

Got another failure, and it looks like the changes did not get applied as intended:

I0714 06:50:49.226276       1 flags.go:64] FLAG: --kube-api-burst="30"
I0714 06:50:49.226360       1 flags.go:64] FLAG: --kube-api-content-type="application/vnd.kubernetes.protobuf"
I0714 06:50:49.226440       1 flags.go:64] FLAG: --kube-api-qps="20"

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/5746/pull-cluster-api-provider-azure-load-test-dra-with-workload-custom-builds/1944640471062024192

EDIT: looks like I was looking at an older run, I will wait for the new run here: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/5746/pull-cluster-api-provider-azure-load-test-dra-with-workload-custom-builds/1944803095330426880
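One quick way to confirm which values actually took effect is to parse the `FLAG:` lines that kube-controller-manager prints at startup. A hypothetical helper sketch (the regex and the embedded log excerpt are assumptions based on the lines quoted above):

```python
import re

# Match startup lines like: FLAG: --kube-api-burst="30"
FLAG_RE = re.compile(r'FLAG: (--[\w-]+)="([^"]*)"')

def effective_flags(log_text):
    """Return a dict of flag name -> effective value from KCM startup logs."""
    return dict(FLAG_RE.findall(log_text))

# Excerpt from the failed run quoted above (defaults, i.e. the bump did not apply).
log = '''\
I0714 06:50:49.226276       1 flags.go:64] FLAG: --kube-api-burst="30"
I0714 06:50:49.226440       1 flags.go:64] FLAG: --kube-api-qps="20"
'''
flags = effective_flags(log)
print(flags["--kube-api-qps"], flags["--kube-api-burst"])  # 20 30
```

Comparing the parsed values against the intended bumped ones makes it obvious whether a run picked up the change or was an older build.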

@alaypatel07 commented:

@nojnhuh is this another failure from hitting quota limits? https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/5746/pull-cluster-api-provider-azure-load-test-dra-with-workload-custom-builds/1944803095330426880

How can we make sure we get some guaranteed time for iterating on the DRA jobs?

@alaypatel07 commented:

/test pull-cluster-api-provider-azure-load-test-dra-with-workload-custom-builds

1 similar comment
@nojnhuh (Contributor, Author) commented Jul 15, 2025

/test pull-cluster-api-provider-azure-load-test-dra-with-workload-custom-builds

@nojnhuh (Contributor, Author) commented Jul 15, 2025

Looks like this is working as intended.

/assign @Jont828

@alaypatel07 commented:

/test pull-cluster-api-provider-azure-load-test-dra-with-workload-custom-builds

@alaypatel07 commented:

@nojnhuh https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/5746/pull-cluster-api-provider-azure-load-test-dra-with-workload-custom-builds/1945527196135198720

The Prometheus setup in this job is failing because a PVC attachment is failing. Can you please look into it? This is blocking stabilization of the 100-node Azure test.

Events:
  Type     Reason              Age                 From                     Message
  ----     ------              ----                ----                     -------
  Normal   Scheduled           40m                 default-scheduler        Successfully assigned monitoring/prometheus-k8s-0 to capz-2g6oou-md-0-wwvhv-s7vz8
  Warning  FailedAttachVolume  86s (x27 over 40m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-e5124f0e-13ca-4bb7-999e-3e3f5463602d" : rpc error: code = Internal desc = Attach volume /subscriptions/46678f10-4bbb-447e-98e8-d2829589f2d8/resourceGroups/capz-2g6oou/providers/Microsoft.Compute/disks/pvc-e5124f0e-13ca-4bb7-999e-3e3f5463602d to instance capz-2g6oou-md-0-wwvhv-s7vz8 failed with PUT https://management.azure.com/subscriptions/46678f10-4bbb-447e-98e8-d2829589f2d8/resourceGroups/capz-2g6oou/providers/Microsoft.Compute/virtualMachines/capz-2g6oou-md-0-wwvhv-s7vz8

@jackfrancis (Contributor) commented:

/retest

@jackfrancis (Contributor) commented:

/test pull-cluster-api-provider-azure-load-test-dra-with-workload-custom-builds

@nojnhuh nojnhuh force-pushed the dra-ctrl-qps-burst branch from eb80a31 to 6e67ee2 Compare July 21, 2025 18:50
@nojnhuh (Contributor, Author) commented Jul 21, 2025

/test pull-cluster-api-provider-azure-load-test-dra-with-workload-custom-builds

@nojnhuh nojnhuh force-pushed the dra-ctrl-qps-burst branch from 6e67ee2 to 48e3e20 Compare July 21, 2025 18:54
@nojnhuh (Contributor, Author) commented Jul 21, 2025

/test pull-cluster-api-provider-azure-load-test-dra-with-workload-custom-builds

@nojnhuh nojnhuh force-pushed the dra-ctrl-qps-burst branch from 48e3e20 to 3f5d5e4 Compare July 21, 2025 19:24
@nojnhuh (Contributor, Author) commented Jul 21, 2025

/test pull-cluster-api-provider-azure-load-test-dra-with-workload-custom-builds

@k8s-ci-robot (Contributor) commented:

@nojnhuh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-cluster-api-provider-azure-load-test-dra-with-workload-custom-builds
Commit: 3f5d5e4
Required: false
Rerun command: /test pull-cluster-api-provider-azure-load-test-dra-with-workload-custom-builds

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@jackfrancis (Contributor) commented:

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 21, 2025
@k8s-ci-robot (Contributor) commented:

LGTM label has been added.

Git tree hash: 8e8b1fc0e5a0997b69aa2c10d4e06d3336866c07

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 21, 2025
@jackfrancis jackfrancis merged commit 8e98848 into kubernetes-sigs:main Jul 21, 2025
20 of 30 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Jul 21, 2025
@github-project-automation github-project-automation bot moved this from Todo to Done in CAPZ Planning Jul 21, 2025
@nojnhuh nojnhuh deleted the dra-ctrl-qps-burst branch July 21, 2025 21:52

Successfully merging this pull request may close these issues.

bump up KCM qps and burst for dra scalability job
5 participants