Skip to content

Hunter32292/metric server watchlist#1790

Closed
hunter32292 wants to merge 3 commits intokubernetes-sigs:masterfrom
hunter32292:hunter32292/metric-server-watchlist
Closed

Hunter32292/metric server watchlist#1790
hunter32292 wants to merge 3 commits intokubernetes-sigs:masterfrom
hunter32292:hunter32292/metric-server-watchlist

Conversation

@hunter32292
Copy link
Copy Markdown

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

hunter32292 and others added 3 commits April 14, 2026 15:24
Kubernetes v1.32+ kube-controller-manager force-enables the WatchListClient
feature gate. The HPA controller watches metrics.k8s.io/v1beta1/pods and
nodes via this path. Previously, metrics-server only implemented rest.Getter
and rest.Lister, causing 422 responses on WatchList attempts and forcing
fallback to full LIST on every reconnect.

This change adds rest.Watcher support to podMetrics and nodeMetrics API
types, allowing WatchList requests to succeed (200) instead of returning 422.

Key changes:
- Add monotonic resourceVersion counter incremented on each scrape cycle
- Implement watch event generation by computing diff (ADDED/MODIFIED/DELETED)
  between previous and current metrics state
- Support sendInitialEvents=true with BOOKMARK for WatchList semantics
- Add namespace scoping for pod watches and label selector filtering
- Implement slow consumer protection with 1000-event buffer per watcher

New files:
- pkg/api/watcher.go: Watch infrastructure and helpers
- pkg/api/watcher_test.go: Comprehensive watch tests

Modified:
- pkg/storage/storage.go: Resource versioning and watch broadcaster
- pkg/storage/interface.go: WatchableStorage interface
- pkg/api/interfaces.go: Watch-related interfaces
- pkg/api/pod.go: rest.Watcher implementation
- pkg/api/node.go: rest.Watcher implementation
- pkg/storage/storage_test.go: Watch event tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This commit addresses several critical bugs and edge cases in the
WatchList/Watch implementation:

Issue 1 & 2: Race condition between snapshot and watcher registration
- Add atomic RegisterPodWatcherWithSnapshot/RegisterNodeWatcherWithSnapshot
  methods that hold s.mu.RLock() while getting the snapshot AND registering
  the watcher, preventing Store() from interleaving
- Update Watch helpers to use atomic registration instead of separate
  GetAllMetrics() + CurrentResourceVersion() + Register() calls

Issue 3: Bookmark object uses wrong type for node watches
- Add resourceType field to metricsWatcher ("pod" or "node")
- Update sendBookmark() to create NodeMetrics for node watchers,
  PodMetrics for pod watchers
- Add TestMetricsWatcher_BookmarkType to verify correct GVK

Issue 4: Potential deadlock in broadcast under store lock
- Refactor Store() to calculate events under s.mu.Lock(), then
  release the lock BEFORE broadcasting to watchers
- This ensures s.mu is never held during watcher operations,
  eliminating any deadlock risk

Issue 5: Verify DELETED events for disappeared pods/nodes
- Confirmed existing logic correctly handles DELETED events
- Added TestStorage_PodWatchDeleteEvent to verify pods
- Added TestStorage_ConcurrentStoreWatch for race testing

Issue 6: Add Destroy() cleanup for watchers
- Add Stop() method to MetricsWatcher interface
- Add Shutdown() method to storage that closes all active watchers
- This enables graceful cleanup on server shutdown

All tests pass with -race flag.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Storage-generated watch events did not include pod/node labels because
the storage layer has no access to the pod/node listers. This caused
any watcher with a non-empty label selector to receive zero events,
since matchesFilter() checked labels.Set(m.Labels) which was always nil.

Fix: Add a LabelLookupFunc to metricsWatcher that enriches events with
labels from the authoritative source (pod/node listers) before filtering
and forwarding. The Watch helpers in the API layer create this function
from the pod/node listers available at construction time.

The enrichLabels() method is called in Send() before matchesFilter(),
so both filtering and the forwarded event contain correct labels.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hunter32292
Once this PR has been reviewed and has the lgtm label, please assign serathius for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@linux-foundation-easycla
Copy link
Copy Markdown

CLA Not Signed

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 14, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

This issue is currently awaiting triage.

If metrics-server contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 14, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @hunter32292. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Instrumentation Apr 14, 2026
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 14, 2026
@github-project-automation github-project-automation bot moved this from Needs Triage to Closed in SIG Instrumentation Apr 14, 2026
@serathius
Copy link
Copy Markdown
Contributor

Recommend to familiarize yourself with history of idea of adding Watch for Metrics API

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

Status: Closed

Development

Successfully merging this pull request may close these issues.

3 participants