Hunter32292/metric server watchlist #1790

hunter32292 wants to merge 3 commits into kubernetes-sigs:master
Conversation
Kubernetes v1.32+ kube-controller-manager force-enables the `WatchListClient` feature gate. The HPA controller watches `metrics.k8s.io/v1beta1` pods and nodes via this path. Previously, metrics-server only implemented `rest.Getter` and `rest.Lister`, causing 422 responses on WatchList attempts and forcing a fallback to a full LIST on every reconnect. This change adds `rest.Watcher` support to the podMetrics and nodeMetrics API types, allowing WatchList requests to succeed (200) instead of returning 422.

Key changes:
- Add a monotonic resourceVersion counter incremented on each scrape cycle
- Implement watch event generation by computing the diff (ADDED/MODIFIED/DELETED) between the previous and current metrics state
- Support `sendInitialEvents=true` with a BOOKMARK for WatchList semantics
- Add namespace scoping for pod watches and label selector filtering
- Implement slow-consumer protection with a 1000-event buffer per watcher

New files:
- `pkg/api/watcher.go`: Watch infrastructure and helpers
- `pkg/api/watcher_test.go`: Comprehensive watch tests

Modified:
- `pkg/storage/storage.go`: Resource versioning and watch broadcaster
- `pkg/storage/interface.go`: WatchableStorage interface
- `pkg/api/interfaces.go`: Watch-related interfaces
- `pkg/api/pod.go`: rest.Watcher implementation
- `pkg/api/node.go`: rest.Watcher implementation
- `pkg/storage/storage_test.go`: Watch event tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
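The diff-based event generation described above can be sketched as follows. This is a simplified illustration, not the PR's actual code: metrics are reduced to a timestamp keyed by `namespace/name`, and the type names are stand-ins for the apimachinery `watch` types.

```go
package main

import "fmt"

// EventType mirrors watch.EventType from k8s.io/apimachinery (simplified).
type EventType string

const (
	Added    EventType = "ADDED"
	Modified EventType = "MODIFIED"
	Deleted  EventType = "DELETED"
)

// Event pairs an event type with the key of the affected object.
type Event struct {
	Type EventType
	Key  string // "namespace/name"
}

// diffEvents compares the previous and current scrape results and emits
// ADDED/MODIFIED/DELETED events, one per object whose state changed.
func diffEvents(prev, curr map[string]int64) []Event {
	var events []Event
	for key, ts := range curr {
		old, ok := prev[key]
		switch {
		case !ok:
			events = append(events, Event{Added, key})
		case old != ts:
			events = append(events, Event{Modified, key})
		}
	}
	for key := range prev {
		if _, ok := curr[key]; !ok {
			events = append(events, Event{Deleted, key})
		}
	}
	return events
}

func main() {
	prev := map[string]int64{"default/a": 1, "default/b": 1}
	curr := map[string]int64{"default/b": 2, "default/c": 1}
	for _, e := range diffEvents(prev, curr) {
		fmt.Println(e.Type, e.Key)
	}
}
```

On each scrape cycle the storage layer would run this diff against the previous state, bump the resourceVersion counter, and broadcast the resulting events to registered watchers.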
This commit addresses several critical bugs and edge cases in the
WatchList/Watch implementation:
Issue 1 & 2: Race condition between snapshot and watcher registration
- Add atomic RegisterPodWatcherWithSnapshot/RegisterNodeWatcherWithSnapshot
methods that hold s.mu.RLock() while getting the snapshot AND registering
the watcher, preventing Store() from interleaving
- Update Watch helpers to use atomic registration instead of separate
GetAllMetrics() + CurrentResourceVersion() + Register() calls
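The atomic registration idea can be sketched like this. The method name comes from the commit message, but the internals are illustrative: metrics are a plain string map here, and the sketch takes a full mutex rather than reproducing the PR's RLock-based bookkeeping.

```go
package main

import (
	"fmt"
	"sync"
)

// watcher receives events on a buffered channel (slow-consumer protection).
type watcher struct{ ch chan string }

type storage struct {
	mu       sync.Mutex
	rv       uint64            // monotonic resourceVersion, bumped each scrape
	pods     map[string]string // key -> metrics payload (simplified stand-in)
	watchers []*watcher
}

// RegisterPodWatcherWithSnapshot takes the snapshot and registers the watcher
// in one critical section, so a concurrent Store() cannot interleave an
// update between the two steps.
func (s *storage) RegisterPodWatcherWithSnapshot() (map[string]string, uint64, *watcher) {
	s.mu.Lock()
	defer s.mu.Unlock()
	snap := make(map[string]string, len(s.pods))
	for k, v := range s.pods {
		snap[k] = v
	}
	w := &watcher{ch: make(chan string, 1000)}
	s.watchers = append(s.watchers, w)
	return snap, s.rv, w
}

func main() {
	s := &storage{pods: map[string]string{"default/a": "cpu=100m"}, rv: 7}
	snap, rv, _ := s.RegisterPodWatcherWithSnapshot()
	fmt.Println(len(snap), rv) // snapshot and resourceVersion are consistent
}
```

Because the snapshot and the registration happen under the same lock, the initial events a watcher replays and the live events it subsequently receives cannot miss or duplicate an update in between.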
Issue 3: Bookmark object uses wrong type for node watches
- Add resourceType field to metricsWatcher ("pod" or "node")
- Update sendBookmark() to create NodeMetrics for node watchers,
PodMetrics for pod watchers
- Add TestMetricsWatcher_BookmarkType to verify correct GVK
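The bookmark-type fix can be illustrated with a small sketch; the struct types are simplified stand-ins for the real metrics.k8s.io objects, and `bookmarkObject` is a hypothetical helper, not the PR's actual `sendBookmark()`.

```go
package main

import "fmt"

// Simplified stand-ins for the metrics.k8s.io object types.
type podMetrics struct{ ResourceVersion string }
type nodeMetrics struct{ ResourceVersion string }

// bookmarkObject builds the BOOKMARK payload with the kind matching the
// watcher's resourceType, so node watchers never receive a PodMetrics bookmark.
func bookmarkObject(resourceType, rv string) interface{} {
	if resourceType == "node" {
		return nodeMetrics{ResourceVersion: rv}
	}
	return podMetrics{ResourceVersion: rv}
}

func main() {
	fmt.Printf("%T\n", bookmarkObject("node", "42")) // main.nodeMetrics
	fmt.Printf("%T\n", bookmarkObject("pod", "42"))  // main.podMetrics
}
```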
Issue 4: Potential deadlock in broadcast under store lock
- Refactor Store() to calculate events under s.mu.Lock(), then
release the lock BEFORE broadcasting to watchers
- This ensures s.mu is never held during watcher operations,
eliminating any deadlock risk
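The lock-scoping pattern above can be sketched as follows, with the metrics state again reduced to a timestamp map for illustration. The key point is that the event computation and a copy of the watcher list happen under the lock, while the channel sends happen after it is released.

```go
package main

import (
	"fmt"
	"sync"
)

type event struct{ typ, key string }

type store struct {
	mu       sync.Mutex
	state    map[string]int64
	watchers []chan event
}

// Store computes watch events while holding the lock, then releases it before
// broadcasting, so s.mu is never held during watcher channel operations.
func (s *store) Store(curr map[string]int64) {
	s.mu.Lock()
	var events []event
	for k := range curr {
		if _, ok := s.state[k]; !ok {
			events = append(events, event{"ADDED", k})
		}
	}
	for k := range s.state {
		if _, ok := curr[k]; !ok {
			events = append(events, event{"DELETED", k})
		}
	}
	s.state = curr
	watchers := append([]chan event(nil), s.watchers...) // copy under lock
	s.mu.Unlock()

	// Broadcast after the lock is released; a blocked watcher cannot stall Store
	// while it holds s.mu, and full buffers are simply dropped.
	for _, w := range watchers {
		for _, e := range events {
			select {
			case w <- e:
			default: // slow consumer: buffer full, drop the event
			}
		}
	}
}

func main() {
	s := &store{state: map[string]int64{"default/a": 1}}
	w := make(chan event, 10)
	s.watchers = append(s.watchers, w)
	s.Store(map[string]int64{"default/b": 1})
	fmt.Println(<-w, <-w)
}
```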
Issue 5: Verify DELETED events for disappeared pods/nodes
- Confirmed existing logic correctly handles DELETED events
- Added TestStorage_PodWatchDeleteEvent to verify pods
- Added TestStorage_ConcurrentStoreWatch for race testing
Issue 6: Add Destroy() cleanup for watchers
- Add Stop() method to MetricsWatcher interface
- Add Shutdown() method to storage that closes all active watchers
- This enables graceful cleanup on server shutdown
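A minimal sketch of the Stop()/Shutdown() pairing, assuming a done-channel watcher; the field layout is illustrative, not the PR's actual types. `sync.Once` keeps Stop() idempotent, so a watcher stopped by its client and again by Shutdown() does not panic on a double close.

```go
package main

import (
	"fmt"
	"sync"
)

// metricsWatcher exposes Stop(); Shutdown() stops every active watcher once.
type metricsWatcher struct {
	stopOnce sync.Once
	done     chan struct{}
}

func (w *metricsWatcher) Stop() {
	w.stopOnce.Do(func() { close(w.done) }) // idempotent: safe to call twice
}

type storage struct {
	mu       sync.Mutex
	watchers []*metricsWatcher
}

// Shutdown closes all active watchers for graceful server shutdown.
func (s *storage) Shutdown() {
	s.mu.Lock()
	watchers := s.watchers
	s.watchers = nil
	s.mu.Unlock()
	for _, w := range watchers {
		w.Stop()
	}
}

func main() {
	w := &metricsWatcher{done: make(chan struct{})}
	s := &storage{watchers: []*metricsWatcher{w}}
	s.Shutdown()
	select {
	case <-w.done:
		fmt.Println("watcher stopped")
	default:
		fmt.Println("still running")
	}
}
```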
All tests pass with -race flag.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Storage-generated watch events did not include pod/node labels because the storage layer has no access to the pod/node listers. This caused any watcher with a non-empty label selector to receive zero events, since `matchesFilter()` checked `labels.Set(m.Labels)`, which was always nil.

Fix: add a LabelLookupFunc to metricsWatcher that enriches events with labels from the authoritative source (the pod/node listers) before filtering and forwarding. The Watch helpers in the API layer create this function from the pod/node listers available at construction time. The `enrichLabels()` method is called in `Send()` before `matchesFilter()`, so both the filtering and the forwarded event carry correct labels.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
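The enrichment step can be sketched like this. The LabelLookupFunc name comes from the commit; the metric struct and the selector check are simplified stand-ins for the real types and for apimachinery's `labels.Selector.Matches`.

```go
package main

import "fmt"

// LabelLookupFunc resolves authoritative labels for an object key; per the
// commit, it is built in the API layer from the pod/node listers.
type LabelLookupFunc func(namespace, name string) map[string]string

type metric struct {
	Namespace, Name string
	Labels          map[string]string
}

// enrichLabels fills in labels before selector filtering, since storage-layer
// events carry no labels of their own.
func enrichLabels(m *metric, lookup LabelLookupFunc) {
	if len(m.Labels) == 0 && lookup != nil {
		m.Labels = lookup(m.Namespace, m.Name)
	}
}

// matchesSelector is a stand-in for labels.Selector.Matches.
func matchesSelector(m *metric, key, val string) bool {
	return m.Labels[key] == val
}

func main() {
	lookup := func(ns, name string) map[string]string {
		return map[string]string{"app": "web"} // e.g. resolved via a pod lister
	}
	m := &metric{Namespace: "default", Name: "web-1"}
	enrichLabels(m, lookup) // without this, the selector below never matches
	fmt.Println(matchesSelector(m, "app", "web"))
}
```

Calling `enrichLabels` before the filter is the whole fix: with nil labels, every selector comparison fails and the watcher silently receives nothing.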
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hunter32292

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
This issue is currently awaiting triage. If metrics-server contributors determine this is a relevant issue, they will accept it by applying the `triage/accepted` label.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi @hunter32292. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
I recommend familiarizing yourself with the history of the idea of adding Watch support to the Metrics API.
What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):

Fixes #