fix(handler): eliminate API server dependency at startup for TLS config#1467
fix(handler): eliminate API server dependency at startup for TLS config#1467qinqon wants to merge 1 commit intonmstate:mainfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical issue where the handler binary could enter a crash loop during temporary network unavailability, particularly when the default network interface is reconfigured. By decoupling the handler's startup process from direct API server calls for OpenShift detection and TLS profile retrieval, the system gains resilience. The operator now proactively provides necessary configuration via environment variables and ConfigMaps, ensuring that components can initialize successfully even without immediate API server connectivity, thus improving stability and preventing service disruptions. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request effectively removes the handler's dependency on the API server at startup for TLS configuration, which resolves a crash loop issue during network instability. The approach of using an environment variable for platform detection and a ConfigMap for the TLS profile is robust and well-implemented. My review focuses on a minor maintainability improvement to further strengthen the code.
cmd/handler/main.go
Outdated
| // (webhook and metrics). The handler DaemonSet uses default TLS. | ||
| // Read from a ConfigMap-mounted file to avoid API server calls. | ||
| if isOpenShift && !environment.IsHandler() { | ||
| opts, spec, err := cluster.FetchTLSProfileFromFile("/etc/nmstate/tls/profile.json") |
There was a problem hiding this comment.
The file path /etc/nmstate/tls/profile.json is hardcoded here. This path is constructed from a volume mount path defined in deploy/handler/operator.yaml and a filename from deploy/handler/tls_config.yaml. Hardcoding the full path in the Go code creates a fragile link between the code and the deployment manifests.
To improve maintainability, consider defining this path as a constant at the package level. This makes it easier to find and update if the file path or name ever changes, and it makes the code's dependency on this specific file more explicit.
There was a problem hiding this comment.
Makes sense. Extracted to a tlsProfileConfigPath constant in 681a76b.
There was a problem hiding this comment.
Already addressed — the path is now a package-level constant tlsProfileConfigPath at cmd/handler/main.go:77.
cmd/handler/main.go
Outdated
| mgr, err := createManager(cfg, tlsOpts, platformTLSOpts != nil /*isOpenShift*/) | ||
| mgr, err := createManager(cfg, tlsOpts, isOpenShift) |
There was a problem hiding this comment.
This slightly changes the behaviour, it's not only a refactor change.
I want to be able to have platformTLSOpts configured even if I am not OpenShift.
I want to be able to have empty platformTLSOpts even if I am OpenShift.
There was a problem hiding this comment.
Make sense let me check what we can do.
There was a problem hiding this comment.
Good point. Fixed in 27617ad — createManager now uses platformTLSOpts \!= nil again to decide whether to apply auth/authz filters, keeping the two concerns independent.
ffadb5d to
58303fa
Compare
|
/gemini full review |
|
Warning Gemini is experiencing higher than usual traffic and was unable to create the issue-comment-reply. Please try again in a few hours by commenting |
|
/gemini issue-comment-reply |
|
Thanks for re-triggering the command! I've completed a full review of your pull request. Overall, the changes are well-designed and effectively address the critical issue of handler crash loops due to temporary API server unavailability during startup. Decoupling OpenShift detection and TLS profile retrieval from direct API server calls significantly improves the resilience and stability of the system. The use of environment variables and ConfigMaps, along with operator-driven updates, is a robust and maintainable approach. Here are some specific observations:
I have a minor suggestion regarding error handling, though it's not critical:
Great work on this pull request! The solution is well-thought-out and provides a solid improvement to the system's robustness. |
681a76b to
7986370
Compare
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mkowalski The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
0907f4a to
8789327
Compare
The handler binary calls the API server at startup to detect OpenShift and fetch the TLS security profile. When the default network interface is being reconfigured (e.g. bond removal), the API server is temporarily unreachable, causing a 30s timeout, handler crash, and restart loop. Fix this by: - Adding IS_OPENSHIFT env var set by the operator on all handler-deployed pods, eliminating the IsOpenShift() API discovery call - On OpenShift, the operator fetches the TLS profile and writes it to a ConfigMap. Webhook and metrics pods mount it and read from file instead of calling the API server. Kubelet caches ConfigMap volumes locally, so they survive container restarts without API server connectivity. - The handler DaemonSet skips TLS profile loading entirely since it does not serve platform-specific TLS endpoints - TLS profile changes are handled entirely by the operator: it watches the APIServer CR, updates the ConfigMap, and rolls webhook/metrics Deployments via a hash annotation on the pod template Signed-off-by: Enrique Llorente <ellorent@redhat.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8789327 to
507c8d3
Compare
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/kind bug
What this PR does / why we need it:
Eliminates API server calls at handler startup for TLS configuration, preventing crash loops when network connectivity is temporarily unavailable.
The handler binary calls the API server at startup to detect OpenShift (
IsOpenShift()→GET /api) and fetch the TLS security profile (FetchAPIServerTLSProfile()). When the default network interface is being reconfigured (e.g., bond removal in e2e tests), the API server is temporarily unreachable, causing a 30s timeout → handler crash → restart loop (12 restarts observed in build 2036339976228048896).The fix:
IS_OPENSHIFTenv var — Operator passes platform detection to all handler-deployed pods, eliminating theIsOpenShift()API discovery callSecurityProfileWatcheris removed from webhook/metrics pods. Instead, the operator watches theAPIServerCR for TLS profile changes, updates the ConfigMap, and rolls webhook/metrics Deployments automatically via anmstate.io/tls-profile-hashannotation on the pod templateSpecial notes for your reviewer:
On container restart (the key scenario):
TLS profile change flow (OpenShift only):
APIServerCROpenshift test:
Release note: