Skip to content

fix(handler): eliminate API server dependency at startup for TLS config#1467

Open
qinqon wants to merge 1 commit intonmstate:mainfrom
qinqon:handler-tls-no-apiserver
Open

fix(handler): eliminate API server dependency at startup for TLS config#1467
qinqon wants to merge 1 commit intonmstate:mainfrom
qinqon:handler-tls-no-apiserver

Conversation

@qinqon
Copy link
Copy Markdown
Member

@qinqon qinqon commented Mar 24, 2026

/kind bug

What this PR does / why we need it:

Eliminates API server calls at handler startup for TLS configuration, preventing crash loops when network connectivity is temporarily unavailable.

The handler binary calls the API server at startup to detect OpenShift (IsOpenShift()GET /api) and fetch the TLS security profile (FetchAPIServerTLSProfile()). When the default network interface is being reconfigured (e.g., bond removal in e2e tests), the API server is temporarily unreachable, causing a 30s timeout → handler crash → restart loop (12 restarts observed in build 2036339976228048896).

The fix:

  1. IS_OPENSHIFT env var — Operator passes platform detection to all handler-deployed pods, eliminating the IsOpenShift() API discovery call
  2. TLS profile ConfigMap — On OpenShift, the operator fetches the TLS profile and writes it to a ConfigMap. Webhook and metrics pods mount it and read from file. Kubelet caches ConfigMap volumes locally, so they survive container restarts without API server connectivity
  3. Handler skips TLS profile — The handler DaemonSet doesn't serve platform-specific TLS endpoints (webhook and metrics are separate Deployments), so it skips TLS profile loading entirely
  4. Operator-driven TLS profile updates — The SecurityProfileWatcher is removed from webhook/metrics pods. Instead, the operator watches the APIServer CR for TLS profile changes, updates the ConfigMap, and rolls webhook/metrics Deployments automatically via a nmstate.io/tls-profile-hash annotation on the pod template

Special notes for your reviewer:

On container restart (the key scenario):

  1. Pod starts with API server connectivity → kubelet mounts ConfigMap
  2. Bond removal disrupts connectivity → handler crashes
  3. Kubelet restarts container → volumes already cached locally
  4. Handler reads config from file → no API server call → starts successfully

TLS profile change flow (OpenShift only):

  1. Admin changes TLS profile on APIServer CR
  2. Operator's watch triggers reconcile
  3. Operator re-fetches TLS profile, updates ConfigMap, re-renders Deployments with new hash annotation
  4. Kubernetes rolls webhook/metrics Deployments with updated TLS config

Openshift test:

Release note:

Fix handler crash loop when default network interface is temporarily reconfigured by eliminating API server dependency at startup for TLS configuration.

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Mar 24, 2026
@kubevirt-bot kubevirt-bot requested review from cybertron and emy March 24, 2026 11:27
@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where the handler binary could enter a crash loop during temporary network unavailability, particularly when the default network interface is reconfigured. By decoupling the handler's startup process from direct API server calls for OpenShift detection and TLS profile retrieval, the system gains resilience. The operator now proactively provides necessary configuration via environment variables and ConfigMaps, ensuring that components can initialize successfully even without immediate API server connectivity, thus improving stability and preventing service disruptions.

Highlights

  • OpenShift Detection: The handler now detects OpenShift status from an IS_OPENSHIFT environment variable, set by the operator, instead of making an API server call at startup.
  • TLS Profile Configuration: On OpenShift, the operator fetches the TLS security profile and writes it to a ConfigMap. Webhook and metrics pods now mount this ConfigMap and read the TLS profile from a file, eliminating direct API server calls for initial TLS configuration.
  • Handler DaemonSet TLS: The handler DaemonSet, which does not serve platform-specific TLS endpoints, now skips TLS profile loading entirely, relying on default TLS settings.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@qinqon qinqon requested review from mkowalski and removed request for cybertron March 24, 2026 11:33
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively removes the handler's dependency on the API server at startup for TLS configuration, which resolves a crash loop issue during network instability. The approach of using an environment variable for platform detection and a ConfigMap for the TLS profile is robust and well-implemented. My review focuses on a minor maintainability improvement to further strengthen the code.

// (webhook and metrics). The handler DaemonSet uses default TLS.
// Read from a ConfigMap-mounted file to avoid API server calls.
if isOpenShift && !environment.IsHandler() {
opts, spec, err := cluster.FetchTLSProfileFromFile("/etc/nmstate/tls/profile.json")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The file path /etc/nmstate/tls/profile.json is hardcoded here. This path is constructed from a volume mount path defined in deploy/handler/operator.yaml and a filename from deploy/handler/tls_config.yaml. Hardcoding the full path in the Go code creates a fragile link between the code and the deployment manifests.

To improve maintainability, consider defining this path as a constant at the package level. This makes it easier to find and update if the file path or name ever changes, and it makes the code's dependency on this specific file more explicit.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Extracted to a tlsProfileConfigPath constant in 681a76b.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already addressed — the path is now a package-level constant tlsProfileConfigPath at cmd/handler/main.go:77.

Comment on lines +153 to +165
mgr, err := createManager(cfg, tlsOpts, platformTLSOpts != nil /*isOpenShift*/)
mgr, err := createManager(cfg, tlsOpts, isOpenShift)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This slightly changes the behaviour, it's not only a refactor change.

I want to be able to have platformTLSOpts configured even if I am not OpenShift.

I want to be able to have empty platformTLSOpts even if I am OpenShift.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense let me check what we can do.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Fixed in 27617adcreateManager now uses platformTLSOpts \!= nil again to decide whether to apply auth/authz filters, keeping the two concerns independent.

@qinqon qinqon force-pushed the handler-tls-no-apiserver branch from ffadb5d to 58303fa Compare March 24, 2026 11:52
@mkowalski
Copy link
Copy Markdown
Member

/gemini full review

@gemini-code-assist
Copy link
Copy Markdown

Warning

Gemini is experiencing higher than usual traffic and was unable to create the issue-comment-reply. Please try again in a few hours by commenting /gemini issue-comment-reply.

@mkowalski
Copy link
Copy Markdown
Member

/gemini issue-comment-reply

@gemini-code-assist
Copy link
Copy Markdown

Thanks for re-triggering the command! I've completed a full review of your pull request.

Overall, the changes are well-designed and effectively address the critical issue of handler crash loops due to temporary API server unavailability during startup. Decoupling OpenShift detection and TLS profile retrieval from direct API server calls significantly improves the resilience and stability of the system. The use of environment variables and ConfigMaps, along with operator-driven updates, is a robust and maintainable approach.

Here are some specific observations:

  • Correctness: The logic for OpenShift detection via environment variable and TLS profile loading from a ConfigMap-mounted file directly tackles the problem described in the PR description. The operator's role in managing and propagating TLS profile updates via hash annotations ensures that changes are applied correctly and trigger necessary restarts.
  • Efficiency: By avoiding synchronous API server calls at startup, the handler can initialize more quickly and reliably, especially in environments with transient network issues. This is a significant improvement.
  • Maintainability: The code is clear, with good comments explaining the new flow. The introduction of dedicated functions like IsOpenShiftFromEnv() and FetchTLSProfileFromFile() in pkg/cluster/cluster.go enhances modularity. The use of Go templating in the deployment manifests (deploy/handler/operator.yaml and deploy/handler/tls_config.yaml) for conditional logic is appropriate and standard.
  • Security: The approach of managing TLS profiles via ConfigMaps, which are cached locally by Kubelet, is a secure way to distribute sensitive configuration without exposing it directly in pod definitions or requiring repeated API calls.

I have a minor suggestion regarding error handling, though it's not critical:

  • Severity: low
  • Description: In pkg/cluster/cluster.go, the FetchTLSProfileFromFile function logs an error if it fails to read or parse the TLS profile. While the current error messages are informative, consider adding a debug log of the file path or a truncated content (if not sensitive) when parsing fails. This could aid in debugging malformed ConfigMaps in a development environment, but care should be taken not to log sensitive data in production. Given the nature of TLS profiles, the current generic error message is likely safer for production.

Great work on this pull request! The solution is well-thought-out and provides a solid improvement to the system's robustness.

@qinqon qinqon force-pushed the handler-tls-no-apiserver branch from 681a76b to 7986370 Compare March 24, 2026 12:25
@qinqon qinqon requested a review from mkowalski March 24, 2026 12:26
@mkowalski
Copy link
Copy Markdown
Member

/approve

@kubevirt-bot
Copy link
Copy Markdown
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mkowalski

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 25, 2026
@qinqon qinqon force-pushed the handler-tls-no-apiserver branch 3 times, most recently from 0907f4a to 8789327 Compare March 26, 2026 09:23
The handler binary calls the API server at startup to detect OpenShift
and fetch the TLS security profile. When the default network interface
is being reconfigured (e.g. bond removal), the API server is temporarily
unreachable, causing a 30s timeout, handler crash, and restart loop.

Fix this by:
- Adding IS_OPENSHIFT env var set by the operator on all handler-deployed
  pods, eliminating the IsOpenShift() API discovery call
- On OpenShift, the operator fetches the TLS profile and writes it to a
  ConfigMap. Webhook and metrics pods mount it and read from file instead
  of calling the API server. Kubelet caches ConfigMap volumes locally, so
  they survive container restarts without API server connectivity.
- The handler DaemonSet skips TLS profile loading entirely since it does
  not serve platform-specific TLS endpoints
- TLS profile changes are handled entirely by the operator: it watches
  the APIServer CR, updates the ConfigMap, and rolls webhook/metrics
  Deployments via a hash annotation on the pod template

Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@qinqon qinqon force-pushed the handler-tls-no-apiserver branch from 8789327 to 507c8d3 Compare March 26, 2026 09:56
@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 28, 2026
@kubevirt-bot
Copy link
Copy Markdown
Collaborator

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/bug needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants