Description
Co-Authored by @nilsfed
What steps did you take and what happened:
On our clusters we use Gatekeeper with an AssignImage mutation to rewrite Pod
images to use our private mirror. On about 2% of pods this rewrite does not
happen when rolling over our node pools.
We run Gatekeeper in a fail-open configuration with three
gatekeeper-controller-manager replicas.
We investigated this by installing Gatekeeper in a controlled environment
(minikube) and using curl to query the webhook endpoint in a loop as fast as
possible while recording failures. Our test setup is outlined below. On scaling
events we observed failing requests. We traced this to the following two
problems in Gatekeeper:
- On Gatekeeper pod startup, the readiness probe reports the pod as ready
before it is able to serve webhook traffic.
- On Gatekeeper pod termination, the Service still points to the terminating pod
for a brief moment because Services update asynchronously. Gatekeeper does
not have a grace period on server shutdown, leading to refused connections.
What did you expect to happen:
Gatekeeper pods can serve requests whenever they are registered as endpoints of
the Service.
Mitigations:
We found that the health and readiness probes are misconfigured. They report a
ready state as soon as the manager has started, even though the webhook is not
yet responding to requests.
While we can reconfigure the health probe to validate that the webhook server is
able to serve requests by passing --enable-tls-healthcheck=true to the
gatekeeper-controller-manager, this is not yet possible for the readiness probe.
If webhooks are enabled, the readiness probe should check actual service health.
We therefore propose implementing the behavior of --enable-tls-healthcheck=true
for the readiness probe as well and enabling it by default for the readiness
probe only.
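To make the proposed readiness semantics concrete, here is a minimal shell sketch of a check that only succeeds once the webhook actually answers a TLS request. The function name `webhook_serving` and its default URL are assumptions for illustration (the URL is the in-cluster address used in the test loop below). This is not how Gatekeeper implements --enable-tls-healthcheck, only an external approximation of the same idea.

```shell
# Hypothetical helper, not part of Gatekeeper.
# Succeeds as soon as the webhook completes a TLS handshake and returns any
# HTTP response; fails on refused connections, timeouts, or TLS errors.
webhook_serving() {
  # Assumed default: the in-cluster Service address used in the test loop.
  url="${1:-https://gatekeeper-webhook-service.gatekeeper-system:443/v1/mutate}"
  # Without --fail, curl exits 0 for any HTTP status, so only transport-level
  # problems count as "not serving".
  curl --silent --output /dev/null --retry 0 --connect-timeout 1 --insecure "$url"
}

# Example use: block until the webhook is reachable.
# until webhook_serving; do sleep 1; done
```

A probe with these semantics would keep a freshly started pod out of the Service endpoints until the webhook server is really listening, which is the startup gap described above.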
We also found that adding a preStop hook to the gatekeeper-controller-manager
further prevents failing requests caused by the webhook server terminating
before the endpoint is removed from the Kubernetes Service.
With both mitigations applied, we saw zero failed requests over a 30-minute
test with the setup outlined below. Without the mitigations, requests started
failing after less than a minute.
Anything else you would like to add:
Our test setup:
kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/v3.17.1/deploy/gatekeeper.yaml
kubectl apply -f deployment.yaml
kubectl exec -it deployment/test -- bash
# In test container
request_count=0
fail_count=0
while :
do
  if ! curl --retry 0 --connect-timeout 1 --insecure https://gatekeeper-webhook-service.gatekeeper-system:443/v1/mutate; then
    echo fail
    fail_count=$((fail_count+1))
  fi
  request_count=$((request_count+1))
  echo failed: $fail_count
  echo request_count: $request_count
done
# In separate terminal on host:
while :
do
  kubectl rollout restart -n gatekeeper-system deployment gatekeeper-controller-manager
  sleep 20
done
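The test loop only prints raw counters. To compare a run against the roughly 2% of pods affected in production, the counters can be turned into a failure rate. This is a small sketch: `failure_pct` is a hypothetical helper that is not part of the original script, and it assumes awk is available in the test container.

```shell
# Hypothetical helper, not part of the original test script.
# Usage: failure_pct FAILED TOTAL
# Prints the share of failed requests as a percentage with two decimals,
# or "n/a" when no requests have been sent yet.
failure_pct() {
  if [ "$2" -eq 0 ]; then
    echo "n/a"
    return 0
  fi
  awk -v f="$1" -v r="$2" 'BEGIN { printf "%.2f%%\n", 100 * f / r }'
}

# Example use inside the test loop:
# echo "failure rate: $(failure_pct "$fail_count" "$request_count")"
```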
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
        - name: test
          image: fedora:latest
          command:
            - sleep
            - infinity
          resources:
            limits:
              memory: "128Mi"
We have attached below the kustomization we use to work around this problem
until a final fix lands upstream. It is a hack: it enables TLS health checks,
uses the /healthz endpoint for the readiness probe, and adds a preStop hook
that gives the Service time to remove the terminating endpoint and lets
in-flight requests finish.
Kustomization
kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: gatekeeper-system
resources:
  - https://raw.githubusercontent.com/open-policy-agent/gatekeeper/v3.17.1/deploy/gatekeeper.yaml
patches:
  - path: ./hotfix.yaml
    target:
      kind: Deployment
hotfix.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gatekeeper-controller-manager
  namespace: gatekeeper-system
spec:
  template:
    spec:
      containers:
        - name: manager
          args:
            - --port=8443
            - --logtostderr
            - --exempt-namespace=gatekeeper-system
            - --operation=webhook
            - --operation=mutation-webhook
            - --disable-opa-builtin={http.send}
            - --enable-tls-healthcheck=true
          readinessProbe:
            httpGet:
              path: /healthz
          lifecycle:
            preStop:
              # The Kubernetes sleep handler takes its duration in a `seconds` field.
              sleep:
                seconds: 10
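After applying the kustomization, it can be worth confirming that both changes actually landed on the Deployment. The sketch below uses kubectl's JSONPath output. The helper name `show_hotfix_fields` is hypothetical, and it assumes the container is named `manager` as in hotfix.yaml.

```shell
# Hypothetical helper: prints the readiness probe path and the preStop handler
# of the `manager` container in the gatekeeper-controller-manager Deployment.
show_hotfix_fields() {
  kubectl get deployment gatekeeper-controller-manager \
    --namespace gatekeeper-system \
    --output jsonpath='{.spec.template.spec.containers[?(@.name=="manager")].readinessProbe.httpGet.path}{"\n"}{.spec.template.spec.containers[?(@.name=="manager")].lifecycle.preStop}{"\n"}'
}

# Example use:
# show_hotfix_fields
```

If the patch was applied, the first line should show /healthz and the second line should show the preStop sleep handler.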
Environment:
- Gatekeeper version: 3.17.1
- Kubernetes version (use kubectl version): 1.31.0