Description
Normal day-to-day usage of make deploy/undeploy cycles for bpfman-operator development causes exponential growth in mount entries within the cluster control plane, eventually rendering the cluster unusable.
Cycle 1 - Current mount count: 80
Cycle 2 - Current mount count: 82
Cycle 3 - Current mount count: 86
Cycle 4 - Current mount count: 95
Cycle 5 - Current mount count: 111
Cycle 6 - Current mount count: 143
Cycle 7 - Current mount count: 207
Cycle 8 - Current mount count: 335
Cycle 9 - Current mount count: 591
Cycle 10 - Current mount count: 1103
Cycle 11 - Current mount count: 2127
Cycle 12 - Current mount count: 4175
Cycle 13 - Current mount count: 8271
Cycle 14 - Current mount count: 16463
Cycle 15 - Current mount count: 32847
The growth follows an exponential pattern, approximately doubling each cycle after cycle 7. This indicates mount points are not being properly cleaned up during undeploy operations.
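The reported counts actually fit a simple closed form: from cycle 4 onward, each count equals 2^n + 79, consistent with a leak that doubles every cycle on top of a fixed ~79-entry baseline. A quick sanity check against the numbers above (an observation about the reported data, not a proven model of the leak):

```python
# Mount counts per cycle, as reported above.
counts = {
    1: 80, 2: 82, 3: 86, 4: 95, 5: 111, 6: 143, 7: 207, 8: 335,
    9: 591, 10: 1103, 11: 2127, 12: 4175, 13: 8271, 14: 16463, 15: 32847,
}

# From cycle 4 onward the counts match 2**n + 79 exactly, i.e. the
# leaked entries double each cycle on top of a fixed baseline.
for n in range(4, 16):
    assert counts[n] == 2 ** n + 79
print("all cycles from 4 onward match 2**n + 79")
```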
Impact
- Development Workflow Disruption: Clusters become unusable after routine deploy/undeploy cycles
- System Instability: High mount counts can cause kernel resource exhaustion
- Deployment Failures: Eventually pods fail to enter running state within timeout periods
- Performance Degradation: Mount table operations become increasingly slow
Reproduction Steps
- Create a Kind cluster:
  make setup-kind
- Monitor mount points in real-time (optional):
  while :; do echo -n "$(date): "; docker exec bpfman-deployment-control-plane wc -l /proc/mounts; sleep 5; done
- Perform normal development cycles, repeating:
  make deploy
  make undeploy
- Observe mount count growth with each cycle
- After ~15 cycles, deployments time out waiting for pods to become ready
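The cycle loop above can be automated so each run records its mount count. A minimal sketch, assuming the `make deploy`/`make undeploy` targets and the default Kind node name from this report (adjust `NODE` for your environment):

```python
"""Automate deploy/undeploy cycles and record the node's mount count."""
import subprocess

NODE = "bpfman-deployment-control-plane"  # default Kind control-plane node


def mount_count(mounts_text: str) -> int:
    """Count mount entries, given the text of /proc/mounts."""
    return len(mounts_text.splitlines())


def node_mounts(node: str = NODE) -> str:
    """Read /proc/mounts from inside the Kind node container."""
    out = subprocess.run(
        ["docker", "exec", node, "cat", "/proc/mounts"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout


def run_cycles(n: int = 15) -> list[int]:
    """Run n deploy/undeploy cycles, printing the mount count after each."""
    counts = []
    for cycle in range(1, n + 1):
        subprocess.run(["make", "deploy"], check=True)
        subprocess.run(["make", "undeploy"], check=True)
        counts.append(mount_count(node_mounts()))
        print(f"Cycle {cycle} - Current mount count: {counts[-1]}")
    return counts
```

With a non-leaking undeploy, `run_cycles()` should print a roughly flat series rather than the doubling sequence shown above.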
Environment
- Platform: KIND cluster on Linux (or OpenShift)
- Cluster: bpfman-deployment (default KIND cluster name)
- Namespace: bpfman
- Development Cycles: 15 deploy/undeploy cycles over ~9 minutes
Note: This issue is also reproducible on OpenShift clusters, where remediation requires rebooting affected nodes.
Expected Behaviour
Mount entries should return to baseline levels after each undeploy operation, with minimal growth over time.
Actual Behaviour
Mount entries accumulate exponentially and are never fully cleaned up during undeploy operations, eventually breaking the development cluster.
This issue significantly impacts the development workflow as clusters become unusable after routine development cycles. After ~15 deploy/undeploy cycles, the cluster reaches system limits (32,767 mount entries) and deployments begin failing.
Production Impact: On OpenShift clusters, this issue manifests similarly and requires node reboots to remediate, making it a critical issue for production deployments.
The issue likely stems from incomplete cleanup during the undeploy process, where mount points created during deployment are not properly removed.
Suggested Investigation
An initial avenue of investigation would be the CSI driver, as it manages volume mounts and may not be properly cleaning up mount points during undeploy operations. The exponential growth pattern suggests mount points are being created but never removed.
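One way to confirm and localise the leak is to group `/proc/mounts` entries by mount target: leaked volume mounts typically show the same target path mounted many times over. A small sketch of that check (the sample data below is illustrative, not captured from the actual cluster):

```python
from collections import Counter


def duplicate_mounts(mounts_text: str, min_count: int = 2) -> list[tuple[str, int]]:
    """Return (mount target, occurrences) pairs seen min_count or more times.

    mounts_text is the content of /proc/mounts; the second whitespace-
    separated field on each line is the mount point.
    """
    targets = [line.split()[1] for line in mounts_text.splitlines() if line.strip()]
    return [(t, c) for t, c in Counter(targets).most_common() if c >= min_count]


# Illustrative sample: one target mounted three times, a leak signature.
sample = (
    "overlay / overlay rw 0 0\n"
    "tmpfs /run tmpfs rw 0 0\n"
    "tmpfs /var/lib/kubelet/pods/x/volumes/csi/sock tmpfs rw 0 0\n"
    "tmpfs /var/lib/kubelet/pods/x/volumes/csi/sock tmpfs rw 0 0\n"
    "tmpfs /var/lib/kubelet/pods/x/volumes/csi/sock tmpfs rw 0 0\n"
)
print(duplicate_mounts(sample))
# [('/var/lib/kubelet/pods/x/volumes/csi/sock', 3)]
```

Running the same check against the affected node (e.g. via `docker exec <node> cat /proc/mounts`) should reveal which paths are accumulating, pointing at the component that mounts them.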