We recently merged the initial iteration of the in-memory provider (#8799). But this was just the first step of the scale test implementation. This issue provides an overview over ongoing and upcoming tasks around scale testing.
In-Memory provider features:
- High-level:
  - P0 ClusterClass support (🌱 add ClusterClass support for in-memory provider #8807 @ykakarap)
  - P0 Deletion (🐛 fix cluster deletion in the in-memory API server #8818 @fabriziopandini)
  - Upgrade (@killianmuldoon)
  - KCP kube-proxy and CoreDNS reconciliation (🌱 CAPIM: Enable update for coreDNS and kube-proxy #8899 @killianmuldoon)
- Make it behave like a real infra provider:
  - P0 Provisioning duration (🌱 Add startup timeout to the in memory provider #8831 @fabriziopandini)
  - Errors
  - Configurable apiserver/etcd latency
- Low-level:
  - P0 apiserver: watches (🌱 Add watch to in-memory server multiplexer #8851 @killianmuldoon)
  - apiserver: label selectors for list calls
    - Not a problem for resources cached by KCP: label selectors for cached resources are evaluated client-side in controller-runtime (see the sketch after this list).
  - apiserver: improve field selector handling; return an error if the field selector is not supported (✨ Enable Kubernetes upgrades in CAPIM #8938 (comment) @killianmuldoon)
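As a reference for the label-selector note above, a minimal sketch of why selectors are not an issue for cached resources: with a controller-runtime cache-backed client the selector is applied against the local informer store, so the in-memory apiserver never has to evaluate it. The helper name is made up; the types are the CAPI v1beta1 API.

```go
// Hypothetical helper, not actual KCP code: list the Machines of a cluster
// through a cache-backed controller-runtime client. The informer behind the
// cache lists/watches all Machines; the label selector below is evaluated
// in-process against the local store rather than being sent to the apiserver.
package example

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func listClusterMachines(ctx context.Context, c client.Client, namespace, clusterName string) (*clusterv1.MachineList, error) {
	machines := &clusterv1.MachineList{}
	if err := c.List(ctx, machines,
		client.InNamespace(namespace),
		// Evaluated client-side when c is the manager's cached client.
		client.MatchingLabels{clusterv1.ClusterNameLabel: clusterName},
	); err != nil {
		return nil, err
	}
	return machines, nil
}
```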
e2e test and test framework:
- Implement scale test automation:
  - Cluster topologies (one shape is sketched after this list):
    - Small workload cluster: x * (1 control-plane node + 1 worker node)
    - Small-medium workload cluster: x * (3 control-plane nodes + 10 worker nodes)
    - Medium workload cluster: x * (3 control-plane nodes + 50 worker nodes)
    - Large workload cluster: x * (3 control-plane nodes + 500 worker nodes)
    - Dimensions: # of MachineDeployments
  - Scenarios:
    - P0 Create & delete (🌱 Add Scale e2e - development only #8833 @ykakarap)
    - Create, upgrade & delete (@killianmuldoon)
    - Long-lived clusters (~ a few hours or a day, to catch memory leaks etc.)
    - Chaos testing: e.g. injecting failures like cluster not reachable, machine failures
    - More complex scenarios: e.g. topology is actively changed (MD scale up etc.)
    - Add MachineHealthCheck to the scaling test (@ykakarap)
- Automate scale testing in CI (prior art: KCP, k/k):
  - Metric collection and consumption after test completion
  - Tests should fail based on SLAs (e.g. machine creation slower than x minutes)
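To make the topology table above concrete, here is a minimal sketch of one "medium workload cluster" expressed as a ClusterClass-based topology; the scale test would stamp out x copies of this shape. The ClusterClass name, worker class name and Kubernetes version are assumptions, the types are the CAPI v1beta1 API.

```go
// Hypothetical example of a "medium workload cluster": 3 control-plane
// replicas plus one MachineDeployment with 50 workers.
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/pointer"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

func mediumWorkloadCluster(name, namespace string) *clusterv1.Cluster {
	return &clusterv1.Cluster{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: clusterv1.ClusterSpec{
			Topology: &clusterv1.Topology{
				Class:   "in-memory", // assumed ClusterClass name
				Version: "v1.27.3",   // assumed Kubernetes version
				ControlPlane: clusterv1.ControlPlaneTopology{
					Replicas: pointer.Int32(3),
				},
				Workers: &clusterv1.WorkersTopology{
					MachineDeployments: []clusterv1.MachineDeploymentTopology{
						// assumed worker class name
						{Class: "default-worker", Name: "md-0", Replicas: pointer.Int32(50)},
					},
				},
			},
		},
	}
}
```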
Metrics & observability:
- P0 Cluster API state metrics & dashboard (🌱 hack/observability: Add Grafana state dashboard, improve metrics #8834 @sbueringer)
- In-memory provider metrics & dashboard:
  - apiserver & etcd: server-side request metrics (prior art: kube-apiserver)
- Consider exposing more metrics in core CAPI (see the sketch after this list), e.g.:
  - time until a Machine is running
  - queue additions (to figure out who is adding items)
- Consider writing alerts for problematic conditions
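For the "more metrics in core CAPI" idea, a minimal sketch of what a "time until Machine is running" metric could look like. The metric name, buckets and wiring are assumptions, not existing CAPI code; it only relies on the controller-runtime metrics registry that CAPI managers already serve.

```go
// Hypothetical metric: time from Machine creation until the Running phase.
package example

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

var machineRunningDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "capi_machine_provisioning_duration_seconds", // assumed name
	Help:    "Time from Machine creation until the Machine reached the Running phase.",
	Buckets: prometheus.ExponentialBuckets(30, 2, 10), // 30s .. ~4h
})

func init() {
	// Registering against the controller-runtime registry makes the metric
	// show up on the manager's /metrics endpoint.
	ctrlmetrics.Registry.MustRegister(machineRunningDuration)
}

// observeMachineRunning would be called by the Machine controller the first
// time it observes the Running phase.
func observeMachineRunning(creationTime time.Time) {
	machineRunningDuration.Observe(time.Since(creationTime).Seconds())
}
```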
Performance improvements
- ✨ Add flags for configuring rate limits #8579
- 🐛 Prevent KCP to create many private keys for each reconcile #8617
- 🌱 Use ClusterCacheTracker consistently (instead of NewClusterClient) #8744
- 🌱 Remove unnecessary requeues #8743
- 🐛 ClusterCacheTracker: Stop pod caching when checking workload cluster #8850
- 🌱 Deprecate DefaultIndex usage and remove where not needed #8855
- 🌱 Use rest config from ClusterCacheTracker consistently #8894
- 🌱 optimize reconcileInterruptibleNodeLabel of machine controller #8852
- 🌱 controller/machine: use unstructured caching client #8896
- ✨ Use caching read for bootstrap config owner #8867
- 🌱 Kcp use one workload cluster for reconcile #8900
- 🌱 KCP: drop redundant get machines #8912
- 🌱 KCP: cache unstructured #8913
- 🌱 Cache unstructured in Cluster, MD and MS controller #8916
- 🌱 util: cache list calls in cluster to objects mapper #8918
- 🌱 cluster/topology: use cached MD list in get current state #8922
- 🌱 KCP: cache secrets between LookupOrGenerate and ensureCertificatesOwnerRef #8926
- 🌱 all: Add flags to enable block profiling #8934
- 🌱 cluster/topology: use cached Cluster get in Reconcile #8936
- 🌱 cache secrets in KCP, CABPK and ClusterCacheTracker #8940
- Speed up provisioning of the first set of worker machines by improving predicates on cluster watch #8835
- Watches on remote cluster expires every 10s #8893
Follow-up
Anomalies found that we should further triage:
- /convert gets called a lot (even though we never use old apiVersions)
- When deploying > 1k clusters into a single namespace, "list machines" calls in KCP become pretty slow and apiserver CPU usage was very high (8-14 CPUs). (Debug ideas: CPU profile, apiserver tracing.)
Backlog improvement ideas:
- KCP:
  - (breaking change) Create an issue requiring that all KCP secrets have the cluster-name label, then configure the KCP cache & client to only cache secrets with that label (see the first sketch after this list).
  - EnsureResource: resources are cached at the moment. Consider only caching PartialObjectMetadata instead.
  - Consider caching only the pods we care about (at least control-plane pods; check whether we access other pods such as kube-proxy and CoreDNS)
  - GetMachinesForCluster: cached call + wait-for-cache safeguards
  - Optimize etcd client creation (cache clients instead of recreating them)
- Others:
  - Change all CAPI controllers to cache unstructured objects per default and use APIReader for uncached calls, as we already do for regular typed objects (see the first sketch after this list)
  - Audit all usages of APIReader to check whether they are actually necessary
  - Run certain operations less frequently (e.g. apiVersion bump, reconcile labels)
  - Customize the controller work queue rate limiter (see the second sketch after this list)
  - Buffered reconciling (avoid frequent reconciles of the same item within a short period of time)
  - Resync items spread out over time instead of all at once at resyncPeriod
  - Investigate whether a Reconciler re-reconciles all objects for every type it is watching (because resync is implemented on the informer level), e.g. whether the KCP controller reconciles after the KCP resync and again after the Cluster resync.
  - Priority queue
  - Use the CR transform option to strip parts of objects we don't use (fields which are not part of the contract)
    - Trade-off: memory vs. processing time to strip fields; also unclear how to configure this up front before we know the CRDs.
    - => Based on the data so far we don't know if it's worth it, so we won't do it for now.
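For the secret-caching and cache-unstructured items above, a minimal sketch of where such settings would live, assuming a recent controller-runtime (these option names have moved around between versions). This is an illustration, not the actual KCP/CAPI wiring; the function name is made up.

```go
// Hypothetical manager setup: cache only labelled Secrets and cache
// unstructured objects by default.
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func newManager() (ctrl.Manager, error) {
	// Selects objects that have the cluster-name label at all (Exists).
	secretSelector, err := labels.Parse("cluster.x-k8s.io/cluster-name")
	if err != nil {
		return nil, err
	}
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// Only cache Secrets carrying the cluster-name label, instead of
		// holding every Secret in the management cluster in memory.
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				&corev1.Secret{}: {Label: secretSelector},
			},
		},
		// Cache unstructured objects by default; APIReader remains available
		// for deliberately uncached reads.
		Client: client.Options{
			Cache: &client.CacheOptions{Unstructured: true},
		},
	})
}
```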
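And for the work queue rate limiter item, a minimal sketch using the client-go workqueue helpers; the controller, backoff values and qps/burst numbers are purely illustrative, not recommendations.

```go
// Hypothetical controller setup with a custom work queue rate limiter.
package example

import (
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func setupWithCustomRateLimiter(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&clusterv1.Machine{}).
		WithOptions(controller.Options{
			// Per-item exponential backoff from 1s to 5m, combined with an
			// overall bucket limiter of 10 qps / burst 100.
			RateLimiter: workqueue.NewMaxOfRateLimiter(
				workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 5*time.Minute),
				&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
			),
		}).
		Complete(r)
}
```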