Description
What would you like to be added:
Currently, the RecreateJob and RestartJobSet failure policies both increment the same .status.restartsCountTowardsMax whenever they are triggered. When said counter reaches the value of .spec.failurePolicy.maxRestarts, triggering either RecreateJob or RestartJobSet will mark the entire JobSet as failed.
It'd be desirable to allow separate counters for RecreateJob and RestartJobSet. For example, imagine a JobSet with three ReplicatedJobs: workers-heavy, workers-light, and controllers. We want the following behaviours:
- failures in workers-heavy trigger RecreateJob up to 4 times for each child Job, as these are expensive to recreate
- failures in workers-light trigger RecreateJob up to 10 times for each child Job, as these are easy and fast to recreate
- failures in controllers trigger RestartJobSet up to 3 times total
The configuration could look something like this:
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: test
spec:
  failurePolicy:
    maxRestarts: 3 # How many times can a full JobSet restart happen?
    rules:
    - action: RecreateJob
      maxRecreatesPerIndex: 4 # How many times can a child job in workers-heavy be recreated before failing the JobSet?
      targetReplicatedJobs:
      - workers-heavy
    - action: RecreateJob
      maxRecreatesPerIndex: 10 # How many times can a child job in workers-light be recreated before failing the JobSet?
      targetReplicatedJobs:
      - workers-light
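To make the intended semantics concrete, here is a rough Python model of the counters implied by the configuration above. It is only a sketch of the proposal, not how the JobSet controller is implemented: the class and method names are hypothetical, maxRecreatesPerIndex is the field proposed in this issue rather than an existing API field, and whether per-index counters should reset after a full JobSet restart is left open (this sketch does not reset them).

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FailurePolicyState:
    """Hypothetical bookkeeping for the proposed separate counters."""

    max_restarts: int              # models spec.failurePolicy.maxRestarts
    max_recreates_per_index: dict  # ReplicatedJob name -> proposed maxRecreatesPerIndex
    restarts: int = 0              # full JobSet restarts so far
    # (ReplicatedJob name, job index) -> number of recreates so far
    recreates: dict = field(default_factory=lambda: defaultdict(int))

    def on_recreate_job(self, replicated_job: str, index: int) -> str:
        """Return the action taken when a child Job matched by a RecreateJob rule fails."""
        key = (replicated_job, index)
        if self.recreates[key] >= self.max_recreates_per_index[replicated_job]:
            return "FailJobSet"
        self.recreates[key] += 1
        return "RecreateJob"

    def on_restart_jobset(self) -> str:
        """Return the action taken when a RestartJobSet rule is triggered."""
        if self.restarts >= self.max_restarts:
            return "FailJobSet"
        self.restarts += 1
        return "RestartJobSet"


state = FailurePolicyState(
    max_restarts=3,
    max_recreates_per_index={"workers-heavy": 4, "workers-light": 10},
)
# Fail the same workers-heavy child Job (index 0) five times in a row.
heavy_results = [state.on_recreate_job("workers-heavy", 0) for _ in range(5)]
```

In this model the fifth failure of workers-heavy index 0 fails the JobSet, while other indexes, workers-light, and the RestartJobSet counter are unaffected by those recreates.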
Why is this needed:
Not all ReplicatedJob failures are the same :) In the previous example, workers-heavy Jobs take a long time to recreate, so they should fail the JobSet after fewer attempts than workers-light Jobs, which can be recreated quickly and therefore more often without incurring too heavy a cost.
This enhancement requires the following artifacts:
- Design doc
- API change
- Docs update
The artifacts should be linked in subsequent comments.