
Finer control over maxRestarts for RestartJobSet vs RecreateJob #976

@carreter

What would you like to be added:
Currently, the RecreateJob and RestartJobSet failure policy actions both increment the same counter, .status.restartsCountTowardsMax, whenever they are triggered. Once that counter reaches the value of .spec.failurePolicy.maxRestarts, the next RecreateJob or RestartJobSet action marks the entire JobSet as failed.
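
To make the current behaviour concrete, here is a rough model of the shared counter as described above. It is not the controller's code: JobSetStatus and apply_failure_action are hypothetical names made up for illustration, and only the two field names come from the API.

```python
from dataclasses import dataclass


@dataclass
class JobSetStatus:
    # Mirrors .status.restartsCountTowardsMax (the single shared counter).
    restarts_count_towards_max: int = 0
    failed: bool = False


def apply_failure_action(status: JobSetStatus, max_restarts: int, action: str) -> None:
    """Hypothetical model: RecreateJob and RestartJobSet draw from one budget."""
    if action not in ("RecreateJob", "RestartJobSet"):
        raise ValueError(f"unsupported action: {action}")
    if status.restarts_count_towards_max >= max_restarts:
        # Budget exhausted: either action now fails the whole JobSet.
        status.failed = True
        return
    status.restarts_count_towards_max += 1
```

With maxRestarts: 3, any mix of three RecreateJob or RestartJobSet actions uses up the budget, and the fourth one fails the JobSet no matter which ReplicatedJob caused it.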

It'd be desirable to allow separate counters for RecreateJob and RestartJobSet. For example, imagine a JobSet with three ReplicatedJobs: workers-heavy, workers-light, and controllers. We want the following behaviours:

  • failures in workers-heavy trigger RecreateJob up to 4 times for each child Job, as these are expensive to recreate
  • failures in workers-light trigger RecreateJob up to 10 times for each child Job, as these are easy and fast to recreate
  • failures in controllers trigger RestartJobSet up to 3 times total

The configuration could look something like this:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: test
spec:
  failurePolicy:
    maxRestarts: 3 # How many times can a full JobSet restart happen?
    rules:
    - action: RecreateJob
      maxRecreatesPerIndex: 4 # How many times can a child job in workers-heavy be recreated before failing the JobSet?
      targetReplicatedJobs:
      - workers-heavy
    - action: RecreateJob
      maxRecreatesPerIndex: 10 # How many times can a child job in workers-light be recreated before failing the JobSet?
      targetReplicatedJobs:
      - workers-light
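
A minimal sketch of how the proposed counters might behave, assuming the semantics in the example above. Everything here is hypothetical: Rule, Counters, and handle_failure are illustrative names, maxRecreatesPerIndex does not exist in the API today, and the sketch assumes that a failure matching no rule falls back to RestartJobSet and that per-index counters are not reset by a full restart. Both of those would need to be settled in the design doc.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Rule:
    action: str
    target_replicated_jobs: List[str]
    max_recreates_per_index: Optional[int] = None  # proposed field, not in the API


@dataclass
class Counters:
    jobset_restarts: int = 0  # bounded by .spec.failurePolicy.maxRestarts
    # Assumed: one counter per (replicatedJob name, job index) pair.
    recreates: Dict[Tuple[str, int], int] = field(
        default_factory=lambda: defaultdict(int)
    )
    failed: bool = False


def handle_failure(counters: Counters, rules: List[Rule], max_restarts: int,
                   replicated_job: str, job_index: int) -> str:
    """Return the action taken for a failed child Job (hypothetical semantics)."""
    rule = next((r for r in rules if replicated_job in r.target_replicated_jobs), None)
    if (rule is not None and rule.action == "RecreateJob"
            and rule.max_recreates_per_index is not None):
        key = (replicated_job, job_index)
        if counters.recreates[key] >= rule.max_recreates_per_index:
            counters.failed = True
            return "FailJobSet"
        counters.recreates[key] += 1
        return "RecreateJob"
    # Assumed fallback: a full JobSet restart with its own separate budget.
    if counters.jobset_restarts >= max_restarts:
        counters.failed = True
        return "FailJobSet"
    counters.jobset_restarts += 1
    return "RestartJobSet"
```

Under these assumptions, recreating workers-light index 1 ten times would leave the RestartJobSet budget untouched, so a later failure in controllers could still restart the JobSet up to 3 times.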

Why is this needed:
Not all ReplicatedJob failures are the same :) In the previous example, workers-heavy Jobs take a long time to recreate, so they should fail the JobSet after fewer attempts than workers-light Jobs, which can be recreated quickly and therefore more often without incurring much cost.

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.
