
Commit cb1fec1 (1 parent: 7bfa26d)

Add a new experimental restart policy for large scale model training

Differential Revision: D58684341
Pull Request resolved: #922

File tree

1 file changed: +7 −1 lines


torchx/specs/api.py

Lines changed: 7 additions & 1 deletion
@@ -237,11 +237,15 @@ class RetryPolicy(str, Enum):
        application to deal with failed replica departures and
        replacement replica admittance.
     2. APPLICATION: Restarts the entire application.
-
+    3. HOT_SPARE: Restarts the replicas for a role as long as quorum (min_replicas)
+       is not violated using extra hosts as spares. It does not really support
+       elasticity and just uses the delta between num_replicas and min_replicas
+       as spares (EXPERIMENTAL).
     """

     REPLICA = "REPLICA"
     APPLICATION = "APPLICATION"
+    HOT_SPARE = "HOT_SPARE"


 class MountType(str, Enum):
class MountType(str, Enum):
@@ -340,6 +344,8 @@ class Role:
            and num_replicas depending on the cluster resources and
            policies. If the scheduler doesn't support auto scaling this
            field is ignored and the job size will be num_replicas.
+           EXPERIMENTAL: For HOT_SPARE restart policy this field is used to
+           indicate the quorum required for the job to run.
        max_retries: max number of retries before giving up
        retry_policy: retry behavior upon replica failures
        resource: Resource requirement for the role. The role should be scheduled
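Taken together with the first hunk, min_replicas now doubles as the quorum for HOT_SPARE jobs. A hedged, self-contained sketch of how a scheduler might consume it (the Role fields are trimmed from the diff; `quorum_ok` is illustrative, not torchx API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Role:
    # Trimmed-down stand-in for torchx.specs.api.Role; only the fields
    # relevant to this commit are shown. retry_policy holds a RetryPolicy
    # value ("REPLICA", "APPLICATION", or "HOT_SPARE").
    name: str
    num_replicas: int = 1
    min_replicas: Optional[int] = None
    retry_policy: str = "APPLICATION"

def quorum_ok(role: Role, healthy_replicas: int) -> bool:
    # Illustrative check: under HOT_SPARE the job keeps running as long as
    # healthy replicas do not drop below min_replicas (the quorum); other
    # policies require the full num_replicas.
    if role.retry_policy == "HOT_SPARE" and role.min_replicas is not None:
        return healthy_replicas >= role.min_replicas
    return healthy_replicas >= role.num_replicas

trainer = Role(name="trainer", num_replicas=16, min_replicas=12,
               retry_policy="HOT_SPARE")
print(quorum_ok(trainer, 13))  # True: 13 healthy >= quorum of 12
```

The four spare hosts (16 − 12) absorb replica failures without triggering a full restart, which is the point of the policy for large scale training jobs.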
