Commit b95f24c

manav-a authored and facebook-github-bot committed
Add a new experimental restart policy for large scale model training (pytorch#922)
Summary:
Pull Request resolved: pytorch#922
TSIA
Reviewed By: andywag
Differential Revision: D58684341
1 parent 5058b6b commit b95f24c

1 file changed: +5 -1 lines changed

torchx/specs/api.py

Lines changed: 5 additions & 1 deletion
@@ -237,11 +237,13 @@ class RetryPolicy(str, Enum):
             application to deal with failed replica departures and
             replacement replica admittance.
     2. APPLICATION: Restarts the entire application.
-
+    3. APPLICATION_HOT_SPARE: Restarts the replicas for a role as long as quorum (min_replicas)
+            is not violated using extra hosts as spares. (EXPERIMENTAL)
     """
 
     REPLICA = "REPLICA"
     APPLICATION = "APPLICATION"
+    APPLICATION_HOT_SPARE = "APPLICATION_HOT_SPARE"
 
 
 class MountType(str, Enum):
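
Because `RetryPolicy` subclasses both `str` and `Enum`, the new member can be looked up by its string value and compared directly against plain strings. The following is a minimal sketch of that behavior, not code from this commit: the enum is redefined locally so the example runs without torchx installed, and `parse_retry_policy` is a hypothetical helper that does not exist in torchx.

```python
from enum import Enum


class RetryPolicy(str, Enum):
    # Values copied from the diff above; redefined here so the sketch
    # runs without torchx installed.
    REPLICA = "REPLICA"
    APPLICATION = "APPLICATION"
    APPLICATION_HOT_SPARE = "APPLICATION_HOT_SPARE"


def parse_retry_policy(raw: str) -> RetryPolicy:
    """Hypothetical helper: look up a policy by value, ignoring case and
    surrounding whitespace. Raises ValueError for unknown values."""
    return RetryPolicy(raw.strip().upper())


policy = parse_retry_policy("application_hot_spare")
print(policy is RetryPolicy.APPLICATION_HOT_SPARE)  # True
print(policy == "APPLICATION_HOT_SPARE")  # True, since the enum mixes in str
```

Lookup by value is the standard `Enum` behavior, so a string that matches no member raises `ValueError` rather than silently falling back to another policy.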
@@ -340,6 +342,8 @@ class Role:
             and num_replicas depending on the cluster resources and
             policies. If the scheduler doesn't support auto scaling this
             field is ignored and the job size will be num_replicas.
+            EXPERIMENTAL: For APPLICATION_HOT_SPARE restart policy this field is used to
+            indicate the quorum required for the job to run.
         max_retries: max number of retries before giving up
         retry_policy: retry behavior upon replica failures
         resource: Resource requirement for the role. The role should be scheduled
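
The docstring change above ties `min_replicas` to a quorum under `APPLICATION_HOT_SPARE`. One way to read that is sketched below. This is an illustration under stated assumptions, not the scheduler's implementation: `RoleSketch` is a hypothetical stand-in that borrows a few field names from the `Role` docstring, and both helper functions are hypothetical. In particular, treating the replicas above `min_replicas` as the spares is an interpretation that the commit does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RoleSketch:
    # Hypothetical stand-in for torchx's Role; only the fields needed here.
    name: str
    num_replicas: int
    min_replicas: Optional[int] = None
    retry_policy: str = "APPLICATION"


def spare_replicas(role: RoleSketch) -> int:
    """Assumption: replicas above the quorum (min_replicas) act as spares."""
    if role.min_replicas is None:
        return 0
    return max(role.num_replicas - role.min_replicas, 0)


def quorum_holds(role: RoleSketch, healthy: int) -> bool:
    """Assumption: the job may keep running while the number of healthy
    replicas is at least min_replicas (or num_replicas when unset)."""
    quorum = role.num_replicas if role.min_replicas is None else role.min_replicas
    return healthy >= quorum


trainer = RoleSketch(
    name="trainer",
    num_replicas=10,
    min_replicas=8,
    retry_policy="APPLICATION_HOT_SPARE",
)
print(spare_replicas(trainer))   # 2
print(quorum_holds(trainer, 9))  # True: one failure, quorum of 8 still met
print(quorum_holds(trainer, 7))  # False: quorum violated
```

Under this reading, a job with 10 replicas and a quorum of 8 tolerates up to two simultaneous replica failures before the quorum is violated.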
