Commit 0622a97

manav-a authored and facebook-github-bot committed
Add a new experimental restart policy for large scale model training
Summary: TSIA
Differential Revision: D58684341
1 parent 2ec3673 commit 0622a97

1 file changed, +5 -1 lines changed

torchx/specs/api.py

Lines changed: 5 additions & 1 deletion
@@ -219,11 +219,13 @@ class RetryPolicy(str, Enum):
                 application to deal with failed replica departures and
                 replacement replica admittance.
     2. APPLICATION: Restarts the entire application.
-
+    3. QUORUM: Restarts the replicas for a role as long as the quorum is not
+               violated. (EXPERIMENTAL)
     """
 
     REPLICA = "REPLICA"
     APPLICATION = "APPLICATION"
+    QUORUM = "QUORUM"
 
 
 class MountType(str, Enum):
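
As a rough illustration of what this hunk adds, the sketch below mirrors the three `RetryPolicy` members in a local, standalone enum. It does not import torchx, so it is a copy for illustration and not the library class. Because the enum subclasses `str`, members compare equal to their plain string values and can be looked up by value. The `parse_retry_policy` helper is hypothetical and is not part of the commit.

```python
from enum import Enum


class RetryPolicy(str, Enum):
    """Local copy of the enum as it reads after this commit (illustration only)."""

    REPLICA = "REPLICA"
    APPLICATION = "APPLICATION"
    QUORUM = "QUORUM"  # added by this commit, marked EXPERIMENTAL


def parse_retry_policy(name: str) -> RetryPolicy:
    """Hypothetical helper (not in torchx): look up a member by value, ignoring case."""
    return RetryPolicy(name.strip().upper())


if __name__ == "__main__":
    policy = parse_retry_policy("quorum")
    # str-based enums compare equal to their raw string value.
    print(policy is RetryPolicy.QUORUM, policy == "QUORUM")
```

Passing a name that is not one of the three values raises `ValueError`, which is the standard behavior of `Enum` lookup by value.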
@@ -322,6 +324,8 @@ class Role:
             and num_replicas depending on the cluster resources and
             policies. If the scheduler doesn't support auto scaling this
             field is ignored and the job size will be num_replicas.
+            EXPERIMENTAL: For Quorum Restart policy this field is used to indicate the
+            quorum required for the job to run.
         max_retries: max number of retries before giving up
         retry_policy: retry behavior upon replica failures
         resource: Resource requirement for the role. The role should be scheduled
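
The docstring change above says the field doubles as the quorum under the new policy, but the commit does not show how a scheduler enforces it. The following is a minimal sketch of one plausible reading, under these assumptions: the field in question is `min_replicas` (its name falls outside the hunk), the quorum holds while the number of healthy replicas is at least that value, and a role without `min_replicas` requires all `num_replicas`. `RoleSketch`, `quorum_size`, `quorum_holds`, and `on_replica_failure` are hypothetical names, not torchx APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RoleSketch:
    """Hypothetical stand-in for a role; only the fields this sketch needs."""

    num_replicas: int
    min_replicas: Optional[int] = None  # assumed to act as the quorum under QUORUM


def quorum_size(role: RoleSketch) -> int:
    """Assumption: with no min_replicas set, every replica is required."""
    return role.num_replicas if role.min_replicas is None else role.min_replicas


def quorum_holds(role: RoleSketch, healthy_replicas: int) -> bool:
    """True while enough replicas are healthy to satisfy the assumed quorum."""
    return healthy_replicas >= quorum_size(role)


def on_replica_failure(role: RoleSketch, healthy_replicas: int) -> str:
    """Return an illustrative action label; what follows a violation is not in the commit."""
    if quorum_holds(role, healthy_replicas):
        return "restart_failed_replicas"
    return "quorum_violated"


if __name__ == "__main__":
    role = RoleSketch(num_replicas=8, min_replicas=6)
    for healthy in (7, 6, 5):
        print(healthy, on_replica_failure(role, healthy))
```

The sketch deliberately stops at reporting a violation, since the commit only documents the policy and leaves the scheduler-side handling unspecified.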
