Commit 352da37
manav-a authored and facebook-github-bot committed
Remove experimental hot spare policy (#948)
Summary:
Pull Request resolved: #948

This diff removes the experimental hot spare restart policy and uses role restart instead, with quorum hosts set to the min nodes requirement. This isn't ideal, since it leaves no good way to differentiate between elasticity-based and quorum-based restarts, but we can address that later by supporting quorum restarts differently.

Differential Revision: D61746435
1 parent bfce4bd commit 352da37

2 files changed: +2 -10 lines changed

torchx/specs/api.py

Lines changed: 1 addition & 8 deletions
@@ -237,17 +237,12 @@ class RetryPolicy(str, Enum):
                 application to deal with failed replica departures and
                 replacement replica admittance.
     2. APPLICATION: Restarts the entire application.
-    3. HOT_SPARE: Restarts the replicas for a role as long as quorum (min_replicas)
-                  is not violated using extra hosts as spares. It does not really support
-                  elasticity and just uses the delta between num_replicas and min_replicas
-                  as spares (EXPERIMENTAL).
-    4. ROLE: Restarts the role when any error occurs in that role. This does not
+    3. ROLE: Restarts the role when any error occurs in that role. This does not
              restart the whole job.
     """
 
     REPLICA = "REPLICA"
     APPLICATION = "APPLICATION"
-    HOT_SPARE = "HOT_SPARE"
     ROLE = "ROLE"
 
 
@@ -347,8 +342,6 @@ class Role:
             and num_replicas depending on the cluster resources and
             policies. If the scheduler doesn't support auto scaling this
             field is ignored and the job size will be num_replicas.
-            EXPERIMENTAL: For HOT_SPARE restart policy this field is used to
-            indicate the quorum required for the job to run.
         max_retries: max number of retries before giving up
         retry_policy: retry behavior upon replica failures
         resource: Resource requirement for the role. The role should be scheduled

torchx/specs/test/api_test.py

Lines changed: 1 addition & 2 deletions
@@ -273,7 +273,6 @@ def test_retry_policies(self) -> None:
                 RetryPolicy.APPLICATION,
                 RetryPolicy.REPLICA,
                 RetryPolicy.ROLE,
-                RetryPolicy.HOT_SPARE,
             },
         )
 
@@ -494,7 +493,7 @@ def test_resolve_from_str(self) -> None:
                     "foo=bar,test_key=test_value,default_time=42,enable=True,disable=False,complex_list=v1;v2;v3"
                 )
             ),
-        ),
+        )
 
     def test_config_from_json_repr(self) -> None:
         opts = runopts()
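
Note for callers: since `RetryPolicy.HOT_SPARE` is gone after this change, any stored config that still names it would fail an enum lookup. The following is a minimal sketch of one way to handle that, not part of this commit or of TorchX. The `RetryPolicy` class here is a local stand-in that mirrors the three members left in `torchx/specs/api.py`, and `migrate_retry_policy` is a hypothetical helper name. Mapping `HOT_SPARE` to `ROLE` follows the commit summary; setting `min_replicas` to the required quorum is left to the caller.

```python
from enum import Enum


class RetryPolicy(str, Enum):
    """Local stand-in mirroring the members that remain after this commit."""

    REPLICA = "REPLICA"
    APPLICATION = "APPLICATION"
    ROLE = "ROLE"


def migrate_retry_policy(name: str) -> RetryPolicy:
    """Hypothetical helper: resolve a stored policy name to a current member.

    "HOT_SPARE" no longer exists, so it is mapped to ROLE as the commit
    summary describes. Any other unknown name raises ValueError from the
    normal enum value lookup.
    """
    if name == "HOT_SPARE":
        return RetryPolicy.ROLE
    return RetryPolicy(name)


if __name__ == "__main__":
    for stored in ("REPLICA", "HOT_SPARE", "ROLE"):
        print(stored, "->", migrate_retry_policy(stored).value)
```

Keeping the mapping in one small function means the special case can be deleted once no stored configs reference the old name.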