slurm: support clusters without sacct #1070

Merged
merged 1 commit into from
May 22, 2025
@d4l3k (Member) commented May 22, 2025

Many bare-bones Slurm clusters aren't configured with accounting (sacct) enabled, since it requires an external SQL database. This change makes the torchx slurm scheduler's list/describe/status methods fall back to squeue, which shows currently running jobs.

It also shows the job name in torchx list, since Slurm job IDs are just numbers.
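
For readers unfamiliar with the pattern, here is a minimal sketch of the sacct-to-squeue fallback described above. It is not the code merged in this PR: the helper names (`run_cmd`, `parse_rows`, `job_state`) are hypothetical, and treating any non-zero sacct exit as "accounting unavailable" is an assumption made for illustration.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

# Hypothetical type for an injectable command runner (makes the sketch testable).
Runner = Callable[[List[str]], str]


def run_cmd(cmd: List[str]) -> str:
    """Run a command and return its stdout; raises CalledProcessError on non-zero exit."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def parse_rows(output: str) -> List[Tuple[str, str, str]]:
    """Parse 'id|name|state' lines, skipping anything that does not have three fields."""
    rows = []
    for line in output.splitlines():
        parts = line.strip().split("|")
        if len(parts) == 3:
            rows.append((parts[0], parts[1], parts[2]))
    return rows


def job_state(job_id: str, run: Runner = run_cmd) -> Optional[Tuple[str, str, str]]:
    """Return (job_id, job_name, state), preferring sacct and falling back to squeue."""
    try:
        out = run(["sacct", "--parsable2", "--noheader",
                   "--format=JobID,JobName,State", "-j", job_id])
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Assumption: any sacct failure (e.g. accounting storage disabled)
        # means we should ask squeue instead.
        try:
            out = run(["squeue", "--noheader", "--format=%i|%j|%T", "-j", job_id])
        except subprocess.CalledProcessError:
            # squeue only knows about jobs still in the queue.
            return None
    rows = parse_rows(out)
    return rows[0] if rows else None
```

Because squeue only reports jobs still in the queue, a sketch like this returns `None` for jobs that have already left it; returning the job name alongside the numeric ID is what makes a listing readable.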

Test plan:

pytest torchx/schedulers/test/slurm_scheduler_test.py
(pytorch-3.12) ubuntu@slurm-head-node-0:~/tristanr/runner$ torchx status slurm://torchx/163
Slurm accounting storage is disabled
AppStatus:
    State: RUNNING
    Num Restarts: -1
    Roles: 
 *train[0]:RUNNING
    Msg: RUNNING
    Structured Error Msg: <NONE>
    UI URL: None
    
(pytorch-3.12) ubuntu@slurm-head-node-0:~/tristanr/runner$ torchx list -s slurm
Slurm accounting storage is disabled
APP HANDLE          APP NAME    APP STATUS
------------------  ----------  ------------
slurm://torchx/163  train-0     RUNNING

@d4l3k d4l3k requested a review from kiukchung May 22, 2025 18:58
@facebook-github-bot facebook-github-bot added the CLA Signed label May 22, 2025
@facebook-github-bot (Contributor)

@kiukchung has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@kiukchung kiukchung merged commit 24dc0d5 into main May 22, 2025
22 checks passed
@kiukchung kiukchung deleted the d4l3k/slurm_squeue branch May 22, 2025 19:43