SLURM - Simple Linux Utility for Resource Management
The Hellgate Research Cluster uses the workload manager SLURM, a job scheduling system for Linux clusters that allocates resources to users, manages job execution, and handles job queries. The full documentation is available on the official SLURM site; here are a few items to get you started.
SLURM Commands: Sbatch
Sbatch is used to submit a job script for later execution. It is the recommended way to run jobs because it lets users specify resource allocations, schedule job times, create error logs, and keep the job running even if the terminal is closed. Batch scripts conventionally use the .sh extension. To submit such a script, run:
- sbatch <script_name.sh>
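A minimal batch script might look like the following sketch; the partition name, resource amounts, and file names are placeholders to adapt to your own work and to the partitions available on Hellgate.

#!/bin/bash
#SBATCH --job-name=my_job          # name shown in squeue
#SBATCH --partition=<partition>    # partition/queue to submit to
#SBATCH --time=01:00:00            # wall-clock limit (HH:MM:SS)
#SBATCH --cpus-per-task=4          # CPU cores for the task
#SBATCH --mem=8G                   # memory for the job
#SBATCH --output=my_job_%j.out     # standard output log (%j expands to the job ID)
#SBATCH --error=my_job_%j.err      # standard error log

# The commands to run go after the #SBATCH directives.
python my_script.py

Submitting the script with sbatch prints the ID of the newly created job.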
Sbatch can also submit multiple jobs at once using job arrays, allowing users to run one script that launches many tasks. The resource specification applies to every task in the array.
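For example (the array range and inputs here are hypothetical), ten tasks can be requested with --array, and each task can read its index from the SLURM_ARRAY_TASK_ID environment variable:

#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --array=1-10               # run tasks with indices 1 through 10
#SBATCH --output=array_%A_%a.out   # %A = array job ID, %a = task index

# Each task works on a different input chosen by its index.
python my_script.py --input=input_${SLURM_ARRAY_TASK_ID}.txt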
SLURM Commands: Srun
Srun is used to submit a job for execution or initiate job steps in real time. Srun has a wide variety of options for specifying resource requirements. A job can contain multiple job steps executing sequentially or in parallel on independent or shared resources within the job's node allocation.
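For example, srun can start an interactive shell on a compute node (the resource values here are only illustrative):

- srun --ntasks=1 --cpus-per-task=4 --mem=8G --time=01:00:00 --pty bash

Inside a batch script, each srun invocation launches a job step within the job's existing allocation.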
Because srun runs in the foreground, exiting the terminal session will terminate the SLURM job. To get around this, users can run srun inside tmux, a terminal multiplexer that creates detachable sessions, keeping the job alive while still letting them use, and later reattach to, the same terminal.
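A typical workflow, sketched with an arbitrary session name, is:

- tmux new -s slurm_session (create a named tmux session)
- srun --ntasks=1 --pty bash (start the interactive SLURM job inside tmux)
- press Ctrl-b then d to detach; the job keeps running
- tmux attach -t slurm_session (reattach to the session later)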
SLURM Commands: Squeue
Squeue reports the state of jobs or job steps. By default it lists running jobs in priority order, followed by pending jobs, also in priority order.
- squeue
Users can also have a persistent window that updates every two seconds using watch.
- watch squeue
To see only the current user's jobs:
- squeue --me
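These options can be combined, and the output columns can be customized; the format string below is just one reasonable choice:

- watch -n 2 squeue --me
- squeue --me --format="%.10i %.20j %.8T %.10M %.6D %R" (job ID, name, state, elapsed time, node count, node list or pending reason)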
SLURM Commands: Scancel
Scancel is used to cancel a pending or running job, or job step.
- scancel <job_ID>
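Scancel can also target jobs by user or by name, for example:

- scancel -u $USER (cancel all of your own jobs)
- scancel --name=<job_name> (cancel jobs with a given name)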
SLURM Commands: Sacct
Sacct is used to report job or job step accounting information about active or completed jobs.
- sacct --jobs=<job_ID> --format=JobID,JobName,Elapsed,SystemCPU,TotalCPU,AveCPU,CPUTime,MaxVMSize,MaxRSS
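Sacct can also summarize all of your jobs since a given date; the date and field list below are only an example:

- sacct --user=$USER --starttime=2024-01-01 --format=JobID,JobName,State,Elapsed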
SLURM Commands: Sinfo
Sinfo is used to report the state of partitions and nodes managed by SLURM.
- sinfo
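For a node-oriented view with more detail per node:

- sinfo -N -l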
SSHing to Nodes:
On Hellgate it is possible to SSH into a node that is currently running one of your jobs.
- ssh <node>
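A short, hypothetical sequence for checking on a running job (assuming htop is installed on the compute nodes):

- squeue --me (find the node name in the NODELIST column)
- ssh <node>
- htop (inspect the CPU and memory usage of your processes)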