SLURM - Simple Linux Utility for Resource Management
The Hellgate Research Cluster uses the workload manager SLURM, a job scheduling system for Linux clusters that allocates resources to users, manages job execution, and handles job queries. The full documentation is available on the official SLURM site; here are a few items to get you started.
SLURM Commands: Sbatch
Sbatch is used to submit a job script for later execution. It is the recommended way to run jobs because it lets users specify resource allocations, schedule job times, create error logs, and keep the job running even if the terminal is closed. Batch scripts conventionally use the .sh extension. To submit such a script, run:
- sbatch <script_name.sh>
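A minimal batch script might look like the following sketch; the partition name, resource amounts, and file names are placeholders to adapt to your own work and to the partitions available on Hellgate.

#!/bin/bash
#SBATCH --job-name=my_job          # name shown in squeue
#SBATCH --partition=<partition>    # partition/queue to submit to
#SBATCH --time=01:00:00            # wall-clock limit (HH:MM:SS)
#SBATCH --cpus-per-task=4          # CPU cores for the task
#SBATCH --mem=8G                   # memory for the job
#SBATCH --output=my_job_%j.out     # standard output log (%j expands to the job ID)
#SBATCH --error=my_job_%j.err      # standard error log

# The commands to run go after the #SBATCH directives.
python my_script.py

Submitting the script with sbatch prints the ID of the newly created job.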
Sbatch can also submit multiple jobs at once using job arrays, allowing users to run one script that launches many tasks. The resource specification applies to every task in the array.
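For example (the array range and inputs here are hypothetical), ten tasks can be requested with --array, and each task can read its index from the SLURM_ARRAY_TASK_ID environment variable:

#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --array=1-10               # run tasks with indices 1 through 10
#SBATCH --output=array_%A_%a.out   # %A = array job ID, %a = task index

# Each task works on a different input chosen by its index.
python my_script.py --input=input_${SLURM_ARRAY_TASK_ID}.txt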
SLURM Commands: Srun
Srun is used to submit a job for execution or initiate job steps in real time. Srun has a wide variety of options for specifying resource requirements. A job can contain multiple job steps executing sequentially or in parallel on independent or shared resources within the job's node allocation.
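For example, srun can start an interactive shell on a compute node (the resource values here are only illustrative):

- srun --ntasks=1 --cpus-per-task=4 --mem=8G --time=01:00:00 --pty bash

Inside a batch script, each srun invocation launches a job step within the job's existing allocation.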
Because srun runs in the foreground, exiting the terminal session will terminate the SLURM job. To get around this, users can run srun inside tmux, a terminal multiplexer that creates detachable sessions, keeping the job alive while still letting them use, and later reattach to, the same terminal.
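A typical workflow, sketched with an arbitrary session name, is:

- tmux new -s slurm_session (create a named tmux session)
- srun --ntasks=1 --pty bash (start the interactive SLURM job inside tmux)
- press Ctrl-b then d to detach; the job keeps running
- tmux attach -t slurm_session (reattach to the session later)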
SLURM Commands: Squeue
Squeue reports the state of jobs or job steps. By default it lists running jobs in priority order, followed by pending jobs, also in priority order.
- squeue
Users can also have a persistent window that updates every two seconds using watch.
- watch squeue
To see only the current user's jobs:
- squeue --me
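These options can be combined, and the output columns can be customized; the format string below is just one reasonable choice:

- watch -n 2 squeue --me
- squeue --me --format="%.10i %.20j %.8T %.10M %.6D %R" (job ID, name, state, elapsed time, node count, node list or pending reason)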
SLURM Commands: Scancel
Scancel is used to cancel a pending or running job, or job step.
- scancel <job_ID>
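Scancel can also target jobs by user or by name, for example:

- scancel -u $USER (cancel all of your own jobs)
- scancel --name=<job_name> (cancel jobs with a given name)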
SLURM Commands: Sacct
Sacct is used to report job or job step accounting information about active or completed jobs.
- sacct --jobs=<job_ID> --format=JobID,JobName,Elapsed,SystemCPU,TotalCPU,AveCPU,CPUTime,MaxVMSize,MaxRSS
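Sacct can also summarize all of your jobs since a given date; the date and field list below are only an example:

- sacct --user=$USER --starttime=2024-01-01 --format=JobID,JobName,State,Elapsed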
SLURM Commands: Sinfo
Sinfo is used to report the state of partitions and nodes managed by SLURM.
- sinfo
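For a node-oriented view with more detail per node:

- sinfo -N -l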
SSHing to Nodes:
On Hellgate it is possible to SSH into a node that is currently running one of your jobs.
- ssh <node>
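A short, hypothetical sequence for checking on a running job (assuming htop is installed on the compute nodes):

- squeue --me (find the node name in the NODELIST column)
- ssh <node>
- htop (inspect the CPU and memory usage of your processes)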