
Note: the instructions below do not apply to the debugging machine 199.60.17.247. You do not (and cannot) use slurm on the debugging machine. However, please be judicious with your core usage on that machine (no more than 2-3 cores), as you might otherwise be trampling on yourself and others.

First test your workload with an interactive slurm session before submitting a batch job.

Tutorial: Using Slurm Workload Manager

For your assignments, you have been provided access to two parallel servers. Since these servers will be shared by all students, a workload manager called Slurm is set up so that your assignment jobs/processes do not interfere with each other. For example, slurm will make sure that CPUs get allocated to different students’ jobs such that no two processes/jobs operate on (or fight for) the same CPU. In this tutorial, we will show how to work with slurm so that you can submit your processes/jobs and check their results.

Important: Students must use slurm to run any tests on the servers. Students bypassing/misusing the server system (e.g., directly running experiments instead of going through the slurm workload manager) will get zero in their respective assignments.

1. Creating a Job

The first step is to create a job file that can be run by slurm. The job file is simply a bash script whose header contains slurm commands as comments. These slurm commands request specific resources from slurm.
An example job file is shown below:

#!/bin/bash
#
#SBATCH --cpus-per-task=4
#SBATCH --time=2:00
#SBATCH --mem=1G

srun /home/$USER/a.out

WARNING: /home/$USER/a.out above is just the standard executable command you would normally type; you can substitute any command you typically type in a shell. Labs and assignments in this course will require you to set M5_PATH and LAB_PATH. Make sure you set those prior to invoking srun.

For instance, for the gem5 lab, a job file would look like this:

#!/bin/bash
#
#SBATCH --cpus-per-task=8
#SBATCH --time=10:00
#SBATCH --mem=5G
export M5_PATH=/data/gem5-baseline

srun $M5_PATH/build/RISCV/gem5.opt --debug-start=0 --debug-end=60000 --debug-file=trace.out --outdir=Simple_m5out --debug-flags=Event,ExecAll,FmtFlag ./gem5-config/run_micro.py Simple SingleCycle ./benchmarks/hello.riscv.bin

Note that there is no space between # and SBATCH in the above script.
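If your lab also requires LAB_PATH, export it in the same way before calling srun. A minimal sketch, assuming a hypothetical lab directory and run script (substitute the paths and command from your lab handout):

#!/bin/bash
#
#SBATCH --cpus-per-task=4
#SBATCH --time=10:00
#SBATCH --mem=2G

# Both paths below are placeholders; use the locations given in your lab instructions
export M5_PATH=/data/gem5-baseline
export LAB_PATH=/home/$USER/lab1

# Hypothetical command; replace with the command your lab asks you to run
srun $LAB_PATH/run_lab.sh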

Options:

The first job file above requests the following resources from slurm:

| Parameter | Description |
| --- | --- |
| --cpus-per-task=4 | Request 4 CPUs for your job. All threads/processes created by your job will run on these 4 CPUs. You can request up to 16 CPUs for a job. |
| --time=2:00 | Request 2 minutes of time for your job (the format is MM:SS). Note that your job will be killed if it does not finish within the 2-minute time limit. You can request up to 10 minutes for a job. |
| --mem=1G | Request 1GB of memory for your job. Note that your job will be killed if it uses more than 1GB. |

Running an Interactive Job:

    $ srun --ntasks-per-node=8 --pty bash
    # This will drop you into a shell running under slurm
    $ squeue
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                    37    normal      zsh [userid]  R       0:03      1 ARCH-750
    $ exit
    # This drops you back into your non-slurm shell
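You can also use srun to test a single command under slurm without opening an interactive shell; its output is printed directly to your terminal. A minimal sketch (the resource values and executable path are illustrative):

    $ srun --cpus-per-task=4 --time=5:00 --mem=1G /home/$USER/a.out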

See https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html#creating-a-job for more details and options.

User Limits on our Servers:

Students are allowed to perform the following tasks. Stepping beyond these limits will result in errors (explained at the bottom of this page).

  • Submit (or queue) up to 2 jobs in slurm

  • Request up to 8 CPUs per job
  • Use up to 8 CPUs in total, i.e., across all running jobs of a single user. For example, you can simultaneously run 2 jobs requesting 4 CPUs each. If you schedule 2 jobs requesting 8 CPUs each, the second job will not run until the first job is completed.
  • Request up to 40GB of memory per job. If your job exceeds this limit, it will be killed.
  • Request up to 60 minutes of time per job. If your job exceeds this limit, it will be killed. (A job header requesting these maximums is sketched below.)
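For reference, such a maximal job header would look like the following sketch (the srun line is just a placeholder for your own command):

#!/bin/bash
#
#SBATCH --cpus-per-task=8
#SBATCH --time=60:00
#SBATCH --mem=40G

srun /home/$USER/a.out

Requesting more than any of these values causes submission to fail with the QOS error described in the Common Error Messages section below.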

2. Submitting a Job

After creating the job file, you need to submit it to slurm so that slurm can schedule it to run on the server. Let’s say our job file is called submit.sh. You can submit this job file as shown below:

    $ sbatch submit.sh
    Submitted batch job 36

Notice that slurm will automatically print Submitted batch job <jobid> upon running sbatch. In the above example, the jobid for our submit.sh is 36.

Upon successful submission, your job will be part of the slurm queue. Whenever enough resources become available for your job, slurm will schedule your job to be executed. When a job is executed, an output file called slurm-<jobid>.out is created in the directory from which sbatch was originally run. All the logs related to the job get appended to this file.
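For example, once a job with id 36 is running, you can follow its output as it is produced (the job id here is illustrative):

    $ tail -f slurm-36.out

Press Ctrl-C to stop following the file; this does not affect the running job.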

Note: You can queue up to 4 jobs at a time:

    $ sbatch submit.sh
    Submitted batch job 37 
    $ sbatch submit.sh
    Submitted batch job 38
    $ sbatch submit.sh
    Submitted batch job 39
    $ sbatch submit.sh
    sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

If you submit more than 4 jobs at a time, you will get the above error. In this case, you should wait for one of your jobs to finish, after which you can submit your next job.

3. Check the Job Queue

Recall that sbatch queues your job in the slurm queue. You can check the job queue using the following command:

    $ squeue

To see only your own jobs in the queue, you can run:

    $ squeue --user $USER
                 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                    38     debug submit.s slurmtes PD       0:00      1 (QOSMaxCpuPerUserLimit)
                    39     debug submit.s slurmtes PD       0:00      1 (QOSMaxCpuPerUserLimit)
                    36     debug submit.s slurmtes  R       0:08      1 cs-cloud-03
                    37     debug submit.s slurmtes  R       0:08      1 cs-cloud-03

Here, jobs 38 and 39 are pending (status: PD) because the maximum CPU limit per user (8 CPUs) has been reached by 36 and 37 (status: R). These jobs will be picked up when the limits can be satisfied.

Your job can be in one of these states:

| State | Description |
| --- | --- |
| CA | Cancelled: the job was explicitly cancelled |
| CD | Completed: the job has terminated with an exit code of zero |
| CG | Completing: the job is in the process of completing |
| F | Failed: the job terminated with a non-zero exit code or other failure condition |
| PD | Pending: the job is awaiting resource allocation |
| R | Running: the job is currently running |
| TO | Timeout: the job was terminated upon reaching its time limit |

4. Cancelling a Job

You can remove your jobs from the slurm queue. This might be required if you realize that your solution is incorrect, or that your job is requesting incorrect resources.

    $ scancel <jobid>

Note: You can cancel only your own jobs.
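To cancel all of your queued and running jobs at once, you can pass your username instead of a job id:

    $ scancel --user $USER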

5. More Information about a Job

The sacct command gives you more information about your job:

    $ sacct -j 37
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
    ------------ ---------- ---------- ---------- ---------- ---------- -------- 
    37            submit.sh      debug   students          4  COMPLETED      0:0 
    37.batch          batch              students          4  COMPLETED      0:0 
    37.0              a.out              students          4  COMPLETED      0:0
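sacct can also report additional fields. For example, to see how long the job ran and its peak memory usage (these are standard sacct field names):

    $ sacct -j 37 --format=JobID,JobName,Elapsed,MaxRSS,State

MaxRSS is reported for the job steps (37.batch and 37.0) and is useful when choosing a sensible --mem value for future runs.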

6. Common Error Messages

  • Upon job submission:
        sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

Solution: Make sure that the resources you request are within the limits listed above (jobs queued, memory, CPUs, wall time).
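As a quick check before submitting another job, you can count how many jobs you already have queued (the -h flag suppresses the header line):

    $ squeue --user $USER -h | wc -l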

  • In job output:
        slurmstepd-cs-cloud-03: error: Job 40 hit memory limit at least once during execution. This may or may not result in some failure.

This means your job used more memory than it requested and may have been killed as a result. Solution: Request more memory with --mem in your job script.

        slurmstepd-cs-cloud-03: error: *** JOB 41 ON cs-cloud-03 CANCELLED AT 2019-08-27T13:51:31 DUE TO TIME LIMIT ***

This means your job ran for longer than requested, and hence was killed. Solution: Request more time with --time in your job script.
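In both cases, the fix is to raise the corresponding request in your job header while staying within the user limits listed earlier. A sketch, assuming your job needs roughly 10GB of memory and 30 minutes of time:

#!/bin/bash
#
#SBATCH --cpus-per-task=4
#SBATCH --time=30:00
#SBATCH --mem=10G

srun /home/$USER/a.out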