Note: the instructions below do not apply to the debugging machine 199.60.17.247.
You cannot (and should not) use slurm on the debugging machine.
However, please be judicious with your core usage on that machine (no more than 2-3 cores), as you might otherwise be trampling on yourself and others (a sketch of limiting cores follows below).
First test your workload with an interactive slurm session before submitting a batch job.
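One way to keep within 2-3 cores on the debugging machine is to pin your process to specific CPUs with the standard taskset utility. This is only a sketch; a.out is a placeholder for whatever you actually run.

# Pin a hypothetical test run to cores 0-2 on the debugging machine
$ taskset -c 0-2 ./a.out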
For your assignments, you have been provided access to two parallel servers. Since these servers are shared by all students, a workload manager called Slurm is set up so that your assignment jobs/processes do not interfere with each other. For example, slurm makes sure that CPUs get allocated to different students' jobs such that no two processes/jobs operate on (or fight for) the same CPU. In this tutorial, we show how to work with slurm so that you can submit your processes/jobs and check their results.
Important: Students must use slurm to run any tests on the servers. Students bypassing/misusing the server system (e.g., running experiments directly instead of via the slurm workload manager) will get zero in their respective assignments.
The first step is to create a job file that can be run by slurm. The job file is simply a bash script whose header contains slurm commands as comments. These slurm commands request specific resources from slurm.
An example job file is shown below:
#!/bin/bash
#
#SBATCH --cpus-per-task=4
#SBATCH --time=2:00
#SBATCH --mem=1G
srun /home/$USER/a.out
WARNING: note that /home/$USER/a.out is just the standard executable command you would type; you can substitute it with any command you typically type in a shell.
Labs and assignments in this course will require you to set M5_PATH and LAB_PATH. Make sure you set those prior to invoking srun.
For instance, for the gem5 lab a job file would look like this:
#!/bin/bash
#
#SBATCH --cpus-per-task=8
#SBATCH --time=10:00
#SBATCH --mem=5G
export M5_PATH=/data/gem5-baseline
srun $M5_PATH/build/RISCV/gem5.opt --debug-start=0 --debug-end=60000 --debug-file=trace.out --outdir=Simple_m5out --debug-flags=Event,ExecAll,FmtFlag ./gem5-config/run_micro.py Simple SingleCycle ./benchmarks/hello.riscv.bin
Note that there is no space between # and SBATCH in the above scripts.
The above job file requests the following resources from slurm:
Parameter | Description |
---|---|
--cpus-per-task=4 | Request 4 CPUs for your job. All threads/processes created by your job will run on these 4 CPUs. You can request up to 16 CPUs for a job. |
--time=2:00 | Request 2 minutes of time for your job (format is MM:SS). Note that your job will be killed if it does not finish execution within the 2-minute time limit. You can request up to 10 minutes for a job. |
--mem=1G | Request 1 GB of memory for your job. Note that your job will be killed if it uses more than 1 GB. |
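Putting these parameters together, here is a sketch of a job file that requests the maximum CPUs and time stated above. The executable name my_test.out and the 2G memory value are placeholders, not course-mandated values.

#!/bin/bash
#
#SBATCH --cpus-per-task=16
#SBATCH --time=10:00
#SBATCH --mem=2G
srun /home/$USER/my_test.out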
_Running an interactive job_
srun --ntasks-per-node=8 --pty bash
# This will drop you into a shell running under slurm
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
37 normal zsh [userid] R 0:03 1 ARCH-750
exit
# drops you back into your non-slurm shell
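For an assignment-sized interactive test, you can also combine the resource flags from the table above with --pty bash. This is only a sketch; the a.out run inside the shell is a placeholder for your own command.

$ srun --cpus-per-task=4 --time=2:00 --mem=1G --pty bash
$ ./a.out    # quick interactive test of your workload (placeholder executable)
$ exit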
See https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html#creating-a-job for more details and options.
Students are allowed to perform the following tasks. Stepping beyond these limits will result in errors (explained at the bottom of this page).
Submit (or queue) up to 4 jobs in slurm
After creating the job file, you need to submit it to slurm so that slurm can schedule it to run on the server. Let's say our job file is called submit.sh. You can submit this job file as shown below:
$ sbatch submit.sh
Submitted batch job 36
Notice that slurm automatically prints Submitted batch job <jobid> upon running sbatch. In the above example, the jobid for our submit.sh is 36.
Upon successful submission, your job will be part of the slurm queue. Whenever enough resources become available for your job, slurm will schedule your job to be executed. When a job is executed, an output file called slurm-<jobid>.out is created in the directory from which sbatch was originally run. All the logs related to the job get appended to this file.
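For example, to view the log of job 36 from the example above once it has run:

$ cat slurm-36.out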
Note: You can queue up to 4 jobs at a time:
$ sbatch submit.sh
Submitted batch job 37
$ sbatch submit.sh
Submitted batch job 38
$ sbatch submit.sh
Submitted batch job 39
$ sbatch submit.sh
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
If you submit more than 4 jobs at a time, you will get the above error. In this case, you should wait for one of your jobs to finish, after which you can submit your next job.
Recall that sbatch queues your job in the slurm queue. You can check the job queue using the following command:
$ squeue
To see only your own jobs in the queue, you can run:
$ squeue --user $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
38 debug submit.s slurmtes PD 0:00 1 (QOSMaxCpuPerUserLimit)
39 debug submit.s slurmtes PD 0:00 1 (QOSMaxCpuPerUserLimit)
36 debug submit.s slurmtes R 0:08 1 cs-cloud-03
37 debug submit.s slurmtes R 0:08 1 cs-cloud-03
Here, jobs 38 and 39 are pending (status: PD) because the maximum CPU limit per user (8 CPUs) has been reached by 36 and 37 (status: R). These jobs will be picked up when the limits can be satisfied.
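If you want to keep an eye on the queue while your jobs run, one sketch (using the standard watch utility, assuming it is installed on the server) is:

$ watch -n 10 squeue --user $USER    # refresh your view of the queue every 10 seconds; Ctrl-C to stop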
Your job can be in one of these states:
State | Description |
---|---|
CA | Cancelled: The job was explicitly cancelled |
CD | Completed: The job has terminated with an exit code of zero |
CG | Completing: The job is in the process of completing |
F | Failed: The job terminated with non-zero exit code or other failure condition |
PD | Pending: The job is awaiting resource allocation |
R | Running: The job is currently running |
TO | Timeout: The job terminated upon reaching its time limit |
## 4. Cancelling a Job
You can remove your jobs from the slurm queue. This might be required if you realize that your solution is incorrect or your job is requesting incorrect resources.
$ scancel <jobid>
Note: You can cancel only your own jobs.
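For example, to cancel the pending job 38 from the queue shown earlier and confirm that it is gone:

$ scancel 38
$ squeue --user $USER    # job 38 should no longer appear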
The sacct command gives you more information about your job:
$ sacct -j 37
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
37 submit.sh debug students 4 COMPLETED 0:0
37.batch batch students 4 COMPLETED 0:0
37.0 a.out students 4 COMPLETED 0:0
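If you want specific details such as runtime or peak memory, sacct also accepts a --format option; the field names below are standard sacct fields, shown here only as a sketch:

$ sacct -j 37 --format=JobID,JobName,Elapsed,MaxRSS,State,ExitCode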
The errors below are the ones you are most likely to run into, along with their solutions.
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
Solution: Make sure that the resources you request are within the limits set above (jobs queued, memory, CPUs, wall time).
slurmstepd-cs-cloud-03: error: Job 40 hit memory limit at least once during execution. This may or may not result in some failure.
This means your job used more memory than requested, and hence was killed. Solution: Request more memory with --mem in your job script.
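For example, if the 1G request from the earlier job file is not enough, you could raise it in the job script (2G here is just an illustrative value):

#SBATCH --mem=2G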
slurmstepd-cs-cloud-03: error: *** JOB 41 ON cs-cloud-03 CANCELLED AT 2019-08-27T13:51:31 DUE TO TIME LIMIT ***
This means your job ran for longer than requested, and hence was killed. Solution: Request more time with --time in your job script.
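Similarly, if your job times out, raise the time request in the job script (5:00 is just an illustrative value within the 10-minute limit):

#SBATCH --time=5:00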