Running Jobs
Accounting
The charge unit for TGI RAILS is the Service Unit (SU). One SU corresponds to the equivalent use of one compute core, with its proportional share of host memory (about 5.3 GB), for one hour. Keep in mind that your charges are based on the resources that are reserved for your job and do not necessarily reflect how those resources are used. Charges are based on either the number of cores or the fraction of the node's memory requested, whichever is larger.
Node Type             | Service Unit Equivalence | GPUs   | Host Memory
--------------------- | ------------------------ | ------ | -----------
CPU core              | 1                        | N/A    | 5.3 GB
CPU Node              | 96                       | N/A    | 512 GB
GPU Node (one H100)   | 120                      | 1 H100 | 256 GB
GPU Node (8-way H100) | 960                      | 8 H100 | 2048 GB
Local Account Charging
Use the accounts command to list the accounts available for charging. Users start out with a single chargeable project but may be added to other projects over time.
$ accounts
Project Summary for User 'kingda':
Project Description Balance (Hours) Deposited (Hours)
-------------- --------------------------- ----------------- -------------------
bbka-tgirails TGI RAILS staff allocation 5000000 5000000
Job Accounting Considerations
A node-exclusive job that runs on a CPU compute node for one hour will be charged 96 SUs (96 cores x 1 hour).
A node-exclusive job that runs on an 8-way GPU node for one hour will be charged 960 SUs (8 GPUs x 120 SUs x 1 hour).
A sketch of how a partial-node charge can be estimated follows.
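The following is a minimal, illustrative sketch (not an official calculator) of estimating a partial-node CPU charge from the Service Unit table above. It assumes roughly 5.3 GB of host memory per core (512 GB / 96 cores); the core, memory, and time values are hypothetical placeholders.

#!/bin/bash
# Illustrative only: estimate the SU charge for a CPU job.
# Assumes ~5.3 GB of host memory per core, per the Service Unit table above.
cores=24        # cores requested (hypothetical)
mem_gb=256      # memory requested, in GB (hypothetical)
hours=2         # wallclock hours (hypothetical)
awk -v c="$cores" -v m="$mem_gb" -v h="$hours" 'BEGIN {
    eq = m / 5.3              # memory expressed as equivalent cores
    if (eq < c) eq = c        # charge on cores or memory, whichever is larger
    printf "Estimated charge: %.0f SUs\n", eq * h
}'

Here the memory request (256 GB is about 48 core-equivalents) exceeds the 24 cores requested, so the memory fraction drives the charge.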
QOSGrpBillingMinutes
If you see QOSGrpBillingMinutes under the Reason column for the squeue command, as in
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1204221 cpu myjob .... PD 0:00 5 (QOSGrpBillingMinutes)
then the resource allocation specified for the job (for example, xyzt-tgirails) does not have a sufficient balance to run the job, given the resources requested and the wallclock time. Other jobs from the same project that are in the QOSGrpBillingMinutes state can also prevent a job that would otherwise "fit" from running against the same resource allocation. You can check a project's remaining balance with the accounts command shown above. The PI of the project needs to submit a supplement request through the XRAS proposal system.
Reviewing job charges for a project (jobcharge)
jobcharge in /sw/user/scripts/ will show job charges by user for a project. Example usage:
[arnoldg@dt-login03 ]$ jobcharge bbka-delta-gpu -b 10 --detail | tail -15
106 1662443 gpuMI100x8 0 nan kingda bash 2023-04-06T09:39:01 0 0
107 1662444 gpuMI100x8 291 billing=1000,cpu=2,gres/gpu:mi100=1,gres/gpu=1,mem=3G,node=1 kingda bash 2023-04-06T09:44:11 1000 0.08
108 1662449 gpuMI100x8 614 billing=1000,cpu=2,gres/gpu:mi100=1,gres/gpu=1,mem=3G,node=1 kingda bash 2023-04-06T10:07:23 1000 0.17
109 1662477 gpuMI100x8 446 billing=1000,cpu=2,gres/gpu:mi100=1,gres/gpu=1,mem=3G,node=1 kingda bash 2023-04-06T10:15:08 1000 0.12
110 1662492 gpuMI100x8 760 billing=8000,cpu=2,gres/gpu:mi100=8,gres/gpu=8,mem=3G,node=1 kingda bash 2023-04-06T10:28:00 8000 1.69
111 1662511 gpuMI100x8-interactive 1521 billing=16000,cpu=128,gres/gpu:mi100=8,gres/gpu=8,mem=64G,node=1 arnoldg bash Unknown 16000 6.76
_____SUMMARY___________________
User Charge (SU)
--------------- -------------
arnoldg 25.76
babreu 6.66
kingda 2.06
rmokos 0.96
svcdeltajenkins 0.23
Total 35.67
[arnoldg@dt-login03 scripts]$ jobcharge bbka-delta-gpu -b 10
Output for 2023-03-27-11:25:24 through 2023-04-06-11:25:24:
User Charge (SU)
--------------- -------------
arnoldg 26.04
babreu 6.66
kingda 2.06
rmokos 0.96
svcdeltajenkins 0.23
Total 35.95
[arnoldg@dt-login03 ]$ jobcharge bbka-delta-gpu -h
usage: jobcharge [-h] [-m MONTH] [-y YEAR] [-b DAYSBACK] [-s STARTTIME] [-e ENDTIME] [--detail]
accountstring
positional arguments:
accountstring account name
optional arguments:
-h, --help show this help message and exit
-m MONTH, --month MONTH
Month (1-12) Default is current month
-y YEAR, --year YEAR Year (20XX) default is current year
-b DAYSBACK, --daysback DAYSBACK
Number of days back
-s STARTTIME, --starttime STARTTIME
Start time string in format (format: %Y-%m-%d-%H:%M:%S)
Example:2023-01-03-01:23:21)
-e ENDTIME, --endtime ENDTIME
End time time string in format (format: %Y-%m-%d-%H:%M:%S)
Example:2023-01-03-01:23:21)
  --detail              detail output, per-job
Performance tools
See the NVIDIA Nsight Systems documentation for profiling and performance analysis.
Sample Scripts
Serial jobs on CPU nodes
$ cat job.slurm
#!/bin/bash
#SBATCH --mem=16g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4        # <- match to OMP_NUM_THREADS
#SBATCH --partition=cpu          # <- or one of: gpuH100x8 gpuH100x8-interactive
#SBATCH --account=account_name
#SBATCH --job-name=myjobtest
#SBATCH --time=00:10:00          # hh:mm:ss for the job
### GPU options ###
##SBATCH --gpus-per-node=2
##SBATCH --gpu-bind=none         # <- or closest
##SBATCH [email protected]
##SBATCH --mail-type="BEGIN,END" # see sbatch or srun man pages for more email options

module reset                     # drop modules and explicitly load the ones needed
                                 # (good job metadata and reproducibility)
                                 # $WORK and $SCRATCH are now set
module load python               # ... or any appropriate modules
module list                      # job documentation and metadata

echo "job is starting on `hostname`"
srun python3 myprog.py
MPI on CPU nodes
#!/bin/bash
#SBATCH --mem=16g
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=2        # <- match to OMP_NUM_THREADS
#SBATCH --partition=cpu          # <- or one of: gpuH100x8 gpuH100x8-interactive
#SBATCH --account=account_name
#SBATCH --job-name=mympi
#SBATCH --time=00:10:00          # hh:mm:ss for the job
### GPU options ###
##SBATCH --gpus-per-node=2
##SBATCH --gpu-bind=none         # <- or closest
##SBATCH [email protected]
##SBATCH --mail-type="BEGIN,END" # see sbatch or srun man pages for more email options

module reset                     # drop modules and explicitly load the ones needed
                                 # (good job metadata and reproducibility)
                                 # $WORK and $SCRATCH are now set
module load gcc/11.2.0 openmpi   # ... or any appropriate modules
module list                      # job documentation and metadata

echo "job is starting on `hostname`"
srun osu_reduce
OpenMP on CPU nodes
#!/bin/bash
#SBATCH --mem=16g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32       # <- match to OMP_NUM_THREADS
#SBATCH --partition=cpu          # <- or one of: gpuH100x8 gpuH100x8-interactive
#SBATCH --account=account_name
#SBATCH --job-name=myopenmp
#SBATCH --time=00:10:00          # hh:mm:ss for the job
### GPU options ###
##SBATCH --gpus-per-node=2
##SBATCH --gpu-bind=none         # <- or closest
##SBATCH [email protected]
##SBATCH --mail-type="BEGIN,END" # see sbatch or srun man pages for more email options

module reset                     # drop modules and explicitly load the ones needed
                                 # (good job metadata and reproducibility)
                                 # $WORK and $SCRATCH are now set
module load gcc/11.2.0           # ... or any appropriate modules
module list                      # job documentation and metadata

echo "job is starting on `hostname`"
export OMP_NUM_THREADS=32
srun stream_gcc
Hybrid (MPI + OpenMP or MPI+X) on CPU nodes
#!/bin/bash
#SBATCH --mem=16g
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4        # <- match to OMP_NUM_THREADS
#SBATCH --partition=cpu          # <- or one of: gpuH100x8 gpuH100x8-interactive
#SBATCH --account=account_name
#SBATCH --job-name=mympi+x
#SBATCH --time=00:10:00          # hh:mm:ss for the job
### GPU options ###
##SBATCH --gpus-per-node=2
##SBATCH --gpu-bind=none         # <- or closest
##SBATCH [email protected]
##SBATCH --mail-type="BEGIN,END" # see sbatch or srun man pages for more email options

module reset                     # drop modules and explicitly load the ones needed
                                 # (good job metadata and reproducibility)
                                 # $WORK and $SCRATCH are now set
module load gcc/11.2.0 openmpi   # ... or any appropriate modules
module list                      # job documentation and metadata

echo "job is starting on `hostname`"
export OMP_NUM_THREADS=4
srun xthi
Four GPUs together on a compute node
#!/bin/bash
#SBATCH --job-name="a.out_symmetric"
#SBATCH --output="a.out.%j.%N.out"
#SBATCH --partition=gpuH100x8
#SBATCH --mem=208G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4      # could be 1 for py-torch
#SBATCH --cpus-per-task=16       # spread out to use 1 core per numa, set to 64 if tasks is 1
#SBATCH --constraint="scratch"
#SBATCH --gpus-per-node=4
#SBATCH --gpu-bind=closest       # select a cpu close to gpu on pci bus topology
#SBATCH --account=bbjw-tgirails
#SBATCH --exclusive              # dedicated node for this job
#SBATCH --no-requeue
#SBATCH -t 04:00:00

export OMP_NUM_THREADS=1         # if code is not multithreaded, otherwise set to 8 or 16
srun -N 1 -n 4 ./a.out > myjob.out

# py-torch example, --ntasks-per-node=1 --cpus-per-task=64
# srun python3 multiple_gpu.py
Parametric / Array / HTC jobs
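The script below is a minimal, hedged sketch of a Slurm job array on the cpu partition for a set of similar tasks. The account name, program name, and input file names are placeholders; adjust the --array range and resources to your workload.

#!/bin/bash
#SBATCH --mem=16g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --partition=cpu
#SBATCH --account=account_name     # <- placeholder
#SBATCH --job-name=myarray
#SBATCH --time=00:10:00            # hh:mm:ss per array task
#SBATCH --array=1-10               # ten independent array tasks, indices 1..10

module reset                       # drop modules and explicitly load the ones needed
module load python                 # ... or any appropriate modules
module list                        # job documentation and metadata

echo "array task $SLURM_ARRAY_TASK_ID is starting on `hostname`"
# each array task works on its own (hypothetical) input file: input_1.txt ... input_10.txt
srun python3 myprog.py input_${SLURM_ARRAY_TASK_ID}.txt

Each array task runs as an independent job; $SLURM_ARRAY_JOB_ID and $SLURM_ARRAY_TASK_ID (see the environment variable table below) identify the array and the member task.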
Interactive Sessions
Interactive sessions can be implemented in several ways depending on what is needed.
To start a bash shell terminal on a CPU or GPU node:
One task with four cores and 16 GB of memory on a CPU node:
srun --account=account_name --partition=cpu-interactive \
--nodes=1 --tasks=1 --tasks-per-node=1 \
--cpus-per-task=4 --mem=16g \
--pty bash
One GPU with 20 GB of host memory on a GPU node:
srun --account=account_name --partition=gpuH100x8-interactive \
--nodes=1 --gpus-per-node=1 --tasks=1 \
--tasks-per-node=16 --cpus-per-task=1 --mem=20g \
--pty bash
MPI interactive jobs: use salloc followed by srun
Since an interactive job started this way is already a child process of srun, you cannot run srun (or mpirun) applications from within it. In standard batch jobs submitted via sbatch, use srun to launch MPI codes. For true interactive MPI, use salloc in place of the srun commands shown above, then run "srun my_mpi.exe" after you get a prompt from salloc (type exit to end the salloc interactive allocation).
[arnoldg@dt-login01 collective]$ cat osu_reduce.salloc
salloc --account=bbka-delta-cpu --partition=cpu-interactive \
--nodes=2 --tasks-per-node=4 \
--cpus-per-task=2 --mem=0
[arnoldg@dt-login01 collective]$ ./osu_reduce.salloc
salloc: Pending job allocation 1180009
salloc: job 1180009 queued and waiting for resources
salloc: job 1180009 has been allocated resources
salloc: Granted job allocation 1180009
salloc: Waiting for resource configuration
salloc: Nodes cn[009-010] are ready for job
[arnoldg@dt-login01 collective]$ srun osu_reduce
# OSU MPI Reduce Latency Test v5.9
# Size Avg Latency(us)
4 1.76
8 1.70
16 1.72
32 1.80
64 2.06
128 2.00
256 2.29
512 2.39
1024 2.66
2048 3.29
4096 4.24
8192 2.36
16384 3.91
32768 6.37
65536 10.49
131072 26.84
262144 198.38
524288 342.45
1048576 687.78
[arnoldg@dt-login01 collective]$ exit
exit
salloc: Relinquishing job allocation 1180009
[arnoldg@dt-login01 collective]$
Interactive X11 Support
To run an X11-based application on a compute node in an interactive session, use the --x11 switch with srun. For example, to run a single-task job that uses 16 GB of memory with X11 (in this case an xterm), do the following:
srun -A account_name --partition=cpu-interactive \
--nodes=1 --tasks=1 --tasks-per-node=1 \
--cpus-per-task=2 --mem=16g \
--x11 xterm
Job Management
Batch jobs are submitted through a job script (as in the examples above) using the sbatch command. Job scripts generally start with a series of SLURM directives that describe the requirements of the job (number of nodes, wall time required, and so on) to the batch system/scheduler. SLURM directives can also be specified as options on the sbatch command line; command-line options take precedence over those in the script. The rest of the batch script consists of user commands.
The syntax for sbatch is:
sbatch [list of sbatch options] script_name
Refer to the sbatch man page for detailed information on the options.
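For example, to submit the serial job script shown earlier (assuming it was saved as job.slurm), you would run something like the following and see output similar to this (the job ID will differ):

$ sbatch job.slurm
Submitted batch job 123456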
Commands that display batch job and partition information:

SLURM EXAMPLE COMMAND   | DESCRIPTION
----------------------- | -----------
squeue -a               | List the status of all jobs on the system.
squeue -u $USER         | List the status of all your jobs in the batch system.
squeue -j JobID         | List nodes allocated to a running job in addition to basic information.
scontrol show job JobID | List detailed information on a particular job.
sinfo -a                | List summary information on all partitions.
See the manual (man) pages for other available options.
srun
The srun command initiates an interactive job on compute nodes.
For example, the following command:
srun -A account_name --time=00:30:00 --nodes=1 --ntasks-per-node=16 \
--partition=gpuH100x8 --gpus=1 --mem=16g --pty /bin/bash
will run an interactive job in the gpuH100x8 partition with a wall-clock limit of 30 minutes, using one node, 16 tasks on that node, and one GPU. You can also use other sbatch options such as those documented above.
After you enter the command, you will have to wait for SLURM to start the job. As with any job, your interactive job will wait in the queue until the specified number of nodes is available. If you request a small number of nodes for a short amount of time, the wait should be shorter because your job can backfill among larger jobs. You will see something like this:
srun: job 123456 queued and waiting for resources
Once the job starts, you will see:
srun: job 123456 has been allocated resources
and will be presented with an interactive shell prompt on the launch node. At this point, you can use the appropriate command to start your program.
When you are done with your work, you can use the exit command to end the job.
scancel
The scancel command deletes a queued job or terminates a running job.
scancel JobID deletes/terminates a job.
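For example, to cancel the job with ID 123456 used in the examples above:

$ scancel 123456

scancel prints nothing on success; run squeue -u $USER to confirm the job is gone.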
Job Status
NODELIST(REASON)
MaxGRESPerAccount - the user has exceeded the number of cores or GPUs allotted per user or project for the given partition.
Useful Batch Job Environment Variables
DESCRIPTION              | SLURM ENVIRONMENT VARIABLE                | DETAIL DESCRIPTION
------------------------ | ----------------------------------------- | ------------------
JobID                    | $SLURM_JOB_ID                             | Job identifier assigned to the job
Job Submission Directory | $SLURM_SUBMIT_DIR                         | By default, jobs start in the directory that the job was submitted from, so the "cd $SLURM_SUBMIT_DIR" command is not needed.
Machine (node) list      | $SLURM_NODELIST                           | Contains the list of nodes assigned to the batch job
Array JobID              | $SLURM_ARRAY_JOB_ID, $SLURM_ARRAY_TASK_ID | Each member of a job array is assigned a unique identifier
See the sbatch man page for additional environment variables available.
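As an illustration, the variables above can be echoed from a batch script for job documentation; the partition, account, and time values below are placeholders:

#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --account=account_name     # <- placeholder
#SBATCH --nodes=1
#SBATCH --time=00:05:00

# print the Slurm environment variables described in the table above
echo "JobID:            $SLURM_JOB_ID"
echo "Submit directory: $SLURM_SUBMIT_DIR"
echo "Node list:        $SLURM_NODELIST"
# the array variables are set only for job array members
echo "Array job/task:   $SLURM_ARRAY_JOB_ID / $SLURM_ARRAY_TASK_ID"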
Accessing the Compute Nodes
TGI RAILS implements the Slurm batch environment to manage access to the compute nodes. Use Slurm commands to run batch jobs or to get interactive access to compute nodes. See https://slurm.schedmd.com/quickstart.html for an introduction to Slurm. There are two ways to access compute nodes on TGI RAILS.
Batch jobs can be used to access compute nodes. Slurm provides a convenient, direct way to submit batch jobs; see https://slurm.schedmd.com/heterogeneous_jobs.html#submitting for details. Slurm also supports job arrays for easy management of a set of similar jobs; see https://slurm.schedmd.com/job_array.html.
Sample Slurm batch job scripts are provided in the section below.
Direct ssh access from a login node to a compute node in a running batch job is enabled once the job has started.
$ squeue --job jobid
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12345 cpu bash gbauer R 0:17 1 cn001
Then in a terminal session:
$ ssh cn001
cn001.delta.internal.ncsa.edu (172.28.22.64)
OS: RedHat 8.4 HW: HPE CPU: 128x RAM: 252 GB
Site: mgmt Role: compute
$
See also:
Scheduler
For information, consult:
https://slurm.schedmd.com/quickstart.html
the Slurm quick reference guide
Partitions (Queues)
Table. TGI RAILS Production Default Partition Values

Property                | Value
----------------------- | ----------
Default memory per core | 1000 MB
Default wallclock time  | 30 minutes
Table. TGI RAILS Production Partitions/Queues

Partition/Queue       | Node Type | Max Nodes per Job | Max Duration | Max Running in Queue/user* | Charge Factor
--------------------- | --------- | ----------------- | ------------ | -------------------------- | -------------
cpu                   | CPU       | TBD               | 48 hr        | TBD                        | 1.0
cpu-interactive       | CPU       | TBD               | 30 min       | TBD                        | 2.0
gpuH100x8             | octa-H100 | TBD               | 48 hr        | TBD                        | 1.5
gpuH100x8-interactive | octa-H100 | TBD               | 1 hr         | TBD                        | 3.0
Node Policies
Node-sharing is the default for jobs. Node-exclusive mode can be obtained by specifying all the consumable resources for that node type or adding the following Slurm options:
--exclusive --mem=0
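For example (a sketch; job.slurm and the account name are placeholders), a node-exclusive run of an existing batch script can be requested on the sbatch command line:

sbatch --exclusive --mem=0 --partition=cpu --account=account_name job.slurm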
NVIDIA MIG (GPU slicing) on the H100 GPUs may be supported at a future date.
Preemptible jobs will be supported at a future date.
Job Policies
The default job requeue/restart policy does not allow jobs to be automatically requeued or restarted.
To enable automatic requeue and restart of a job by Slurm, add the following Slurm directive:
--requeue
When a job is requeued due to an event such as a node failure, the batch script is started again from its beginning, so job scripts need to be written to handle restarting from checkpoints and similar recovery steps.
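The following is a minimal sketch of a requeue-aware batch script; the checkpoint file name and the application's restart flag are hypothetical and depend entirely on your code.

#!/bin/bash
#SBATCH --requeue                  # allow Slurm to requeue the job after, e.g., a node failure
#SBATCH --partition=cpu
#SBATCH --account=account_name     # <- placeholder
#SBATCH --nodes=1
#SBATCH --time=04:00:00

module reset
module load python                 # ... or any appropriate modules

# after a requeue the script starts again from the top, so decide whether to resume
if [ -f checkpoint.dat ]; then                        # hypothetical checkpoint file
    srun python3 myprog.py --restart checkpoint.dat   # hypothetical restart flag
else
    srun python3 myprog.py
fi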
Monitoring a Node During a Job
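As a rough sketch, once a job is running you can ssh to one of its compute nodes (see Accessing the Compute Nodes above) and use standard tools to watch it; the node name below is an example.

# find the node(s) assigned to your running job
squeue -u $USER
# ssh to one of the listed nodes (allowed only while your job is running there)
ssh cn001
top                  # CPU and memory usage
nvidia-smi           # GPU utilization, on GPU nodes
exit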
Refunds
Refunds are considered, when appropriate, for jobs that failed due to circumstances beyond user control.
Projects wishing to request a refund should email help+tgi@ncsa.illinois.edu. Please include the batch job IDs and the standard error and output files produced by the job(s).