Computing

Upon connecting to or logging into the cluster, unless Connecting via Jupyter Notebook/Lab, users access the cluster via its login nodes. Login nodes are special hosts whose sole purpose is to provide a gateway to the compute nodes and their computational resources. For more information, see Role of the login nodes.

Computational resources (such as memory, cores, runtime, CPU type, GPU type, etc.) on the compute nodes are managed by a job scheduler. Any CPU-, GPU- or memory-intensive computing task should be performed within either interactive sessions or batch jobs scheduled on the cluster's compute nodes.

This page describes:

Computational resources on the Hoffman2 Cluster

Node types

A summary of the types of nodes that you will encounter while using the Hoffman2 Cluster and a description of their intended use are given in the Types of nodes on the Hoffman2 Cluster table:

Types of nodes on the Hoffman2 Cluster

login nodes

Upon connecting to the Hoffman2 Cluster via a terminal and SSH at the fully qualified domain name hoffman2.idre.ucla.edu, or connecting via remote desktop at either nx.hoffman2.idre.ucla.edu or x2go.hoffman2.idre.ucla.edu, you access a login node. Login nodes are meant for light-weight tasks such as working on your code and submitting jobs to the scheduler. Login nodes are a resource shared by many concurrent users and are not intended for heavy-weight tasks. Please see Role of the login nodes.

CPU-based compute nodes

Most of the nodes on the Hoffman2 Cluster are CPU-based compute nodes. These are the nodes where your jobs execute; they can be accessed interactively via the qrsh command or used for batch job execution.

GPU-based compute nodes

A portion of the compute nodes on the Hoffman2 Cluster is equipped with one or more GPU cards of various types (see the GPU cards publicly available on the Hoffman2 Cluster table). Please refer to Role of GPU nodes to see which workloads are best suited to run on these nodes, and to GPU access to learn how to request an interactive session or a batch job on a GPU node.

The Hoffman2 Cluster has a number of compute nodes available to the entire UCLA community. Additionally, research groups can purchase dedicated compute nodes.

Group-owned nodes

Group-owned nodes allow users in the owning group to run jobs (interactive or batch) on their computational resources for an extended runtime (up to fourteen days). Moreover, the portion of the jobs submitted to owned resources that can be concurrently allocated on them is guaranteed to start within twenty-four hours of submission (the wait time is typically shorter). Node ownership also allows users in the group to access any currently unused resources owned by other groups for a runtime of up to 24 hours.

If your group is interested in purchasing nodes, please visit: Purchasing additional resources.

Highp vs shared vs campus jobs

In the Hoffman2 Cluster jargon, jobs submitted to owned resources are referred to as highp jobs, while jobs submitted to other groups' currently unused resources are referred to as shared jobs. Jobs submitted by users in groups that have not purchased nodes are limited to running on IDRE-owned resources for up to 24 hours; jobs from these users are referred to as campus jobs and the users as campus users.

See also: Job scheduling policy.

Jobs and resources

To prevent resource contention and to distribute computations across the cluster's compute nodes, any CPU-, GPU- or memory-intensive task should be executed on compute nodes, either by requesting an interactive session (for interactive work) or by submitting a batch job to the scheduler.

Note

Compute nodes on the Hoffman2 Cluster generally run simultaneous jobs from multiple users. To prevent automatic job termination and to ensure the performance of every job, it is important to request the right amount of resources when submitting a batch job or requesting an interactive session.

To learn which computational resources can be requested and how, please see:

Note

If no attributes are specified, the scheduler assumes that a batch job or interactive session will use 1 core and 1 GB of memory, and that it will run for 2 hours on any available compute node on the cluster; the job is dispatched accordingly.
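
For reference, a request that spells out these defaults explicitly (the runtime and memory complexes are described in the next section) is equivalent to issuing qrsh with no options at all:

$ qrsh -l h_rt=2:00:00,h_data=1G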

Requesting resources (other than cores)

Within the Univa Grid Engine (UGE) (the job scheduler currently running on the cluster), any resource other than the number of computing cores (to request cores, see Requesting multiple cores) can be requested via key-value pairs, known as complexes, passed as arguments of the -l option to qsub or qrsh. Some of the complexes that you might need to use are shown in the Principal requestable resources table:

Principal requestable resources

name of key   type      default value   specifies
h_rt          TIME      2:00:00         runtime
h_data        MEMORY    1G              memory per process
h_vmem        MEMORY    1G              memory per job
exclusive     BOOLEAN   TRUE            reserve node(s) for exclusive use
highp         BOOLEAN   TRUE            run on owned resources
arch          STRING    NONE            specify processor type
gpu           BOOLEAN   TRUE            run on GPU nodes
P4            BOOLEAN   TRUE            run on GPU node w/ P4 card
RTX2080Ti     BOOLEAN   TRUE            run on GPU node w/ RTX2080Ti cards
V100          BOOLEAN   TRUE            run on GPU node w/ V100 cards
A100          BOOLEAN   TRUE            run on GPU node w/ A100 cards
cuda          RSMAP     1               number of GPU cards on same node

A complete list of complexes defined on the Hoffman2 Cluster can be obtained by issuing the following command in a terminal connected to the Hoffman2 Cluster:

$ qconf -sc

Examples of how to request resources

To request a runtime of, for example, 12 hours, use:

$ qrsh -l h_rt=12:00:00

Note

Several resources can be requested either as a comma-separated list of key-value pairs following a single -l option, or as separate -l key=value options. For example, to request a runtime of 3 hours and 4GB of memory:

$ qrsh -l h_rt=3:00:00,h_data=4G

or:

$ qrsh -l h_rt=3:00:00 -l h_data=4G

Requesting multiple cores

If you are planning to run an application that will use more than one CPU core, you should request cores by passing the -pe <parallel environment> <n> option (where <parallel environment> is the name of the parallel environment and <n> is the number of cores that you are planning to use) to the qrsh or qsub commands.

A list of the principal parallel environment names and their role is given in the parallel environment table.

Principal parallel environments table

name     allocation rule                        use
shared   cores are allocated on a single host   shared-memory jobs
dc\*     cores are allocated on any host        distributed-memory jobs
node     one core per node                      use with -l exclusive for hybrid distributed/shared memory jobs

Examples of how to request multiple cores

To run an application that uses multiple cores in shared memory (e.g., threads, OpenMP, etc.), request, for example, 12 cores with:

$ qrsh -pe shared 12

A complete list of parallel environments available on the Hoffman2 Cluster can be obtained by issuing the command:

$ qconf -spl

Requesting multiple GPU cards on the same node

GPU nodes with RTX2080Ti or A100 cards on the Hoffman2 Cluster have multiple GPU cards per node, as shown in the Number of cards per GPU node table:

Number of cards per GPU node

GPU card type   Number of cards per node   Scheduler option to request number of cards
A100            4                          -l gpu,A100,cuda={1,4}
V100            1                          -l gpu,V100,cuda=1
RTX2080Ti       2                          -l gpu,RTX2080Ti,cuda={1,2}
P4              1                          -l gpu,P4,cuda=1

Requesting interactive sessions

Basic usage

An interactive session allows you to access computing resources (e.g., cores, memory, GPUs, etc.) on the nodes comprising the cluster for a given amount of time. To request an interactive session, from a terminal connected to the Hoffman2 cluster issue the command:

$ qrsh

After issuing the command above, the shell prompt typically returns after a short wait, and your prompt changes to display the compute node on which your interactive session is running. For example, user joebruin could experience the following change in prompt:

[joebruin@login3 ~]$ qrsh
[joebruin@n2001 ~]$

from the login node, login3, to the compute node, n2001.

Note

Unless you have requested otherwise, by default you have access to 1GB of memory, one computing core and a 2-hour runtime on any node on the cluster that is available to you.

Customizing the qrsh command

The qrsh command can be customized to request the needed runtime, amount of memory, number of cores, whether the requested cores should come from one or more compute nodes, the type of CPU, the type of GPU, and many other requestable characteristics. Each resource is specified by a comma-separated list of key-value pairs, known as complexes, following the -l option to qrsh. The number of cores is specified by the -pe option to qrsh, followed by a space-separated pair of items: the name of the parallel environment (suitable for shared, distributed or hybrid memory use) and the integer number of cores requested.

qrsh command to run serial jobs

Serial jobs use one compute core and therefore there is no need to specify the parallel environment and the number of cores. To get an interactive session with a runtime longer than the default 2 hours and more memory than the default 1GB, you will need to specify a value for the scheduler complex h_rt (runtime) and a value for the complex h_data (memory).

For example, to request an interactive session with a runtime of 3 hours and a total of 4GB of memory, issue at the Hoffman2 command prompt:

$ qrsh -l h_rt=3:00:00,h_data=4G

Warning

The scheduler is configured to automatically terminate jobs that attempt to use more memory than requested or that run past the requested time limit. Make sure to request enough memory and runtime to keep your interactive session active.

qrsh command to run shared memory jobs

If your application spawns multiple threads or, more generally, uses multiple cores in a shared-memory parallelization paradigm, you will need to request the number of cores you are planning to use with the -pe shared <n> directive (where <n> is the number of cores requested).

For example, to request 4 CPU cores, a runtime of 8 hours, and 2GB of memory per core, issue:

$  qrsh -l h_rt=8:00:00,h_data=2G,h_vmem=8G -pe shared 4

Note

h_data is memory per CPU process. If your job spawns threads under one single CPU process, the memory limit on that process is h_data, despite the fact that you have reserved multiple cores. To ensure that the scheduler does not automatically terminate a shared-memory job that uses threads, request h_vmem equal to the product of h_data and the number of cores requested.
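
For instance, for an 8-core shared-memory session with 2GB per core, h_vmem should be set to 8 x 2GB = 16GB:

$ qrsh -l h_rt=2:00:00,h_data=2G,h_vmem=16G -pe shared 8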

qrsh command to run distributed memory jobs

If the program you intend to run in the interactive session can run across multiple nodes (using message-passing libraries), you will need to request cores with -pe dc\* <n> (where <n> is the number of cores requested).

For example, to request 16 CPU cores, a runtime of 1 hour, and 2GB of memory per core, issue:

$ qrsh -l h_rt=1:00:00,h_data=2G -pe dc\*  16

Warning

Do not specify h_vmem when choosing the -pe dc* parallel environment!

qrsh command to run hybrid distributed/shared memory jobs

If your program can execute in shared memory within a node and in distributed memory across nodes (for example, it can use OpenMP in combination with MPI), you should request an interactive session with multiple nodes and all the cores within them. To do so, use the combination of the node parallel environment and the exclusive complex.

For example, to request 3 entire nodes for a runtime of 1 hour, with each node having at least 36 GB of memory, issue:

$ qrsh -l h_rt=1:00:00,h_data=36G,exclusive -pe node  3 -now n

qrsh attempts to start interactive sessions immediately; to prevent your request from exiting when resources are not immediately available, add -now n, as in the example above.

Warning

Requesting one or more nodes in exclusive mode may cause a relatively long wait time before the interactive session is awarded. If you need this type of resources, consider running your job in batch.

qrsh command to run on exclusively reserved nodes

When invoking an interactive session with qrsh, the proper memory size needs to be specified via h_data. If you are unsure of what amount of memory is appropriate for your interactive session, you can add -l exclusive to your qrsh command. In this case, the h_data value is used by the scheduler to select a compute node with a total amount of memory equal to or greater than what is specified with h_data, and the memory limit for the job is the compute node's physical memory size.

For example, the command:

$ qrsh -l h_rt=8:00:00,h_data=32G,exclusive

will start an interactive session on a compute node equipped with at least 32G of physical memory. The node will be exclusively reserved for you, and you can therefore use all of its cores and memory (regardless of the h_data value).

Note

You can only request as much memory as is available on nodes on the cluster. Interactive sessions requested via qrsh without specifying an h_data value are automatically assigned h_data=1G, which may be too small for your application.

qrsh command to run on your group’s nodes

Warning

The following section does not apply to you if your research group has not purchased Hoffman2 compute nodes.

To run on your group nodes, add the -l highp switch to your qrsh command. For example, to request an interactive session with a duration of two days (48 hours), 4GB of memory (and one core), issue the command:

$ qrsh -l highp,h_rt=48:00:00,h_data=4G

You can also request multiple cores using -pe dc\* <n>, -pe shared <n> or -l exclusive -pe node <n>, as described in Requesting multiple cores. When combining these with -l highp, the number of cores, or the memory requested, needs to be compatible with what is available on your group's compute nodes. Contact user support should you have any questions.

Although you are allowed to specify h_rt as high as 336 hours (14 days) for a qrsh session, it is not recommended: if the network connection is interrupted (e.g., your laptop or desktop computer goes into sleep mode), the qrsh session may be lost, possibly terminating all programs running within it.

qrsh examples

Note

Multiple resources can be requested with the -l option to qrsh. The key=value complexes need to be given as a comma-separated list without any white space in between (e.g., -l key1=value1,key2=value2). Alternatively, separate -l options can be specified (e.g., -l key1=value1 -l key2=value2).

  • To request a single processor for 24 hours from the interactive queues, issue the command:

$ qrsh -l h_rt=24:00:00,h_data=1G
  • To request 8 processors for 4 hours (total 8*1G=8GB memory) on a single node from the interactive queues, issue the command:

$ qrsh -l h_rt=4:00:00,h_data=1G,h_vmem=8G -pe shared 8
  • To request 4 processors for 3 hours (total 4*1G=4GB memory) on a single node, issue the command:

$ qrsh -l h_rt=3:00:00,h_data=1G,h_vmem=4G -pe shared 4
  • To request 12 processors, 1GB of memory per processor, for 2 hours, issue the command:

Warning

Do not specify h_vmem with -pe dc\*!

$ qrsh -l h_data=1G,h_rt=2:00:00 -pe dc\* 12

Note

The 12 CPUs are distributed across multiple compute nodes. The backslash \ in dc\* is significant when you issue this command in an interactive csh/tcsh unix shell.

qrsh startup time

A qrsh session is scheduled along with all other jobs managed by the scheduler software. The shorter the requested runtime (the -l h_rt option) and the fewer the processors requested (the -pe option), the better your chance of getting a session quickly. Request just what you need for the best use of computing resources, and be considerate to other users by exiting your qrsh session when you are done, so that the computing resources are released to others.

Resource limitation

The Hoffman2 Cluster's compute nodes have different memory sizes. When you request more than one core (using -pe shared <n>), the total memory requested on the node will be the product of the number of cores and the memory per core (h_data). In general, the larger the total memory requested, the longer the wait. Please refer to the output of the command:

$ qhost

to see what total memory is available on the various nodes on the cluster, keeping in mind that not all hosts may be accessible to you.

When you request multiple cores, or a large amount of total memory, you may or may not get the interactive session immediately, depending on how busy the cluster is and the permission level of your account. To see which classes of nodes (memory, number of cores, etc.) you have access to, enter the following at the Hoffman2 command prompt:

$  myresources

Interpreting error messages

Occasionally, you may encounter one of the following messages: error: no suitable queues or qrsh: No match.

If you receive the no suitable queues message and you are requesting the interactive queues (-l i), be sure you have not requested more than 24 hours. This message may mean there is something incompatible with the various parameters you have specified and your qrsh session can never start. For example, you have requested -l h_rt=25:00:00 but your userid is not authorized to run sessions or jobs for more than 24 hours.

If your session could not be scheduled, first try your qrsh command again in case it was a momentary problem with the scheduler.

If your session still cannot be scheduled, try lowering either the value of h_rt, the number of processors requested, or both, if possible.

Contact user support should you still have problems.

Running MPI with qrsh

The following instructions apply to the IntelMPI and the OpenMPI libraries. They may not apply to other MPI implementations.

After requesting an interactive session to run distributed memory jobs, you will need to select the version of IntelMPI/OpenMPI and to set the environment for the scheduled job. In the following example the executable MPI program is named foo.

In the qrsh session at the shell prompt, enter one of the following commands:

If you are in a bash or sh-type shell and you need a specific version of IntelMPI (say: intel/19.0.5):

$ module load intel/19.0.5           # load the intel/19.0.5 module
$ . /u/local/bin/set_qrsh_env.sh     # set the environment for the scheduled job
$ `which mpirun` -n $NSLOTS ./foo    # run the foo MPI executable

If needed, you can replace $NSLOTS with an integer smaller than the number of processors you requested on your qrsh command.
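
For example, to launch foo on only 4 of the granted slots (the count 4 is purely illustrative):

$ `which mpirun` -n 4 ./foo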

You do not have to create a hostfile and pass it to mpiexec.hydra with its -machinefile or -hostfile option because mpiexec.hydra automatically retrieves that information from UGE.

Additional tools

Additional scripts are available that may help you run other parallel distributed memory software. You can enter these commands at the compute node’s shell prompt:

$ get_pe_hostfile

Returns the contents of the UGE pe_hostfile file for the current qrsh session. If you have used the -pe directive to request multiple processors on multiple nodes, you will probably need to tell your program the names of those nodes and how many processors have been allocated on each node. This information is unique to your current qrsh session.

To create an MPI-style hostfile named hfile in the current directory:

$ get_pe_hostfile | awk '{print $1" slots="$2}' > hfile

The UGE pe_hostfile is located:

$SGE_ROOT/$SGE_CELL/spool/node/active_jobs/sge_jobid.1/pe_hostfile

where node and sge_jobid are the hostname and UGE $JOB_ID, respectively, of the current qrsh session.

To return the value of JOB_ID for the current qrsh session, issue the command:

$ get_sge_jobid

To return the contents of the scheduler environment file for the current qrsh session, issue:

$ get_sge_env

which is used by the set_qrsh_env scripts.

UGE-specific environment variables are defined in the file:

$SGE_ROOT/$SGE_CELL/spool/node/active_jobs/sge_jobid.1/environment

or,

$SGE_ROOT/$SGE_CELL/spool/node/active_jobs/sge_jobid.sge_taskid/environment

where node and sge_jobid are the hostname and UGE $JOB_ID, respectively, of the current qrsh session, and sge_taskid is the task number of an array job ($SGE_TASK_ID).

Problems with the instructions on this section? Please send comments here.

Submitting batch jobs

In order to run a non-interactive batch job under the Univa Grid Engine (UGE), you need to specify the resources and the number of cores that your job will need and the actual command (or a recipe consisting of multiple commands) to execute.

In this section the following topics are discussed:

Use qsub with a submission script

A submission script allows you to set the environment for your job (for example by loading a needed module) and/or to codify a sequence of commands (for example for actions that need to occur in sequence).

Once you have generated a submission script you can submit your job with:

$ qsub <submission-script>

where: <submission-script> is the name of your submission script.

You can also define (or redefine) resources at the command line. For example, to request the complexes key1=value1 and key2=value2 and to change the parallel environment or the number of cores requested (say, to shared and 8), you could use:

$ qsub -l key1=value1,key2=value2 -pe shared 8 <submission-script>

Note

The resources, parallel environment and number of cores requested as options to qsub on the command line take precedence over the resources, parallel environment and number of cores specified within the submission script.
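
For example, if a submission script requests #$ -l h_rt=1:00:00,h_data=1G and #$ -pe shared 1 in its preamble (as the basic submission script below does), the following command overrides those values for this submission only:

$ qsub -l h_rt=8:00:00,h_data=4G -pe shared 4 submit_job.sh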

Use qdel to terminate a job

After a job is submitted, you can use qdel to terminate it:

$ qdel <JOB_ID>

where <JOB_ID> is the job ID of the job being terminated. The job ID can be displayed by the myjobs command.
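
For example, to look up the job ID of a running job and terminate it (the job ID below is purely illustrative):

$ myjobs            # list your pending and running jobs with their job IDs
$ qdel 4753410      # terminate the job with job ID 4753410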

How to build a submission script

In this section an example of a basic submission script (written in the bash shell scripting language) is described. You can copy and paste the script into a file on the cluster. The script should be modified (as instructed in its comment lines) to suit your requirements in terms of resources, number of cores, job environment and the actual commands that you need to run.

Basic submission script
#### submit_job.sh START ####
#!/bin/bash
#$ -cwd
# error = Merged with joblog
#$ -o joblog.$JOB_ID
#$ -j y
## Edit the line below as needed:
#$ -l h_rt=1:00:00,h_data=1G
## Modify the parallel environment
## and the number of cores as needed:
#$ -pe shared 1
# Email address to notify
#$ -M $USER@mail
# Notify when
#$ -m bea

# echo job info on joblog:
echo "Job $JOB_ID started on:   " `hostname -s`
echo "Job $JOB_ID started on:   " `date `
echo " "

# load the job environment:
. /u/local/Modules/default/init/modules.sh
## Edit the line below as needed:
module load gcc/4.9.3

## substitute the command to run your code
## in the two lines below:
echo '/usr/bin/time -v hostname'
/usr/bin/time -v hostname

# echo job info on joblog:
echo "Job $JOB_ID ended on:   " `hostname -s`
echo "Job $JOB_ID ended on:   " `date `
echo " "
#### submit_job.sh STOP ####

To submit the job issue at the command line:

$ chmod u+x submit_job.sh
$ qsub submit_job.sh

To help you understand the Basic submission script, its parts are analyzed in the following sections.

Submission script preamble

#### submit_job.sh START ####
#!/bin/bash
#$ -cwd
# error = Merged with joblog
#$ -o joblog.$JOB_ID
#$ -j y
## Edit the line below as needed:
#$ -l h_rt=1:00:00,h_data=1G
## Modify the parallel environment
## and the number of cores as needed:
#$ -pe shared 1
# Email address to notify
#$ -M $USER@mail
# Notify when
#$ -m bea

The submission script preamble contains the resources information (lines starting with: #$ -l and #$ -pe) that the scheduler needs to properly dispatch the job. You will need to edit these lines to match your needs (see: Requesting resources (other than cores) and Requesting multiple cores to learn how to do so). The meaning of other scheduler-specific lines is explained in the Principal options to the qsub command.

Lines starting with #$ are interpreted by the scheduler, while lines starting with # are comments inserted for clarity and lines starting with ## are meant to inform you which lines you should modify.

Submission script logging abilities

Lines starting with echo, once the job is running, will output to the file joblog.$JOB_ID useful information about the node on which the job is running, the start and end time, and the command that is being executed.

Submission script: setting the job environment

The part of submit_job.sh that loads the environment for the job is:

# load the job environment:
. /u/local/Modules/default/init/modules.sh
## Edit the line below as needed:
module load gcc/4.9.3

You should modify the module load gcc/4.9.3 line and add as many module load <app> lines as needed (see: Environmental modules).

Submission script: recipe to run the command

Finally, the part of submit_job.sh that actually expresses the command(s) to run is:

## substitute the command to run your code
## in the two lines below:
echo '/usr/bin/time -v hostname'
/usr/bin/time -v hostname

In this example the command to be run is the unix command hostname, which simply returns the name of the host on which the job is running. The command is executed from within /usr/bin/time -v, which will output to the file joblog.$JOB_ID useful information about the job's resource consumption.

Note

The environment variable $JOB_ID is set up by the Univa Grid Engine scheduler to uniquely identify each of your jobs. Should you need to contact support about a job please provide its $JOB_ID.

Running array jobs

If you need to perform a series of operations, each independent of the others, you can consider breaking them into independent tasks, each running as its own job. In this circumstance you can use the Univa Grid Engine array job function. An array job is an array of identical tasks, differentiated only by an index number and treated by the scheduler as a series of jobs.

To access this function of the Univa Grid Engine scheduler you will need to add to the submission script preamble the line:

#$ -t lower-upper:interval

where the arguments lower, upper and interval of the -t option represent the boundaries of the index associated with each task in the series of jobs. Their values are available within each job in the array through the environment variables $SGE_TASK_FIRST, $SGE_TASK_LAST and $SGE_TASK_STEPSIZE.
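
For example, a line such as (the values are purely illustrative):

#$ -t 1-10:2

creates five tasks whose $SGE_TASK_ID values are 1, 3, 5, 7 and 9; within each task, $SGE_TASK_FIRST is 1, $SGE_TASK_LAST is 10 and $SGE_TASK_STEPSIZE is 2.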

The environment variable $SGE_TASK_ID is the index variable for each task in the array job; it can be used as the index of a loop which, instead of being executed serially, is executed in parallel by the independent tasks. To clarify this, an array job example is given below.

Array job example

As an example of an array job, let's consider the operation of adding two vectors. In this particular example, vector v1 (whose 49 components go sequentially from 1 to 49) and vector v2 (whose 49 components go in decreasing order from 99 to 51) are added to form vector v3 (whose 49 components are all equal to 100). This is a toy example with a merely didactic purpose.

To understand how the process works, we will first perform the operation sequentially with, for example, the following script:

add_two_vectors_sequentially.sh
#### add_two_vectors_sequentially.sh START ####
#!/bin/bash

# create new vector data files for v1 and v2
for i in `seq 1 49`;do
 if [ $i == 1 ]; then
     echo $i > v1.dat
     echo $((100-$i)) > v2.dat
 else
     echo $i >> v1.dat
     echo $((100-$i)) >> v2.dat
 fi
done

# now add and save in v3.dat:
for i in `seq 1 49`;do
 # use the unix command sed -n ${line_number}p to read by line
 v1_c=`sed -n ${i}p v1.dat`
 v2_c=`sed -n ${i}p v2.dat`
 v3_c=$((v1_c+v2_c))
 if [ $i == 1 ]; then
     echo $v3_c > v3.dat
 else
     echo $v3_c >> v3.dat
 fi
done
#### add_two_vectors_sequentially.sh STOP ####

After creating this script, you can submit it for batch execution with:

$ chmod u+x add_two_vectors_sequentially.sh # mark the script as executable
$ qsub -l h_rt=200,h_data=100M -o joblog -j y add_two_vectors_sequentially.sh

This computation, however, could also be broken into a number of tasks, each of which performs the addition of one particular component of the vectors v1 and v2. To do so, you first need to create the files for the vectors v1 and v2, for example with the script:

create_vectors.sh
#### create_vectors.sh START ####
#!/bin/bash

# create new vector data files for v1 and v2
for i in `seq 1 49`;do
 if [ $i == 1 ]; then
     echo $i > v1.dat
     echo $((100-$i)) > v2.dat
 else
     echo $i >> v1.dat
     echo $((100-$i)) >> v2.dat
 fi
done
#### create_vectors.sh STOP ####

which you can execute by issuing at the command line:

$ chmod u+x create_vectors.sh
$ ./create_vectors.sh

You will then need to modify the add_two_vectors_sequentially.sh script that performs the addition to look like:

add_by_component.sh
#### add_by_component.sh START ####
#!/bin/bash

if [ -e  v1.dat ]; then
   # use the unix command sed -n ${line_number}p to read by line
   c_v1=`sed -n ${SGE_TASK_ID}p v1.dat`
else
   c_v1=0
fi

if [ -e v2.dat ]; then
   # use the unix command sed -n ${line_number}p to read by line
   c_v2=`sed -n ${SGE_TASK_ID}p v2.dat`
else
   c_v2=0
fi

c_v3=$((c_v1+c_v2))

echo $c_v3 > v3_${SGE_TASK_ID}.dat
#### add_by_component.sh STOP ####

Note

The index $i of the add_two_vectors_sequentially.sh script has been replaced by the $SGE_TASK_ID environment variable in the add_by_component.sh script, and the for loop is gone.

To submit the script add_by_component.sh for batch execution you could use the submission script:

Array Job submission script
#### submit_arrayjob.sh START ####
#!/bin/bash
#$ -cwd
# error = Merged with joblog
#$ -o joblog.$JOB_ID.$TASK_ID
#$ -j y
## Edit the line below as needed:
#$ -l h_rt=200,h_data=50M
## Modify the parallel environment
## and the number of cores as needed:
#$ -pe shared 1
# Email address to notify
#$ -M $USER@mail
# Notify when
#$ -m bea
#$ -t 1-49:1

# echo job info on joblog:
echo "Job $JOB_ID.$SGE_TASK_ID started on:   " `hostname -s`
echo "Job $JOB_ID.$SGE_TASK_ID started on:   " `date `
echo " "

# load the job environment:
. /u/local/Modules/default/init/modules.sh
## Edit the line below as needed:
#module load gcc/4.9.3

## substitute the command to run your code
## in the two lines below:
echo '/usr/bin/time -v ./add_by_component.sh'
/usr/bin/time -v ./add_by_component.sh

# echo job info on joblog:
echo "Job $JOB_ID.$SGE_TASK_ID ended on:   " `hostname -s`
echo "Job $JOB_ID.$SGE_TASK_ID ended on:   " `date `
echo " "
#### submit_arrayjob.sh STOP ####

which you can then submit with:

$ chmod u+x submit_arrayjob.sh
$ qsub submit_arrayjob.sh

In this example the script add_by_component.sh will run 49 times, each time operating on one of the components of vectors v1 and v2 by reading the line corresponding to $SGE_TASK_ID of the files v1.dat and v2.dat.

To stitch the vector v3 back together, you can use a script like:

stitch_v3.sh
#### stitch_v3.sh START ####
#!/bin/bash

for i in `seq 1 49`;do
 if [ $i == 1 ]; then
   cat v3_$i.dat > v3.dat
 else
   cat v3_$i.dat >> v3.dat
  fi
done
#### stitch_v3.sh STOP ####

which you can then execute with:

$ chmod u+x stitch_v3.sh   # mark the script as executable
$ ./stitch_v3.sh

The file v3.dat will now contain the 49 components of the vector v3.

Note

You can run the scripts create_vectors.sh and stitch_v3.sh from the command line (without being in an interactive session) because the two scripts do not require much in terms of resources, as this is a toy example. Should your pre- and post-array-job tasks require more resources, you should submit them as batch jobs or run them from within an interactive session.

Problem with these instructions? Please let us know.

Parallel MPI jobs

For a parallel MPI job you need to have a line that specifies a parallel environment:

#$ -pe dc* n

The maximum number of cores requested, n, that you should use depends on your account's access level.
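
A minimal submission-script sketch for an MPI job, following the conventions of the basic submission script above; the module version intel/19.0.5 and the executable name foo are taken from the qrsh MPI example earlier on this page and should be adapted to your case:

#### submit_mpi.sh START ####
#!/bin/bash
#$ -cwd
#$ -o joblog.$JOB_ID
#$ -j y
## Edit runtime, memory and number of cores as needed:
#$ -l h_rt=4:00:00,h_data=2G
#$ -pe dc* 16

# load the job environment and an MPI implementation:
. /u/local/Modules/default/init/modules.sh
module load intel/19.0.5

# run the MPI executable foo on all slots granted by the scheduler:
`which mpirun` -n $NSLOTS ./foo
#### submit_mpi.sh STOP ####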

Multi-threaded/OpenMP jobs

For a multi-threaded OpenMP job you need to request that all processors be on the same node by using the shared parallel environment.

#$ -pe shared n

where the maximum n, the number of slots requested, can be no larger than the number of CPU cores of a compute node.
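
For example, a minimal sketch of the relevant submission-script lines for an 8-thread OpenMP program (the executable name my_openmp_prog is a placeholder); exporting OMP_NUM_THREADS=$NSLOTS keeps the number of threads consistent with the number of slots requested:

#$ -l h_rt=2:00:00,h_data=2G
#$ -pe shared 8

# match the number of OpenMP threads to the slots granted by the scheduler:
export OMP_NUM_THREADS=$NSLOTS
./my_openmp_prog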

How to reserve one (or more) entire node(s)

To get one or more entire nodes for parallel jobs, use -pe node* n -l exclusive, where n is the number of nodes you are requesting.

Example of requesting 2 whole nodes with qsub:

$ qsub -pe node 2 -l exclusive mysubmissionscript.sh

Example of requesting 3 whole nodes in the preamble of a submission script:

#$ -l exclusive
#$ -pe node 3

How to run on owned nodes

To run a batch job on owned nodes:

$  qsub -l highp[,other-options] mysubmissionscript.sh

Example of requesting to run on owned nodes in the preamble of a submission script:

#$ -l highp[,other-options]

Use qsub to submit a binary from the command line

For example, suppose that you want to run the binary program $HOME/fortran/hello_world; you can submit the job from the Hoffman2 command line with:

$ qsub -l h_rt=200,h_data=50M -o $SCRATCH/hello_world.out -j y -M $USER@mail -m bea -b y $HOME/fortran/hello_world

Principal options to the qsub command

-l h_rt=200,h_data=50M

requests the type of resources to be used by the command hello_world

-o $SCRATCH/hello_world.out

sets the path to where the standard output stream of the job will be written

-j y

specifies that the standard error stream of the job is merged into the standard output stream

-M $USER@mail

defines the email address to notify (please leave this field unchanged)

-m bea

defines when to notify the recipient with an email (in this case it will notify at the beginning of the job, b, at the end of the job, e, and if the job is aborted or rescheduled, a)

-b y

gives the user the possibility to indicate explicitly that the command to be executed (the binary hello_world in this case) is to be treated as a binary (by default qsub assumes -b n and therefore expects a script as the input command to run)

$HOME/fortran/hello_world

the input command to qsub

To see a complete list of options that you can pass to the command qsub please issue:

$ man qsub

and also refer to the Requesting resources (other than cores) and Requesting multiple cores sections.

hello_world example

As an example of batch submission of a binary, the procedure to generate the binary hello_world and submit it to the queues for batch execution is described in what follows.

Note

The steps below can be performed either in an interactive session or on a login node, as the editing and compilation of this particular example do not represent a demanding computational task.

  1. create the directory $HOME/fortran if needed and cd to it:

    $ bash
    $ module unload gcc # fall back to default gcc in case a different one was loaded
    $ if [ ! -d $HOME/fortran ]; then mkdir $HOME/fortran; fi; cd $HOME/fortran
    
  2. using any of the editors available on the cluster, paste the following lines into the file hello_world.f in the current directory (i.e., $HOME/fortran):

    c     hello_world START
          program hello_world
          print *, 'Hello World!'
          end program hello_world
    c     hello_world STOP
    
  3. compile the program:

    $ gfortran -o hello_world hello_world.f
    
  4. check that the executable binary, hello_world, runs:

    $ ./hello_world
    

    should give you:

    Hello World!
    
  5. submit the executable binary, hello_world, for batch execution:

    $ qsub -l h_rt=200,h_data=50M -o $SCRATCH/hello_world.out -j y -M $USER@mail -m bea -b y $HOME/fortran/hello_world
    
  6. once you get your email that the job has completed you can check the output with:

    $ cat $SCRATCH/hello_world.out
    

    which should look like:

    Hello World!
    

Queue scripts

Each IDRE-provided queue script is named for a type of job or application. The queue script builds a UGE command file for that particular type of job or application. A queue script can be run either as a single command to which you provide appropriate options, or as an interactive application, which presents you with a menu of choices and prompts you for the values of options.

For example, if you simply enter a queue script command such as:

job.q

without any command-line arguments, the queue script will enter its interactive mode and present you with a menu of tasks you can perform. One of these tasks is to build the command file, another is to submit a command file that has already been built, and another is to show the status of jobs you have already submitted. See queue scripts for details, or select Info from any queue script menu, or enter man queue at a shell prompt.

You can also enter myjobs at the shell prompt to show the status of jobs you have submitted and which have not already completed. You can also enter groupjobs at the shell prompt to show the status of pending jobs everyone in your group has submitted. Enter groupjobs -help for options.

IDRE-provided queue scripts can be used to run the following types of jobs:

  • Serial Jobs

  • Serial Array Jobs

  • Multi-threaded Jobs

  • MPI Parallel Jobs

  • Application Jobs

Serial jobs

A serial job runs on a single thread on a single node. It does not take advantage of multi-processor nodes or the multiple compute nodes available with a cluster.

To build or submit an UGE command file for a serial job, you can either enter:

job.q [queue-script-options]

or, you can provide the name of your executable on the command line:

job.q [queue-script-options] name_of_executable [executable-arguments]

When you enter job.q without the name of your executable, it will interactively ask you to enter any needed memory, wall-clock time limit, and other options, and ask you if you want to submit the job. You can quit out of the queue script menu and edit the UGE command file, which the script built, if you want to change or add other Univa Grid Engine options before you submit your job.

If you did not submit the command file at the end of the menu dialog and decided to edit the file before submitting it, you can submit your command file using either a queue script Submit menu item, or the qsub command:

qsub executable.cmd

When you enter job.q with the name of your executable, it will by default build the command file using defaults for any queue script options that you did not specify, submit it to the job scheduler, and delete the command file that it built.

Serial array jobs

Array jobs are serial jobs or multi-threaded jobs that use the same executable but different input variables or input files, as in parametric studies. Users typically run thousands of jobs with one submission.

The UGE command file for a serial array job will, at the minimum, contain the UGE keyword statement for a lower index value and an upper index value. By default, the index interval is one. UGE keeps track of the jobs using the environment variable SGE_TASK_ID, which varies from the lower index value to the upper index value for each job. Your program can use SGE_TASK_ID to select the input files to read or the options to be used for that particular run.

If your program is multi-threaded, you must edit the UGE command file built by the jobarray.q script and add an UGE keyword statement that specifies the shared parallel environment and the number of processors your job requires. In most cases you should request no more than 8 processors because the maximum number of processors on most nodes is 8. See the For a multi-threaded OpenMP job section for more information.

To build or submit an UGE command file for a serial array job, enter:

jobarray.q

For details, see the section Running an Array of Jobs Using UGE.

Multi-threaded jobs

Multi-threaded jobs are jobs which will run on more than one thread on the same node. Programs using the OpenMP-based threaded library are a typical example of those that can take advantage of multi-core nodes.

If you know your program is multi-threaded, you need to request that UGE allocate multiple processors. Otherwise your job will contend for resources with other jobs that are running on the same node, and all jobs on that node may be adversely affected. The queue script will prompt you to enter the number of tasks for your job. The queue script default is 4 tasks. You should request at least as many tasks as your program has threads, but usually no more than 8 tasks because the maximum number of processors on most nodes is 8. See the scalability benchmarks in the GPU cards publicly available on the Hoffman2 Cluster table for information on how to determine the optimal number of tasks.

To build or submit an UGE command file for a multi-threaded job, enter:

openmp.q

For details, see OpenMP programs and Multi threaded programs.

MPI parallel jobs

MPI parallel jobs are executable programs linked with one of the message-passing libraries, such as OpenMPI. These applications explicitly send messages from one node to another using either a Gigabit Ethernet (GE) interface or an Infiniband (IB) interface. IDRE recommends that everyone use the Infiniband interface because the latency for message passing is shorter with the IB interface than with the GE interface.

When MPI jobs are submitted to the cluster, one needs to tell the UGE scheduler how many processors are needed to run the jobs. The queue script will prompt you to enter the number of tasks for your job. The queue script default for generic jobs is 4 parallel tasks. Please see the scalability benchmarks at GPU cards publicly available on the Hoffman2 Cluster table for information on how to determine the optimal number of tasks.

To build or submit an UGE command file for a parallel job, enter:

mpi.q

For details, see the How to Run MPI section.

Application jobs

An application job is one which runs software provided by a commercial vendor or is open source. It is usually installed in system directories (e.g., MATLAB).

To build or submit an UGE command file for an application job, enter:

application.q

where application is replaced with the name of the application. For example, use matlab.q to run MATLAB batch jobs. For details, see Software and its subsequent links on how to run each package or program.

Batch job output files

When a job has completed, UGE messages will be available in the stdout and stderr files that were defined in your UGE command file with the -o and -e or -j keywords. Program output will be available in any files that your program has written.

If your UGE command file was built using a queue script, stdout and stderr from UGE will be found in one of:

jobname.joblog
jobname.joblog.$JOB_ID
jobname.joblog.$JOB_ID.$SGE_TASK_ID (for array jobs)

Output from your program will be found in one of:

jobname.out
jobname.out.$JOB_ID
jobname.output.$JOB_ID
jobname.output.$JOB_ID.$SGE_TASK_ID (for array jobs)

Problems with the instructions on this section? Please send comments here.

GPU access

How to access GPU nodes

There are multiple GPU types available in the cluster. Each type of GPU has a different compute capability, memory size and clock speed, among other things. Please refer to table GPU cards publicly available on the Hoffman2 Cluster to see what GPUs are currently available.

GPU cards publicly available on the Hoffman2 Cluster

GPU type              Compute capability   Number of CUDA cores   Global memory size
A100                  8.0                  6912                   80 GB
Tesla V100            7.0                  5120                   32 GB
GeForce RTX 2080 Ti   7.5                  4352                   11 GB
Tesla P4              6.1                  2560                   8 GB

Note

Your group may have access to other types of GPU cards not listed here.

In order to use one or more nodes that have one or more GPU cards, your qrsh/qsub request will need to include both the gpu keyword and the keyword referring to a specific GPU card. To see a list of such keywords, please refer to the table How to request publicly available GPU cards on the Hoffman2 Cluster. You may need to compile your code on a machine that has the required type of GPU.

How to request publicly available GPU cards on the Hoffman2 Cluster

GPU type              scheduler complex   scheduler option
A100                  A100                -l gpu,A100,cuda=1
Tesla V100            V100                -l gpu,V100,cuda=1
GeForce RTX 2080 Ti   RTX2080Ti           -l gpu,RTX2080Ti,cuda=1
Tesla P4              P4                  -l gpu,P4,cuda=1

To request multiple GPUs on A100 or RTX2080Ti nodes, increase the value of the cuda complex to the desired number (up to 4 on A100 nodes and up to 2 on RTX2080Ti nodes). The scheduler options reported in the table How to request publicly available GPU cards on the Hoffman2 Cluster can be combined with other scheduler options (for a list, see: Principal requestable resources and the Principal parallel environments table), for example:

$ qrsh -l gpu,P4,cuda=1,h_rt=3:00:00

To see the specifics of a particular GPU node, enter the following at a GPU node (g-node) shell prompt:

$ gpu-device-query.sh

CUDA

Various CUDA versions are installed on the Hoffman2 Cluster. To see which versions of CUDA are available, please issue:

$ module av cuda

Note

You will be able to load a cuda module only when on a GPU node; you can, however, see how a cuda modulefile will change your environment by issuing:

$ module show cuda

After requesting an interactive session on a GPU node, to load a specific version, use:

$ module load cuda/<VERSION>

where VERSION is one of the versions listed in the output of: module av cuda.

CUDA Samples

Precompiled samples from the NVIDIA GPU Computing Software Development Kit are generally available in the directory pointed to by the $CUDA_SAMPLES environment variable:

$ module load cuda
$ echo $CUDA_SAMPLES
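
For example, to browse the samples and copy one of them into your current directory for testing on a GPU node (replace the <sample-dir> placeholder with one of the directories listed by ls):

$ module load cuda
$ ls $CUDA_SAMPLES                      # browse the available precompiled samples
$ cp -r $CUDA_SAMPLES/<sample-dir> .    # copy a sample of your choice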

Problems with the instructions on this section? Please send comments here.

Monitoring resource utilization

While the job is running

Open a terminal on the Hoffman2 cluster and issue at the command line:

$ check_usage

If you have any interactive sessions or batch jobs running, the check_usage command will give you a snapshot of the current resource utilization of each job on each node on which your jobs are running. The command will also inform you of the resources you have requested for each job.

After the job has completed

To check the scheduler accounting logs you will need to know your job's $JOB_ID. For example, for a $JOB_ID equal to 4753410 you would use:

$ qacct -j 4753410

and inspect the maxvmem field.
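
For example, to extract the memory high-water mark along with a few other accounting fields from the qacct output:

$ qacct -j 4753410 | grep -E 'maxvmem|ru_wallclock|exit_status|failed'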