
Getting an interactive session

Normally, you would submit your jobs to the scheduler to be executed on the compute nodes (using the “qsub” command). However, at times it is convenient to do certain things interactively. For this purpose, some compute nodes are made available exclusively (certain limitations apply; see below) for interactive use via the “qrsh” command.

Due to the high load on the scheduler server, you may have to wait a few minutes for your interactive session to start when using the commands described below. Thank you for your patience.

Basic Usage

To get an interactive session for a longer time than the default, e.g. 4 hours:

qrsh -l h_rt=4:00:00,h_data=4G

To request more than one CPU core (for example, 4 CPU cores for 8 hours), use:

qrsh -l h_rt=8:00:00,h_data=4G -pe shared 4

Or if your job/program can run across multiple nodes (e.g. MPI), use:

qrsh -l h_rt=8:00:00,h_data=4G -pe dc\* 4
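
Once the session starts, your shell prompt is on a compute node rather than a login node. As a quick sanity check (the node name in the comment is hypothetical):

hostname
# prints the compute node's name, e.g. n9999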

A word about h_data

When invoking an interactive session with qrsh, the proper memory size needs to be specified via “h_data”. Any program started in the interactive session that attempts to use more memory than was requested with h_data will be automatically terminated by the scheduler. If you are unsure what amount of memory is appropriate for your interactive session, you can add “exclusive” to the -l options of your qrsh command. In that case, the scheduler uses the h_data value to select a compute node whose total memory is equal to or greater than h_data, and the memory limit for the job becomes the compute node’s physical memory size. For example, the command:

qrsh -l h_rt=8:00:00,h_data=32G,exclusive

will start an interactive session on a compute node equipped with at least 32G of physical memory. Of course, you can only request as much memory as is available on the cluster’s nodes. qrsh jobs submitted without specifying h_data are automatically assigned h_data=1G, which may be too small for your application.
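
If you used exclusive and want to see how much physical memory the node you landed on actually has, you can check from within the session:

free -g
# reports the node's total, used, and free memory in gigabytes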

Resource limitations

  1. You can specify up to 24 hours (-l h_rt=24:00:00) for a qrsh session.
  2. Hoffman2 cluster’s compute nodes have different memory sizes. When you use -pe shared, if the product of the number of cores and h_data is larger than 24G (e.g. for -pe shared 4 -l h_data=8G, the product is 4*8G=32G), you are limited to a smaller subset of compute nodes. If the product is more than 32G or 48G, you are limited to an even smaller subset. If the product is more than 64G, you may not get an interactive session. (A quick way to compute this product is sketched after this list.)
  3. When you request more than 4 cores, you may or may not get the interactive session immediately, depending on how busy the cluster is and the permission level of your account.
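
A quick way to compute the memory product from item 2 before you submit (a minimal bash sketch; the core count and per-core memory shown are example values):

cores=4; h_data_gb=8                    # values you plan to pass to -pe shared and h_data
echo "total: $((cores * h_data_gb))G"   # 4*8G=32G, already limited to a smaller subset of nodes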

Running on your group’s nodes

This section applies only if your research group has purchased Hoffman2 compute nodes.

To run on your group nodes, add the -l highp option, e.g.

qrsh -l highp,h_rt=48:00:00,h_data=4G

You can also request multiple cores using -pe dc\* or -pe shared, as described above. When combining these with -l highp, the number of cores and the memory size must be compatible with your group’s compute nodes. Contact user support if you are not sure.
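
For example, a multi-core highp request might look like the following (a sketch only; whether it can be satisfied depends on the hardware of your group’s nodes):

qrsh -l highp,h_rt=24:00:00,h_data=4G -pe shared 8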

Although you are allowed to specify h_rt as high as 336 hours (14 days) for a qrsh session, this is not recommended: if the network connection is interrupted (e.g. your laptop or desktop computer goes into sleep mode), the qrsh session may be lost, possibly terminating all running programs within it.

qrsh examples

Note: The parameters associated with the -l directive are separated by commas without any white space in between.

  • Request a single processor for 2 hours.
    The default memory size depends on the queue in which your session starts. For campus users, the default memory size is 1GB. Use the -l directive with the h_data parameter to request more memory.
    qrsh
  • Request a single processor for 24 hours from the interactive queues.
    qrsh -l h_rt=24:00:00,h_data=1G
  • Request 8 processors for 4 hours (total 8*1G=8GB memory) on a single node from the interactive queues.
    qrsh -l h_data=1G,h_rt=4:00:00 -pe shared 8
  • Request 4 processors for 3 hours (total 4*1G=4GB memory) on a single node.
    qrsh -l h_data=1G,h_rt=3:00:00 -pe shared 4
  • Request 12 processors, 1GB of memory per processor, for 2 hours. The 12 CPUs are distributed across multiple compute nodes. The backslash “\” in “dc\*” is significant when you issue this command in an interactive unix shell:
    qrsh -l h_data=1G,h_rt=2:00:00 -pe dc\* 12

qrsh startup time

A qrsh session is scheduled along with all other jobs managed by the scheduler software. The shorter the time (the -l h_rt option) and the fewer the processors (the -pe option), the better your chance of getting a session quickly. Request just what you need for the best use of computing resources. Be considerate of other users: exit your qrsh session when you are done to release the computing resources to others.

Interpreting error messages

Occasionally, you may encounter one of the following messages:

error: no suitable queues

or,

qrsh: No match.


If you see the no suitable queues message and you are requesting the interactive queues, make sure you have not requested more than 24 hours. This message can mean that the parameters you have specified are mutually incompatible, so that your qrsh session can never start: for example, you requested -l h_rt=25:00:00 but your userid is not authorized to run sessions or jobs for more than 24 hours.

If your session could not be scheduled, first try your qrsh command again in case it was a momentary problem with the scheduler.

If your session still cannot be scheduled, try lowering the value of h_rt, the number of processors requested, or both, if possible.
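
For example (hypothetical values), if a request for 8 cores and 24 hours is not being scheduled, a scaled-down request is more likely to start:

qrsh -l h_rt=8:00:00,h_data=2G -pe shared 4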

Running MPI with qrsh

The following instructions are specific to the Intel MPI library. They may not apply to other MPI implementations (see How to run MPI for more information).

There are two main steps to run an MPI program in a qrsh session. You need to do step #1 only once per qrsh session; you can repeatedly execute step #2 within the same qrsh session. The executable MPI program is named foo in the following examples.

Set up the environment


In the qrsh session at the shell prompt, enter one of the following commands:

  • If you are in bash shell:

source /u/local/bin/set_qrsh_env.sh

  • If you are in csh/tcsh shell:

source /u/local/bin/set_qrsh_env.csh
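
After sourcing the appropriate script, the UGE job variables should be defined in your shell. As a quick check (assuming the script sets NSLOTS, which the mpiexec example below relies on):

echo $NSLOTS
# should print the number of processors allocated to this qrsh session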

Launch your MPI program

Assume your MPI program is named foo and is located in the current directory. Run the program using all allocated processors with the command:

mpiexec.hydra -n $NSLOTS ./foo

You can replace $NSLOTS with an integer less than the number of processors you requested in your qrsh command. For example:

mpiexec.hydra -n 4 ./foo

The command to see mpiexec options is:

mpiexec.hydra -help

You do not have to create a hostfile and pass it to mpiexec.hydra with its -machinefile or -hostfile option because mpiexec.hydra automatically retrieves that information from UGE.
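
Putting both steps together, a complete session might look like this (a sketch assuming a bash shell and an MPI executable named foo in the current directory):

qrsh -l h_rt=4:00:00,h_data=2G -pe dc\* 8    # get the interactive session
source /u/local/bin/set_qrsh_env.sh          # step 1: set up the environment (once per session)
mpiexec.hydra -n $NSLOTS ./foo               # step 2: launch the MPI program (repeatable)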

Additional tools

Additional scripts are available that may help you run other parallel distributed memory software. You can enter these commands at the compute node’s shell prompt.

get_pe_hostfile

Returns the contents of the UGE pe_hostfile file for the current qrsh session.

If you have used the -pe directive to request multiple processors on multiple nodes, you will probably need to tell your program the names of those nodes and how many processors have been allocated on each node. This information is unique to your current qrsh session.

To create an MPI-style hostfile named hfile in the current directory:
get_pe_hostfile | awk '{print $1" slots="$2}' > hfile
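
The resulting hfile contains one line per allocated node; for example (hypothetical node names and slot counts):

n1234 slots=8
n5678 slots=4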
The UGE pe_hostfile is located at:

$SGE_ROOT/$SGE_CELL/spool/node/active_jobs/sge_jobid.1/pe_hostfile

or,

$SGE_ROOT/$SGE_CELL/spool/node/active_jobs/sge_jobid.sge_taskid/pe_hostfile

where node and sge_jobid are the hostname and the UGE $JOB_ID, respectively, of the current qrsh session, and sge_taskid is the task number ($SGE_TASK_ID) of an array job.

get_sge_jobid

Returns the value of UGE JOB_ID for the current qrsh session.
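
For example, you can use it to build the spool path of the current session's pe_hostfile shown above (a sketch assuming a bash shell, a non-array job, hence the .1 suffix, and that hostname prints the node name used in the spool directory):

jobid=$(get_sge_jobid)
cat $SGE_ROOT/$SGE_CELL/spool/$(hostname)/active_jobs/${jobid}.1/pe_hostfile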

get_sge_env

Returns the contents of the UGE environment file for the current qrsh session. Used by the set_qrsh_env scripts.
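
For example, to pull a single variable out of that file (assuming NSLOTS appears in the UGE environment file):

get_sge_env | grep '^NSLOTS='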

UGE-specific environment variables are defined in the file:

$SGE_ROOT/$SGE_CELL/spool/node/active_jobs/sge_jobid.1/environment

or,

$SGE_ROOT/$SGE_CELL/spool/node/active_jobs/sge_jobid.sge_taskid/environment

where node and sge_jobid are the hostname and the UGE $JOB_ID, respectively, of the current qrsh session, and sge_taskid is the task number ($SGE_TASK_ID) of an array job.
