FAQ

In an interactive session (via qrsh), I am getting “not enough memory” error messages and my application is terminated abruptly. Why?

When issuing the qrsh command, one must specify the memory size via -l h_data, which is also imposed as the virtual memory limit for the qrsh session. If the application (e.g. matlab) exceeds this limit, it will be automatically terminated by the scheduler. Each application has a different error message, but usually it contains key words like “not enough memory”, “increase your virtual memory”, or something similar. In this case, you will have to re-run the qrsh command with an increased h_data value.

Please also note that requesting an excessive amount of h_data might cause the qrsh session to wait for a long time, or even fail to start, because fewer and fewer compute nodes can meet your request as you increase the h_data value. If this is your first time launching the application in a qrsh session, we recommend gradually increasing h_data until the application runs successfully.
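One way to follow this advice is to step through a short ladder of request sizes, for example doubling each time. The helper below is only an illustrative sketch: the function name h_data_ladder, the starting size, and the cap are assumptions, not site recommendations. It prints candidate qrsh commands rather than running them.

```shell
# Hypothetical helper: print qrsh commands with h_data doubling from
# start_gb up to (and not beyond) max_gb. Nothing is submitted.
h_data_ladder() {
    local gb=$1 max_gb=$2
    while [ "$gb" -le "$max_gb" ]; do
        echo "qrsh -l h_data=${gb}G"
        gb=$(( gb * 2 ))
    done
}

# Example: candidate requests of 2G, 4G, 8G and 16G.
h_data_ladder 2 16
```

Try the smallest request first and move to the next line only if the application still reports a memory error.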

Why is my job still waiting in the queue?

The following factors may contribute to longer wait time, or jobs not starting (depending on your account’s access level):

  • Larger memory request, e.g. (h_data) or (h_data)*(-pe shared)
  • Longer run time (h_rt)
  • Specific CPU model (arch)
  • Many CPUs (-pe dc*)
  • You are already running jobs on a number of CPUs or nodes
  • For high priority jobs (-l highp):
    • Your group members are already running on your purchased nodes; there are not enough left for your job to start
    • Your request exceeds what your group nodes have
  • Hoffman2 cluster’s load

Which password do I use to login?

To log in to the Hoffman2 Cluster, use your Hoffman2 login ID and password, not your UCLA Grid Portal login/password and not your UCLA Logon ID/password.

As a user of the Hoffman2 Cluster, you will get two independent passwords:

(1) Hoffman2 login ID and password.

Use your Hoffman2 login ID and password when you use an ssh client on your local machine to connect to a login node, for example:
ssh -X login_id@hoffman2.idre.ucla.edu

(2) UCLA/UC Grid Portal username and password.

Use your UCLA/UC Grid username and password when you log in to the UCLA Grid Portal or the UC Grid Portal.

Your Hoffman2 login ID and password are independent of your Grid username and password.

There is only one Grid password, which is used by both the UCLA Grid Portal and the UC Grid Portal. If you request that the password you use for one of the grid portals be changed, you will have to use your new password when you log in to either grid portal.

These logins and passwords are independent of the UCLA Logon ID and password. You are sometimes asked to authenticate with your UCLA Logon ID and password when requesting services via the web.


I’m still having password problems

Please see How to Change your Hoffman2 Cluster password or shell. If that doesn’t fix it, please send email to accounts @ idre.ucla.edu


My program writes a lot of scratch files in my home directory. This results in exceeding my disk space quota. What is the solution?

There are several things you can do:

  • If you are a member of a research group which has contributed nodes to the Hoffman2 Cluster, your PI can purchase additional disk space for use by the members of your group.
  • Each process in your parallel program can write to the local /work on the node it is running on. When the program finishes, you can copy the files off to a place where you have more space. Since /work is local to the nodes, using it is very efficient.
  • You can write to /u/scratch and you have 7 days after the job completes to copy the files somewhere else.

How do I transfer my files from the Hoffman2 Cluster to my machine?

If the size of an individual file does not exceed 100 MB, you can download it to your local machine, or transfer it to another cluster that you can access at UCLA from the UCLA Grid Portal.

For any size file, you can use the scp command to transfer a file or directory from one machine or system to another. For safety reasons, as outlined in the Security Policy for IDRE-Hosted Clusters, always issue the scp command from your local machine, whether you are copying files to or from the IDRE-Hosted cluster. NEVER issue scp on the IDRE-Hosted cluster to connect back to your local machine.


Is there a simpler way to copy all my files to my new Hoffman2 account?

Once you have been notified that your login ID has been added to the Hoffman2 Cluster, log in to your local machine and from your local machine’s home directory enter the command:

        tar -clpzf - * | ssh loginid@hoffman2.idre.ucla.edu tar -xpzf -

Replace loginid with your Hoffman2 Cluster loginid.

Note that this transfer will not copy any of the hidden (dot) files from your local home directory to your new home directory on the Hoffman2 Cluster. Since many of the dot files in your home directory are operating system version specific, it would not be appropriate or useful to transfer these files.


An IDRE consultant sent me an email about a lot of left over jobs running under my userid. How do I delete them?

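You can get the process IDs using the ps command, filter them with grep to select only the jobs you want to delete, and feed the result to the kill command. The first command below only prints the matching process IDs so that you can review them; the second one kills them.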

ps -u loginid | grep myjob | awk '{print $1}' | xargs
ps -u loginid | grep myjob | awk '{print $1}' | xargs kill

Replace loginid with your loginid and myjob with the executable name.
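Because grep matches substrings, a pattern such as myjob also selects processes named myjob2 or old_myjob. A stricter filter can compare the command name exactly. The sketch below assumes the common four-column ps -u layout (PID, TTY, TIME, CMD); the function name pids_by_name is hypothetical.

```shell
# Hypothetical helper: read "ps -u loginid" output on stdin and print the
# PID (column 1) of every process whose command name (column 4) is exactly $1.
# Assumes the PID/TTY/TIME/CMD layout; the header line is skipped.
pids_by_name() {
    awk -v name="$1" 'NR > 1 && $4 == name { print $1 }'
}

# Possible use (review the list before piping it to kill):
#   ps -u loginid | pids_by_name myjob
#   ps -u loginid | pids_by_name myjob | xargs kill
```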


I have a lot of jobs in error state E. How do I find out what the problem is?

When the myjobs script or qstat -u loginid shows you have jobs in an error state (“E”, “Eqw”, etc.) you can use the error_reason script to show you why. It will print the error reason line from qstat -j jobid output for all of your jobs that are in an error state.

error_reason -u loginid

Replace loginid with your loginid.


How do I print my output?

There is no printer directly associated with the Hoffman2 Cluster. If you have a printer attached to your local desktop machine, you can copy your file to your local machine and print your file locally. Recall that for security reasons you should issue the scp command from your local machine, and not from the Hoffman2 command line.

Here is a little script that you could save on a unix/linux machine that might make printing a text file easier. You might name this script h2print

for f in "$@"; do
    scp "loginid@hoffman2.idre.ucla.edu:$f" .
    lpr "$(basename "$f")"
done

where loginid is your Hoffman2 Cluster login ID. You can omit loginid@ if your userid on your local machine is the same as your Hoffman2 Cluster login ID. Note the period (.) at the end of the scp command line. Mark the script as executable with the chmod command:

chmod +x h2print

To print a Hoffman2 text file in your home directory, from your local machine’s command prompt, enter:

h2print hoffman2_filename

where hoffman2_filename is the name of your text file on the Hoffman2 Cluster that you want to print.

The scp command will prompt you for your Hoffman2 Cluster password, unless you have previously set up an rsa key pair on your local machine with the ssh-keygen -t rsa command, and appended a copy of the public key (id_rsa.pub) to ~/.ssh/authorized_keys on your Hoffman2 Cluster account.


What queues can I run my jobs in?

The qquota command will tell you what resources available to your userid are in use at the moment that the qquota command was run. The purpose of qquota is not to provide a complete list of the resources available to your userid. If no resources are in use at the moment, qquota will not return any information.

For example:

resource quota rule limit                filter
--------------------------------------------------------------------------------
rulset1/10         slots=123/256        users @campus hosts @idre-amd_01g

“slots=123/256” means 123 slots or cores are in use by your group out of 256 of your group’s total allocation. Enter man qquota at the shell prompt for more information.
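To turn the used/limit pair into the number of slots still free, the field can be split with shell parameter expansion. This is a minimal sketch that assumes the field has exactly the slots=USED/LIMIT form shown in the example; the function name slots_free is hypothetical.

```shell
# Hypothetical helper: given a qquota limit field such as "slots=123/256",
# print how many slots remain (limit minus used).
slots_free() {
    local field=${1#slots=}   # drop the "slots=" prefix, leaving "123/256"
    local used=${field%/*}    # part before the slash
    local limit=${field#*/}   # part after the slash
    echo $(( limit - used ))
}

slots_free "slots=123/256"   # prints 133
```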


When will my job run?

The qstat command will list all the jobs which are running (r) or waiting to run (qw), in order by priority (“prior” column). If all jobs requested the same resources, this would also be the order in which they start running. In reality, some jobs will request more nodes or a longer run time which is not presently available, so the job scheduler will “back-fill” and try to start jobs which require fewer resources that will complete without slowing down the start time of a job higher in the list.

If you are in a research group which has purchased nodes for the Hoffman2 Cluster, you can use the highp complex to request that your job run on your group’s highp resources. It is guaranteed that some job submitted by someone in your research group will start within 24 hours. To see where your highp job is with respect to the waiting jobs that everyone else in your group has submitted, you can use the groupjobs script. It will display a list of pending jobs, or pending and running jobs, similar to regular qstat output but only for everyone in your resource group. The job at the top of the list will in most cases start running before those later in the list. For help and a list of options, enter groupjobs -h


What is my disk storage quota and usage?

From the UCLA Grid Portal, you can use its “Disk Usage on Hoffman2” application. Click:

Job Services
Applications
Disk Usage on Hoffman2
Submit Job button

You do not have to make any changes on the application form in order for it to report on your home directory usage. View your job results as usual. Click:

Job Services
Job Status

After your job has completed and its status is Done, click the Stdout link in the Output column for your job. Your request runs as a job on Hoffman2 and will send you standard Sun Grid Engine job status email.

From the Hoffman2 Cluster login nodes, at the shell prompt, enter:

myquota

The myquota command will report the usage and quota for filesystems where your userid has saved files, including /u/scratch as well as your home directory. Use the myquota command instead of the quota command. The myquota command supports the BlueArc storage system used by the Hoffman2 Cluster.


Re-compiling for CentOS 6?

The new OS includes a new version of the GNU compiler (gcc v. 4.4.4) and python (v. 2.6.5). Accordingly, any executable built against, or depending in any way on, gcc or python may need to be recompiled. Our default compiler is Intel, but if you depend on gcc be aware that we now support only version 4.4.4 (and the openmpi libraries version 1.4.4 built with this compiler). We also now support only python version 2.6.5, and most of the third-party extension packages are being recompiled accordingly. If you need a specific python module that is not present, let us know. We have attempted to keep the system as close as possible to what it was; however, you should expect some library dependencies to be broken, as most libraries have changed substantially in this new OS version.


Why did my highp job not start within 24 hours?

A highp job will start within 24 hours provided that your group does not overuse purchased resources (see also /computing/policies#highp).

The common reasons a highp job did not start in 24 hours are:

  • 1. You did not specify the highp option in your job script.
    Check your job script and look for a line that starts with #$ -l. highp should be one of the parameters. For example, the line should look like:
    #$ -l h_data=1G,h_rt=48:00:00,highp
  • 2. The pending job in question does not have highp option. (See below about how to check this.)
  • 3. Members of your group are already running long jobs on the purchased compute nodes.
    In this case, your highp job will be queued until resources become available. (You still need to add “highp” to the job script described above.)
  • 4. Your research group is not a Hoffman2 shared cluster program participant.
    Consider joining the program and enjoying the benefits.
  • 5. The product of h_data and the number of slots is greater than the per-node memory size of your group nodes.
    For example, you have h_data=8G and -pe shared 7. This means you are requesting a node with 56 GB (=8G*7) of memory. If each of your group’s nodes has, say, 32GB of memory, your highp job will not start.
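The arithmetic in reason 5 can be checked before submitting. The following is a rough sketch, assuming whole-GB values; the function name check_request is hypothetical, and the memory actually usable on a node may be somewhat less than its nominal size.

```shell
# Hypothetical helper: compare h_data * slots (GB) with the memory of one node.
check_request() {
    local h_data_gb=$1 slots=$2 node_gb=$3
    local total=$(( h_data_gb * slots ))
    if [ "$total" -le "$node_gb" ]; then
        echo "OK: ${total}G of ${node_gb}G"
    else
        echo "TOO LARGE: ${total}G of ${node_gb}G"
    fi
}

check_request 8 7 32   # the example above: prints "TOO LARGE: 56G of 32G"
check_request 4 7 32   # prints "OK: 28G of 32G"
```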

To check whether your pending job has the highp option, use the following commands and steps:

  • 1. Find out job ID (of the pending job):

    qstat -s p -u $USER

  • 2. Check if highp is specified for the job in question:

    qstat -j job_id | grep ^'hard resource_list' |grep highp

    If you see no output from the command above, the job does not have the highp option and you need to specify it. See below for how to use the qalter command to fix this.
    If you see something like:

    hard resource_list: h_data=1024M,h_rt=259200,highp=TRUE

    This means the job does have highp option specified.

To alter an already-pending job from non-highp to highp (without re-submitting it), use the following steps:

  • 1. Get the “hard resource” parameter list:

    qstat -j job_id | grep ^'hard resource_list'

    For example, you have hard resource_list: h_data=1024M,h_rt=259200
    You will use the list beyond the colon (“:”) in the “hard resource_list” output above in the next step.

  • 2. Add the highp option to the hard resource_list using the qalter command:

    qalter -l h_data=1024M,h_rt=259200,highp job_id

    where job_id is replaced by the actual job ID (number). For more information about qalter, try the command: man qalter .


How much virtual memory should I request in job submission?

It is important to request the correct amount of memory when submitting a job. If the request is too small, the job may be killed at run time due to memory overuse. If the request is too large (e.g. larger than the memory of the compute nodes you intend to run the job on), the job may not start.

The following are a few common techniques that can help you determine the virtual memory size of your program.

  1. If you have successfully run your job, run the command

    qacct -j job_ID

    Look for the maxvmem value. This is the virtual memory size that your program consumed as seen by the scheduler. Specify h_data so that (h_data)*(number of slots) is no less than this value. For example, if maxvmem shows 11 GB, you can request 12 GB of memory on a compute node to run the job, such as one of the following:

    • -l h_data=12G for a single-core run (if your program is sequential)
    • -l h_data=6G -pe shared 2 for a 2-core run (if your program is shared-memory parallel)
    • -l h_data=2G -pe shared 6 for a 6-core run (if your program is shared-memory parallel)

    Note that for this example, the product of (h_data)*(number of slots) is always 12GB. If you specify -l h_data=12G -pe shared 6, you are actually requesting 12GB*6=72GB of memory on a node. Such a job may not start at all because there are no such nodes for common use. (There are a few high-memory nodes, but their use requires other arrangements.) Note: If you are running multiple slots on a node, (h_data)*(number of slots) needs to be smaller than the total memory size of your nodes.

  2. If you are not sure about the virtual memory size, run your program in “exclusive” mode first. Once done, use Method 1 above to determine the virtual memory size. To submit a job in exclusive mode, qsub the job with the command

    qsub -l exclusive your_job_script

    where your_job_script is replaced by the actual file name of your job script. In this case, you should also specify h_data for node selection purposes. If you are running a sequential or shared-memory parallel program (i.e. using only one compute node), we recommend using h_data=32G without specifying the number of slots. You can also append the exclusive option to the line starting with “#$ -l” in your job script, e.g.

    #$ -l h_rt=24:00:00,h_data=32G,exclusive

    Again, if your program is sequential or shared-memory parallel, DO NOT specify the number of slots (i.e. there should be no “-pe” option in your job script or command line); otherwise you may over-request memory and prevent the job from starting.
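The division described in Method 1 can be written as a small helper that turns a total memory target and a slot count into request options. This is a sketch under stated assumptions: whole-GB values, rounding the per-slot value up, and a hypothetical function name mem_request.

```shell
# Hypothetical helper: given a total memory target in GB and a slot count,
# print request options whose h_data * slots is at least the target.
# Ceiling division keeps the total from falling below the target.
mem_request() {
    local total_gb=$1 slots=$2
    local per_slot=$(( (total_gb + slots - 1) / slots ))
    if [ "$slots" -eq 1 ]; then
        echo "-l h_data=${per_slot}G"                      # sequential: no -pe option
    else
        echo "-l h_data=${per_slot}G -pe shared ${slots}"
    fi
}

# The 12 GB example from Method 1, on 1, 2 and 6 slots.
for n in 1 2 6; do
    mem_request 12 "$n"
done
```

The loop prints the same three requests listed under Method 1.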


How do I pack multiple job-array tasks into one run?

Using a job array is a way to submit a large number of similar jobs. In some cases each task takes only a few minutes to compute. Running a large number of extremely short jobs through the scheduler is very inefficient: the system is likely to spend more time finding a node and sending jobs in and out than doing the actual computing. With a simple change to your job script, you can pack multiple job-array tasks into one run (or dispatch), so you benefit from the convenience of job arrays and at the same time use the computing resources efficiently.

If you run too many short jobs (e.g. more than 200 less-than-3-minute jobs within an hour), your other pending jobs may be temporarily throttled. Please understand that this is a way to ensure the scheduler’s normal operation and is not intended to inconvenience users.

At run time, the environment variable $SGE_TASK_ID uniquely identifies a task. The main ideas for packing multiple tasks into one run with minimal changes to your job script are to:

  • 1. change the job task step size.
  • 2. create a loop inside the job script to execute multiple tasks (equal to the ‘step size’).

Of course, you may need to adjust h_rt to allocate sufficient wall-clock time to run the ‘packed’ version of job script.

csh/tcsh example

Your original job script looks like

#!/bin/csh
...
#$ -t 1-2000
...
./a.out $SGE_TASK_ID ...

To pack, say, 100 tasks into one run, change your job script to:

#!/bin/csh
...
#$ -t 1-2000:100
...
foreach i (`seq 0 99`)
   @ my_task_id = $SGE_TASK_ID + $i
   ./a.out $my_task_id ...
end

bash/sh example

Your original job script looks like

#!/bin/bash
...
#$ -t 1-2000
...
./a.out $SGE_TASK_ID ...

To pack, say, 100 tasks into one run, change your job script to:

#!/bin/bash
...
#$ -t 1-2000:100
...
for i in `seq 0 99`; do
   my_task_id=$((SGE_TASK_ID + i))
   ./a.out $my_task_id ...
done
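The loops above assume the task range divides evenly by the step size, as 1-2000:100 does. When it does not (for example 1-250:100), the last dispatch would run past the end of the range. The sketch below stops at the upper bound; the function name packed_task_ids is hypothetical, while SGE_TASK_STEPSIZE and SGE_TASK_LAST are environment variables that Grid Engine sets for array tasks.

```shell
# Hypothetical helper: print the task IDs one dispatch should run, starting
# at first_id, covering at most step IDs, and never exceeding last_id.
packed_task_ids() {
    local id=$1 step=$2 last=$3
    local end=$(( id + step - 1 ))
    if [ "$end" -gt "$last" ]; then
        end=$last
    fi
    while [ "$id" -le "$end" ]; do
        echo "$id"
        id=$(( id + 1 ))
    done
}

# Possible use inside a bash job script submitted with "#$ -t 1-250:100":
#   for my_task_id in $(packed_task_ids "$SGE_TASK_ID" "$SGE_TASK_STEPSIZE" "$SGE_TASK_LAST"); do
#       ./a.out $my_task_id ...
#   done
```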

How do I request large memory to run a sequential (1-core) program?

If you are requesting less than 64 GB, use h_data to specify the requested memory size, e.g.

qsub -l h_data=32G ...

You can also put -l h_data=32G in your job script file.

In this case, you are requesting a single core (slot), so you should not specify any -pe option.

If you are requesting more than 64GB, please contact us.


How do I request large memory to run a multi-threaded (single node) program?

You will use -pe (number of cores) and -l h_data (memory per core) together to specify the total amount of memory you want. Note that the product of (number of cores)*(h_data) must be smaller than the total memory of a compute node, otherwise your job will never start.

For example, request 8 cores with 32G total memory (shared by all 8 cores):

qsub -l h_data=4G -pe shared 8 ...

If your multi-threaded program will automatically use all CPUs available on the node, add the -l exclusive option, e.g.

qsub -l h_data=4G,exclusive -pe shared 8 ...

You can also put -pe shared 8 -l h_data=4G in your job script file.

If you are requesting more than 64GB total memory, please contact us.


How to load certain applications in your path / How to set up your environment

In Unix-like systems the process that interacts with a user (or a user command), called the shell, maintains a list of variables, called environment variables, and their values. For example, in order for the shell to find an executable, users should add its path to their $PATH variable.

Users can permanently add certain values to their shell environment variables by editing their shell customization files (such as: .bash_profile, .profile, etc.) located in their $HOME directories.

Alternatively, Hoffman2 users can dynamically change their shell environment using the environmental modules utility.


How to use environmental modules interactively

Users can load a certain application/compiler in their environment (e.g.: $PATH, $LD_LIBRARY_PATH, etc.) by issuing the command:

module load application/compiler

where application/compiler is the name of modulefile relative to the application/compiler (for example: matlab, intel, etc.).

To see a list of available modulefiles relative to applications/compilers users should issue the command:

module av

To learn about the application/compiler loaded by a certain module, issue:

module whatis application/compiler

or:

module help application/compiler

To see how a module for a certain application/compiler will modify the user’s environment, issue:

module show application/compiler

To check which modules are loaded, issue:

module li

To unload a certain application/compiler previously loaded from one’s environment, issue:

module unload application/compiler

For a full list of module commands, issue:

module help

Users are encouraged to write their own modulefiles to load their own applications; you can learn how to do so here.


Before June 19, 2013, my home directory used to be in a different place. How can I get my old files?

During the Summer Maintenance June 17-19, 2013, users whose resource groups have purchased additional storage had their home directories relocated to the current /u/home/u/username directory structure. There is a convenient symlink called “project” in your current home directory (e.g. /u/home/u/username/project), pointing to your old home directory prior to June 19, 2013.

Unlike your purchased storage, your current home directory has a quota of 20GB. It is a place to store your source code, papers, etc, but probably not suitable for large data sets. Big files should go either in ~/project or in $SCRATCH (periodically purged).

To go to your old home directory,

cd ~/project

You can copy the “dot files” (e.g. for shell initialization) from ~/project to the current home directory, e.g.

cp ~/project/.bashrc ~
cp ~/project/.profile ~

Or copy directories from ~/project to the current home directory, e.g.

cp -r ~/project/.gnome/ ~

You’ll want to review your job submission scripts and make sure all the directory paths are pointing to the right locations. Depending on the file and the complexity, you’ll either edit them in place, or rebuild them. You may rename or delete the convenient symlink “project” if you wish. For more information, see Data Storage.


Default environment before October 1st, 2013

The default compiler and message passing interface library before October 1st, 2013 consisted of:

  • Intel compiler version 11.1
  • OpenMPI version 1.4

This version of the Intel compiler and MPI library can be set in one’s environment by loading the modulefiles:

module load intel/11.1
module load openmpi/1.4

(see also how to use environmental modules in batch jobs).

Any application compiled before October 1st, 2013 using the default version of the Intel compiler and of the MPI library will still run on the cluster provided that the intel/11.1 module (and where needed the openmpi/1.4 module) are loaded into the user environment. To submit batch jobs of parallel applications compiled before October 1st, 2013 users should use the openmpi.q queue script.

Applications and libraries available under intel/11.1 and openmpi/1.4 hierarchies are still available and can be loaded via their modulefiles after the intel/11.1 and openmpi/1.4 modules are loaded.


Default environment after October 1st, 2013

The default compiler and message passing interface library after October 1st, 2013 consist of:

  • Intel compiler version 13.1
  • IntelMPI version 4.1

This version of the Intel compiler and MPI library can be set in one’s environment by loading the modulefile:

module load intel/13.cs

Note that the module intel/13.cs will load the Intel Cluster Studio which includes the IntelMPI library (no separate module needs to be loaded for this version of MPI).

To submit batch jobs of any parallel application compiled with the default Intel compiler and the default MPI library after October 1st, 2013, users should use the intelmpi.q queue script.


How to load in your environment the previously used Intel compiler

To load the Intel version 11.1 and OpenMPI version 1.4 in your environment issue the commands:

module load intel/11.1
module load openmpi/1.4

(see also how to use environmental modules in batch jobs).


Why can’t I submit a large number of individual jobs?

When there are too many pending jobs, the scheduler may fail to process all of them, causing scheduling problems. Therefore, to maintain stability, the system has a limit on how many jobs a user can submit. This limit is usually in the hundreds, and may vary depending on the system’s load.

Most users who submit a huge number of individual jobs should consider using job arrays for one obvious benefit: one job-array job can hold thousands of “tasks” (or individual “runs”) and counts as only one (1) job against the user’s job limit. A user can then submit hundreds of job arrays (each containing thousands of “runs”). This is usually enough to cover some of the largest throughput runs on the cluster.

If each individual task is very short (e.g. finishes in a few minutes), users should pack several tasks into one run to increase throughput efficiency. See this FAQ for more details. Running a large number of short jobs is a severe waste of the cluster’s computing power.

For more information about job arrays, see this page.


How can I acknowledge Hoffman2 Shared Cluster in my presentations or publications?

See this page.


Will the UCLA Grid Portal be taken down in the near future?

The UCLA Grid Portal software is no longer under active development. Although it is working fine, we do not expect the software underlying the Grid Portal to remain compatible for long with the latest versions of the essential software packages and operating system needed to run the portal. Therefore, this service will be discontinued in the near future. IDRE is working on alternative solutions to provide similar services.


What is “fork: retry: Resource temporarily unavailable”?

On the login nodes each user account is limited to a certain number of running processes. The message “… fork: retry: Resource temporarily unavailable” means you have reached this limit. To avoid this, get an interactive session on a compute node with the desired resources in terms of memory and/or processors, which is not subject to this limit, and do your work there. See also role of the login nodes.


After loading Intel 13.cs module, why is mpicc/mpicxx/mpif90 still not using Intel compiler?

Please note that the Intel MPI compiler wrappers have unconventional names. After loading the Intel 13.cs module (module load intel/13.cs), in order to use the Intel compiler to compile MPI programs, you must use

  • mpiicc for C programs
  • mpiicpc for C++ programs
  • mpiifort for Fortran programs

If you use the conventional mpicc, mpicxx or mpif90, GNU compilers are used. See Intel MPI compilers for details.


When submitting a job, I get “Unable to run job: got no response from JSV script…”.

This could happen when the scheduler (software) is too busy handling jobs. One way to overcome this problem is to add the following line at the bottom of your ~/.bashrc to increase the default timeout limit:

export SGE_JSV_TIMEOUT=60

Then run “source ~/.bashrc” (or just log out and log in), and try to submit your job again.



How to download SGE job scripts?

Go to this page.

UCLA OIT

© 2016 UC REGENTS TERMS OF USE & PRIVACY POLICY