Frequently asked questions¶

Frequently asked questions and their answers are organized by the following categories:

Getting help
Accounts
Acknowledging the Hoffman2 Cluster
Applications, compilers and libraries
Connecting, Authentication, SSH public-keys
Data transfers
Job Errors/ Job Scheduler
Storage and File systems
Other

Getting help¶

Note

If you do not find answers to your question or issue, please open a support ticket via our online help desk.

For faster resolution, please provide helpful details, e.g. your username, the relevant files and directories and whether you grant technicians access to any of them, any job scripts or job IDs affected, and the steps necessary to reproduce the issue.

Accounts¶

Accounts are created and managed via the Hoffman2 Cluster System Account Management (SIM) at: https://sim.idre.ucla.edu.

How do I create an account?¶

Visit our Requesting an account section to make sure you qualify. You can then request an account via our SIM at: https://sim.idre.ucla.edu.

What is the status of my user account application?¶

Most likely your application is pending sponsor approval. You may want to consider asking your sponsor to approve your application by logging into the SIM new account page at: https://sim.idre.ucla.edu/sim/account/new.

I no longer need my user account. What should I do?¶

If you no longer need your user account, you can send an email to accounts@idre.ucla.edu requesting its deletion.

Questions or comments? Visit our support online help desk at: https://support.idre.ucla.edu.

My SSH client says: Permission denied, please try again¶

In the example below user joebruin is having a problem connecting via ssh to the Hoffman2 Cluster:

$ ssh hoffman2.idre.ucla.edu -l joebruin
joebruin@hoffman2.idre.ucla.edu's password:
Permission denied, please try again.

A permission denied warning could be due to several reasons:

Verify you are using your cluster username (Hoffman2 Cluster usernames are limited to 8-characters) and password

You can check your username and change your password by logging into the My Account page of SIM at: https://sim.idre.ucla.edu/sim/account/view

To change your password follow the link Change the password for <USERNAME> on the H2 cluster on the My Account page of SIM.

The system may not be accepting logins due to a scheduled maintenance (check your email for maintenance notification or https://www.hoffman2.idre.ucla.edu/).
If you continue to have problems, please submit a support ticket on our online help desk at: https://support.idre.ucla.edu/helpdesk/.


Getting access to project folders in a different research group¶

In order to get access to another research group’s purchased project storage volume, your cluster account will need to be a member of their Unix group. Please open a support ticket via our online help desk at: https://support.idre.ucla.edu and include your cluster username and the full path to the project folder to request access.

I would like to collaborate with another Hoffman2 user. How can I share data with them?¶

Warning

We actively monitor and make sure that $HOME directories are only accessible by the users who own them (this is done to prevent data loss). Data sharing is therefore only possible in $SCRATCH or on specifically created directories in group owned project space. We discourage giving access to any user of the cluster and we support data sharing across users who belong to a common Unix group (which may be specifically created for the purpose of data/applications sharing).

Users on the cluster are organized into groups. Every user belongs to a primary group and may be in several secondary groups. You can see the list of groups you belong to with the groups command. For example, if your username were joebruin you would query the groups you are part of with:

joebruin@login2:~$ groups joebruin
joebruin : web gpu data

The command groups joebruin returns two sets of values separated by a colon (:). On the left is the username (joebruin) and on the right are web, gpu and data. web is the primary Unix group user joebruin belongs to, and gpu and data are secondary Unix groups of which joebruin is a member.

In order to share data across different users on the Hoffman2 Cluster, users should belong to a common Unix group. A user may be part of several Unix groups but a file or a directory can be owned by only one owner and one group. Group membership can give you access to files and directories belonging to that group if the owner has allowed group access. For example, user joebruin can give access to user sambruin provided that the latter has at least one secondary Unix group in common with joebruin. The users can check whether they belong to a common group using the groups command:

joebruin@login2:~$ groups joebruin sambruin
joebruin : web gpu data
sambruin : acct data
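The comparison above can also be scripted. The sketch below is only an illustration: it hard-codes the two group lists from the example output (on the cluster they could be filled in from `id -Gn joebruin` and `id -Gn sambruin`), and the variable names are hypothetical.

```shell
# Group lists copied from the example output above (assumed values).
joe_groups="web gpu data"
sam_groups="acct data"

# Collect every group name that appears in both lists.
common=""
for g in $joe_groups; do
  for s in $sam_groups; do
    if [ "$g" = "$s" ]; then
      common="$common $g"
    fi
  done
done
common=${common# }   # drop the leading space

echo "groups in common: $common"
```

An empty result would mean the two users have no Unix group in common and cannot share files through group permissions yet.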

Changing file ownership via chown¶

In this example data is a group both users are members of, so files or directories can be shared as long as they are owned by the group data and the permissions on the files/directories are set to give the group members access. Group ownership does not imply group access; you must set the file access permissions so your group can use the files.

For example, if a shared directory exists that is accessible to all the users in the group data, user joebruin can share a file in this directory with all the members of the group (and therefore also user sambruin) by changing the file group ownership to data via the chown command:

joebruin@login2:~$ chown joebruin:data myfile.dat

Changing file access via chmod¶

If a file is group-owned by data, its owner can give read access to every other member of the group data via the command chmod:

joebruin@login2:~$ chmod g+r myfile.dat
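To see what `chmod g+r` changes, the following sketch works on a throwaway file in a temporary directory instead of real data. The paths and variable names are hypothetical, and `stat -c` assumes GNU coreutils as found on Linux systems.

```shell
# Create a scratch file that only its owner can read and write.
demo_dir=$(mktemp -d)
demo_file="$demo_dir/myfile.dat"
touch "$demo_file"
chmod 600 "$demo_file"            # -rw-------

# Add read permission for the owning group.
chmod g+r "$demo_file"            # -rw-r-----

# Print the resulting mode in octal (GNU stat).
file_mode=$(stat -c '%a' "$demo_file")
echo "mode is now $file_mode"
```

Members of the owning group can now read the file, provided they can also traverse the directory that contains it.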


Acknowledging the Hoffman2 Cluster¶

How can I acknowledge Hoffman2 Shared Cluster in my presentations or publications?¶

When publishing results from work carried out partially or exclusively on the Hoffman2 Cluster, we appreciate you acknowledging us as follows:

"This work used computational and storage services associated with the Hoffman2 Shared Cluster provided by UCLA Office of Advanced Research Computing’s Research Technology Group."

Applications, compilers and libraries¶

How to load certain applications in your path / How to set up your environment¶

In Unix-like systems, the process that interacts with a user (or runs a user command), called the shell, maintains a list of variables, called environment variables, and their values. For example, for the shell to find an executable by name, the directory containing it must be listed in the $PATH variable.

Users can permanently add certain values to their shell environmental variables by editing their shell initialization files (such as: .bash_profile, .profile, etc.) located in their $HOME directories.
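As an illustration, a line such as the following could go in a bash initialization file. It is a minimal sketch: the directory `$HOME/apps/bin` is a hypothetical location, and the `case` test is just one way to avoid adding the same entry twice when the file is sourced repeatedly.

```shell
# Hypothetical directory holding your own executables.
my_bin="$HOME/apps/bin"

# Prepend it to PATH only if it is not already listed.
case ":$PATH:" in
  *":$my_bin:"*) ;;                  # already present, nothing to do
  *) PATH="$my_bin:$PATH" ;;
esac
export PATH
```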

Alternatively Hoffman2 users can dynamically change their shell environment using the environmental modules utility.

How to use environmental modules interactively¶

Users can load a certain application/compiler in their environment (e.g.: $PATH, $LD_LIBRARY_PATH, etc.) by issuing the command:

$ module load application/compiler

where application/compiler is the name of modulefile relative to the application/compiler (for example: matlab, intel, etc.).

To see a list of available modulefiles relative to applications/compilers users should issue the command:

$ module avail

to learn about the application/compiler loaded by a certain module issue:

$ module whatis application/compiler

or:

$ module help application/compiler

to see how a module for a certain application/compiler will modify the user’s environment issue:

$ module show application/compiler

to check which modules are loaded issue:

$ module list

to unload a certain application/compiler previously loaded from one’s environment issue:

$ module unload application/compiler

for a full list of module commands issue:

$ module help

Users are encouraged to write their own modulefiles to load their own applications. You can learn how to do so here.

After loading the Intel compiler module, why is mpicc/mpicxx/mpif90 still not using Intel compiler?¶

The Intel MPI compiler wrappers have unconventional names. After loading the Intel compiler module, use:

mpiicc for C programs
mpiicpc for C++ programs
mpiifort for Fortran programs


Connecting, Authentication, SSH public-keys¶

Connecting for the first time¶

As all connections are based on a secure protocol, the first time you connect from a local computer to the Hoffman2 Cluster you will be prompted with a message similar to:

The authenticity of host 'HOSTNAME (HOST IP)' can't be established.
ED25519 key fingerprint is SHA256:lZdo2eNOmwgroOyCOXXFFdQjfQQA1vMpBxgwhGwirwY.
Are you sure you want to continue connecting (yes/no)?

Where HOSTNAME and HOST IP are the hostname and IP address of the various classes of public hosts.

Warning

Only proceed to connect if the ED25519 key fingerprint displayed in the prompted message corresponds to one of the ED25519 fingerprints listed in the Hoffman2 Cluster Public hosts hostkey fingerprints section.

If the ED25519 fingerprint displayed by your SSH client does not match one of the ED25519 fingerprints above for the Hoffman2 Cluster public hosts, when attempting to connect you will get a message similar to:

@@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the ED25519 host key has just been changed.

in this case, do not continue authentication; instead, contact us here or by email at: support@idre.ucla.edu.

Problems with this answer? Please send comments here.

Public hosts hostkey fingerprints¶

The Hoffman2 Cluster has the following classes of public, or external facing, hosts that are used to connect or to transfer data to and from the cluster:

Class	Hostname
Login nodes	`hoffman2.idre.ucla.edu`
Data transfer nodes	`dtn.hoffman2.idre.ucla.edu`
NX nodes	`nx.hoffman2.idre.ucla.edu`
x2go nodes	`x2go.hoffman2.idre.ucla.edu`

All public facing hosts have the following hostkey fingerprint:

ED25519 MD5-hex: a4:eb:80:cd:84:8d:e3:69:62:a2:4a:3c:7b:f6:6d:f7
ED25519 SHA1-base64: /vL4oZulkOMuLnA1hd0EGZx0GcI
ED25519 SHA256-base64: lZdo2eNOmwgroOyCOXXFFdQjfQQA1vMpBxgwhGwirwY
ED25519 MD5-hex (no separators): a4eb80cd848de36962a24a3c7bf66df7
ED25519 SHA1-hex: fef2f8a19ba590e32e2e703585dd04199c7419c2
ED25519 SHA256-hex: 959768d9e34e9b082ba0ec823975c515d4237d0400d6f329071830846c22af06

RSA MD5-hex: 3c:9c:67:d8:c5:a4:ae:77:07:5f:10:2f:20:4a:75:0f
RSA SHA1-base64: t+AS3JPkPxJvcsD7z63Vekcamt8
RSA SHA256-base64: kah9BJwSzrlFnVp9Tg+El2IdcCN7JgN5+Ur2RyIdvwM
RSA MD5-hex (no separators): 3c9c67d8c5a4ae77075f102f204a750f
RSA SHA1-hex: b7e012dc93e43f126f72c0fbcfadd57a471a9adf
RSA SHA256-hex: 91a87d049c12ceb9459d5a7d4e0f8497621d70237b260379f94af647221dbf03

Even though all of our public, external-facing hosts use the same ED25519 (or RSA) public hostkey, depending on the software package you use to connect to the cluster, that public key can be represented with any one of the different fingerprint hashes given in the table Hoffman2 public hostkey fingerprints.
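The short and long SHA256 strings listed above are two encodings of the same digest: the 43-character form is base64 (as printed by OpenSSH) and the 64-character form is hexadecimal. The sketch below, which assumes the GNU `base64` and `od` utilities and uses hypothetical variable names, decodes the ED25519 base64 string and compares it with the listed hex string.

```shell
# ED25519 SHA256 fingerprint strings copied from the list above.
sha256_b64="lZdo2eNOmwgroOyCOXXFFdQjfQQA1vMpBxgwhGwirwY"
sha256_hex_listed="959768d9e34e9b082ba0ec823975c515d4237d0400d6f329071830846c22af06"

# OpenSSH drops the base64 padding; a 32-byte digest needs one "=" restored.
# Decode to raw bytes, dump them as hex, and strip spaces and newlines.
sha256_hex_decoded=$(printf '%s=' "$sha256_b64" | base64 -d | od -An -v -tx1 | tr -d ' \n')

if [ "$sha256_hex_decoded" = "$sha256_hex_listed" ]; then
  echo "both strings describe the same host key"
else
  echo "mismatch"
fi
```

The same conversion can help when a client displays a fingerprint in a form different from the one you have on hand.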

Warning

If the fingerprint hash doesn’t match one listed above, do not continue authentication and contact us here or by email at: support@idre.ucla.edu.


Set-up SSH public-key authentication¶

Using SSH public-key authentication to connect to a remote system is a robust, more secure alternative to logging in with an account password. SSH public-key authentication relies on asymmetric cryptographic algorithms that generate a pair of separate keys (a key pair), one “private” and the other “public”. You keep the private key on the local computer you use to connect to the remote system. The public key is “public” and can be stored on each remote system in the ~/.ssh/authorized_keys file.

Note

You need to be able to transfer your public key to the Hoffman2 Cluster. Therefore, you must be able to login with your password to add the public key to your ~/.ssh/authorized_keys file in your home directory.

To set-up public-key authentication via SSH on macOS and Linux:

Use the terminal application to generate a key pair using the RSA algorithm.

To generate RSA keys, at the prompt, enter:

$ ssh-keygen -t rsa

You will be prompted to supply a filename (for saving the key pair) and a passphrase (for protecting your private key):

filename Press enter to accept the default filename (id_rsa)

passphrase Enter a passphrase to protect your private key

Warning

If you don’t passphrase protect your private key, anyone with access to your computer can SSH (without being prompted for the passphrase) to your account on any remote system that has the corresponding public key.

Your private key will be generated using the default filename (id_rsa) or the filename you specified and stored on your local computer in your home directory, in a subdirectory named, .ssh.

The public key will be generated using the same filename (id_rsa.pub), but will have a .pub extension added to it. The public key file is stored in the same location (~/.ssh/).

Now the public key needs to be transferred to the remote system (Hoffman2 Cluster). You can use the program ssh-copy-id or scp to copy the public key file to the remote system. It’s preferable to use ssh-copy-id because the contents of the public key are added directly to your ~/.ssh/authorized_keys file. If you use scp, then you will need to connect to the remote computer and append the contents of id_rsa.pub to the authorized_keys file manually. You will be prompted for your account password to complete the copy to the remote system.

transfer public key via ssh-copy-id

$ ssh-copy-id -i ~/.ssh/id_rsa.pub login_id@hoffman2.idre.ucla.edu

… where login_id is replaced with your cluster username.

transfer public key via scp

$ scp ~/.ssh/id_rsa.pub login_id@hoffman2.idre.ucla.edu:
$ ssh login_id@hoffman2.idre.ucla.edu
$ cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
$ rm ~/id_rsa.pub

You should be able to SSH to your Hoffman2 Cluster user account from your local computer with the private key. Replace joebruin with your cluster username.

[joebruin@macintosh ~]$ ssh joebruin@hoffman2.idre.ucla.edu
Enter passphrase for key '/Users/joebruin/.ssh/id_rsa':
Last login: Mon Jul 10 06:01:17 2020 from vpn.ucla.edu

SSH public-key authentication not working?¶

Please verify the file permissions. Typically, you will want:

$HOME/.ssh directory to be 700 (drwx------)
public key ($HOME/.ssh/id_rsa.pub) to be 644 (-rw-r--r--)
private key ($HOME/.ssh/id_rsa) to be 600 (-rw-------)
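The commands that produce these modes can be sketched as follows. To stay harmless, the example operates on a stand-in directory created with `mktemp` and on empty placeholder files; on a real system the same `chmod` calls would target `$HOME/.ssh`. The variable names are hypothetical and `stat -c` assumes GNU coreutils.

```shell
# Stand-in for $HOME/.ssh with empty placeholder key files.
ssh_dir="$(mktemp -d)/.ssh"
mkdir -p "$ssh_dir"
touch "$ssh_dir/id_rsa" "$ssh_dir/id_rsa.pub"

chmod 700 "$ssh_dir"              # drwx------
chmod 644 "$ssh_dir/id_rsa.pub"   # -rw-r--r--
chmod 600 "$ssh_dir/id_rsa"       # -rw-------

# Read the modes back in octal to confirm them (GNU stat).
dir_mode=$(stat -c '%a' "$ssh_dir")
pub_mode=$(stat -c '%a' "$ssh_dir/id_rsa.pub")
key_mode=$(stat -c '%a' "$ssh_dir/id_rsa")
echo "$dir_mode $pub_mode $key_mode"
```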


Data transfers¶

When I Log in to the Globus web application, I get a “Missing Identity Information” error¶

Missing Identity Information - Unable to complete the authentication process. Your identity provider did not release the attribute(s): {{eppn}}

To resolve this issue with missing attributes not being released to 3rd parties, you will need to contact the UCLA IT Support Center. UCLA Logon ID is not a service of the IDRE Research Technology Group.

Job Errors/ Job Scheduler¶

An IDRE consultant sent me an email about a lot of left over jobs running under my userid. How do I delete them?¶

You can get the process IDs using the ps command, filter them with the grep command to select only the processes you want to delete, and feed the result to the kill command.

To list the processes, use the command:

$ ps -u loginid | grep myjob | awk '{print $1}'

To kill the processes, use the command:

$ ps -u loginid | grep myjob | awk '{print $1}' | xargs kill

Replace loginid with your username and myjob with the executable name (e.g. bash or python).


In an interactive session (via qrsh), I am getting “not enough memory” error messages and my application is terminated abruptly. Why?¶

When issuing the qrsh command, one must specify the memory size via -l h_data, which is also imposed as the virtual memory limit for the qrsh session. If the application (e.g. matlab) exceeds this limit, it will be automatically terminated by the scheduler. Each application has a different error message, but usually it contains key words like “not enough memory”, “increase your virtual memory”, “cannot allocate memory” or something similar. In this case, you will have to re-run the qrsh command with an increased h_data value.

Please also note that requesting an excessive amount of h_data might cause the qrsh to wait for a long time, or even fail to start, because fewer and fewer compute nodes can meet your criterion as you increase the h_data value. If this is your first time launching the application in a qrsh session, we recommend gradually increasing h_data until the application runs successfully.

Why is my job still waiting in the queue?¶

The following factors may contribute to longer wait time, or jobs not starting (depending on your account’s access level):

Larger memory request, requested with: h_data or the product of (h_data)*(-pe shared #) is large
Longer run time, requested with: h_rt
Specific CPU model, requested with: arch
Many CPUs, requested with: -pe dc* #
You are already running on some numbers of CPUs or nodes
For high priority jobs, requested with: -l highp
- Your group members are already running on your purchased nodes; there are not enough left for your job to start
- Your request exceeds what your group nodes have
Hoffman2 cluster’s load


I have a lot of jobs in error state E. How do I find out what the problem is?¶

When the myjobs script or qstat -u $USER shows you have jobs in an error state (“E”, “Eqw”, etc.) you can use the error_reason script to show you why. It will print the error reason line from qstat -j jobid output for all of your jobs that are in an error state.

$ error_reason -u loginid

Replace loginid with your username.

What queues can I run my jobs in?¶

The qquota command will tell you what resources available to your userid are in use at the moment that the qquota command was run. The purpose of qquota is not to provide a complete list of the resources available to your userid. If no resources are in use at the moment, qquota will not return any information.

For example:

resource quota           rule limit           filter
rulset1/10         slots=123/256        users @campus hosts @idre-amd_01g

where slots=123/256 means 123 slots (cores) are in use by your group out of your group’s total allocation of 256. Enter man qquota at the shell prompt for more information.
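To turn a limit field like the one in the example into a count of free slots, plain shell parameter expansion is enough. This is a sketch with hypothetical variable names; the field value is copied by hand from the example output rather than read from qquota.

```shell
# Limit field copied from the example qquota output above.
quota_field="slots=123/256"

usage=${quota_field#slots=}   # "123/256"
used=${usage%/*}              # text before the slash
limit=${usage#*/}             # text after the slash
free=$((limit - used))

echo "$used of $limit slots in use, $free free"
```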

When will my job run?¶

The qstat command will list all the jobs which are running (r) or waiting to run (qw), in order by priority (“prior” column). If all jobs requested the same resources, this would also be the order in which they start running. In reality, some jobs will request more nodes or a longer run time which is not presently available, so the job scheduler will “back-fill” and try to start jobs which require fewer resources that will complete without slowing down the start time of a job higher in the list.

If you are in a research group which has purchased nodes for the Hoffman2 Cluster, you can use the highp complex to request that your job run on your group’s purchased nodes. It is guaranteed that some job submitted by someone in your research group will start within 24 hours. To see where your highp job is with respect to the waiting jobs that everyone else in your group has submitted, you can use the groupjobs script. It will display a list of pending jobs, or pending and running jobs, similar to regular qstat output but only for everyone in your resource group. The job at the top of the list will in most cases start running before those later in the list. For help and a list of options, enter groupjobs -h.


Why did my `highp` job not start within 24 hours?¶

A highp job will start in 24 hours provided that your group does not overuse purchased resources.

The common reasons a highp job did not start in 24 hours are:

You did not specify the highp option in your job script.

Check your job script and look for a line that starts with #$ -l; highp should be one of the parameters. For example, the line should look like:
```
#$ -l h_data=1G,h_rt=48:00:00,highp
```
The pending job in question does not have the highp option. (See below about how to check this.)
Members of your group are already running long jobs on the purchased compute nodes. In this case, your highp job will be queued until resources become available. (You still need to add highp to the job script as described above.)
Your research group is not a Hoffman2 shared cluster program participant. Consider joining the program and enjoying the benefits.
The product of h_data and the number of slots is greater than the per-node memory size of your group nodes.

For example, you have h_data=8G and -pe shared 7. This means you are requesting a node with 56 GB (=8G*7) of memory. If each of your group’s nodes has, say, 32GB of memory, your highp job will not start.
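The arithmetic behind this check can be sketched in a few lines of shell. The numbers are the ones from the example above; the variable names are hypothetical, and the per-node memory has to come from what you know about your group's nodes.

```shell
h_data_gb=8        # value given to -l h_data, in GB
slots=7            # value given to -pe shared
node_mem_gb=32     # memory of one group node, in GB (assumed known)

# The scheduler needs h_data * slots on a single node.
requested_gb=$((h_data_gb * slots))

if [ "$requested_gb" -gt "$node_mem_gb" ]; then
  verdict="too large"
else
  verdict="fits"
fi
echo "request of ${requested_gb} GB on a ${node_mem_gb} GB node: $verdict"
```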

To check whether your pending job has the highp option, use the following commands and steps:

Find out job ID (of the pending job):
```
$ qstat -s p -u $USER
```

Check if highp is specified for the job in question:

$ qstat -j job_id | grep ^'hard resource_list' |grep highp

If you see no output from the command above, it means that job does not have highp option. You need to specify highp. See below about how to use qalter command to fix this.

If you see something like:

hard resource_list: h_data=1024M,h_rt=259200,highp=TRUE

This means the job does have highp option specified.
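The resource line reports h_rt in seconds, which is easy to misread. The following sketch parses a line like the one shown above (pasted in as a string, not obtained from qstat) and converts the run time to hours; the variable names are hypothetical.

```shell
# Resource line copied from the example output above.
resource_line="hard resource_list: h_data=1024M,h_rt=259200,highp=TRUE"

# Is highp set?
case "$resource_line" in
  *highp=TRUE*) has_highp=yes ;;
  *)            has_highp=no ;;
esac

# Extract the h_rt value (seconds) and convert it to whole hours.
h_rt_seconds=$(printf '%s\n' "$resource_line" | sed -n 's/.*h_rt=\([0-9]*\).*/\1/p')
h_rt_hours=$((h_rt_seconds / 3600))

echo "highp: $has_highp, run time: $h_rt_hours hours"
```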

To alter an already-pending job from non-highp to highp (without re-submitting it), use the following command:

$ qalter -mods l_hard highp true  job_id

For more information about qalter, use the command:

$ man qalter


How much virtual memory should I request in job submission?¶

It is important to request the correct amount of memory size when submitting a job. If the request is too small, the job may be killed at run time due to memory overuse. If the request is too large (e.g. larger than the compute nodes you intend to run the job), the job may not start.

The following are a few common techniques that can help you determine the virtual memory size of your program.

If your job has completed, run the command:

$ qacct -j job_ID

Look for the maxvmem value. This is the virtual memory size that your program consumed as seen by the scheduler. Specify h_data so that (h_data)*(number of slots) is no less than this value. For example, if maxvmem shows 11 GB, you can request 12 GB of memory on a compute node to run the job, such as one of the following:

$ -l h_data=12GB # for a single-core run (if your program is sequential)
$ -l h_data=6GB -pe shared 2 # for a 2-core run (if your program is shared-memory parallel)
$ -l h_data=2GB -pe shared 6 # for a 6-core run (if your program is shared-memory parallel)

Note

In these examples, the product of h_data * (number of slots) is always 12GB. If you specify -l h_data=12GB -pe shared 6, you are actually requesting 12GB*6=72GB of memory on a node.

Note

If you are running multiple slots on a node, h_data * (number of slots) needs to be smaller than the total memory size of your nodes.
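Because h_data is a per-slot value, the number to request is the measured maxvmem divided by the slot count, rounded up. The helper below is a hypothetical sketch working in whole GB; it returns the minimum per-slot value, to which you may want to add some headroom (the examples above round 11 GB up to 12 GB for the single-core case).

```shell
# per_slot_gb MAXVMEM_GB SLOTS
# Prints the smallest whole-GB h_data such that h_data * slots >= maxvmem.
per_slot_gb() {
  maxvmem_gb=$1
  slots=$2
  echo $(( (maxvmem_gb + slots - 1) / slots ))
}

one_core=$(per_slot_gb 11 1)
two_core=$(per_slot_gb 11 2)
six_core=$(per_slot_gb 11 6)

echo "1 slot: ${one_core}G, 2 slots: ${two_core}G, 6 slots: ${six_core}G"
```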

If you are not sure about the virtual memory size, run your program in “exclusive” mode first. Once done, use the qacct method above to determine the virtual memory size. To submit a job in exclusive mode, qsub the job with the command:

$ qsub -l exclusive your_job_script

where your_job_script is replaced by the actual file name of your job script. In this case, you should also specify h_data for node selection purposes. If you are running sequential or shared-memory parallel program (i.e. using only one compute node), we recommend using h_data=32GB and without specifying the number of slots. You can also append the exclusive option to the line starting with #$ -l in your job script, e.g.:

#$ -l h_rt=24:00:00,h_data=32G,exclusive

Again, if your program is sequential or shared-memory parallel, DO NOT specify the number of slots (i.e. there should be no -pe option in your job script or command line); otherwise you may over-request memory and prevent the job from starting.


How do I pack multiple short tasks of a job array?¶

Using a job array is a way to submit a large number of similar jobs. In some cases each job task takes only a few minutes to compute. Running a large number of extremely short jobs through the scheduler is very inefficient: the system is likely to spend more time finding a node and moving jobs in and out than doing the actual computing. With a simple change to your job script, you can pack multiple job-array tasks into one run (or dispatch), so you can benefit from the convenience of using job arrays and at the same time use the computing resources efficiently.

If you run too many short jobs (e.g. 500 less-than-1-minute jobs within a 4-hour time window), your other pending jobs may be temporarily throttled. Please understand that this is a way to ensure the scheduler’s normal operation and is not intended to inconvenience users.

At run time, the environment variable $SGE_TASK_ID uniquely identifies a task. The main ideas to pack multiple tasks into one run with minimum change to your job script are to:

change the job task step size.
create a loop inside the job script to execute multiple tasks (equal to the ‘step size’).

Of course, you may need to adjust h_rt to allocate sufficient wall-clock time to run the ‘packed’ version of job script.

Your original job script looks like:

#!/bin/bash
...
#$ -t 1-2000
...
./a.out $SGE_TASK_ID ...

To pack, say, 100 tasks into one run, change your job script to:

#!/bin/bash
...
#$ -t 1-2000:100
...
for i in `seq 0 99`; do
   my_task_id=$((SGE_TASK_ID + i))
   ./a.out $my_task_id ...
done

If you use csh, your original job script looks like:

#!/bin/csh
...
#$ -t 1-2000
...
./a.out $SGE_TASK_ID ...

To pack, say, 100 tasks into one run, change your job script to:

#!/bin/csh
...
#$ -t 1-2000:100
...
foreach i (`seq 0 99`)
   @ my_task_id = $SGE_TASK_ID + $i
   ./a.out $my_task_id ...
end
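The packed loops above assume that the number of tasks is a multiple of the step size. When it is not, the last dispatch would run past the end of the array. The bash-style sketch below shows one way to guard against that using the scheduler-provided variables SGE_TASK_LAST and SGE_TASK_STEPSIZE; the default values are only there so the sketch can be tried outside of a job (they mimic the last dispatch of a hypothetical `-t 1-1950:100` array), and the real work is left as a comment.

```shell
# Inside a job these are set by the scheduler; the defaults below are
# illustrative values for the last dispatch of "-t 1-1950:100".
: "${SGE_TASK_ID:=1901}"
: "${SGE_TASK_LAST:=1950}"
: "${SGE_TASK_STEPSIZE:=100}"

processed=0
i=0
while [ "$i" -lt "$SGE_TASK_STEPSIZE" ]; do
  my_task_id=$((SGE_TASK_ID + i))
  # Stop once we run past the last task of the array.
  if [ "$my_task_id" -gt "$SGE_TASK_LAST" ]; then
    break
  fi
  # ./a.out "$my_task_id" ...    (the real work would go here)
  processed=$((processed + 1))
  i=$((i + 1))
done
last_done=$((SGE_TASK_ID + processed - 1))

echo "handled $processed tasks, ending at task $last_done"
```

Reading the step size from SGE_TASK_STEPSIZE also keeps the loop in step with the `-t` line if the packing factor is changed later.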


How do I request large memory to run sequential (1-core) program?¶

If you are requesting up to 512 GB, use h_data to specify the requested memory size, e.g.:

$ qsub -l h_data=512G ...

You can also put -l h_data=512G in your job script file.

In this case, you are requesting a single core (slot), so you should not specify any -pe option.

If you are requesting more than 512GB, please contact us.


How do I request large memory to run multi-threaded (single node) program?¶

You will use -pe (number of cores) and -l h_data (memory per core) together to specify the total amount of memory you want. Note that the product of (number of cores)*(h_data) must be smaller than the total memory of a compute node, otherwise your job will never start.

For example, request 8 cores with 512G total memory (shared by all 8 cores):

$ qsub -l h_data=64G -pe shared 8 # any other needed resource

If your multi-threaded program will automatically use all CPUs available on the node, add the -l exclusive option, e.g.:

$ qsub -l h_data=64G,exclusive -pe shared 8 # any other needed resource

You can also put -pe shared 8 -l h_data=64G in your job script file.

If you are requesting more than 512 GB total memory, please contact us.

Why can I not submit too many individual jobs?¶

When there are too many pending jobs, the scheduler may fail to process all of them, causing scheduling problems. Therefore, to maintain stability, the system has a limit on how many jobs a user can submit. This limit is usually in the hundreds, and may vary depending on the system’s load.

Most users who submit a huge number of individual jobs should consider using job arrays for one obvious benefit: one job-array job can hold thousands of “tasks” (or individual “runs”) and consume only one (1) job out of the user’s number of jobs limit. A user can then submit hundreds of job arrays (each containing thousands of “runs”). This usually can cover some of the largest “through-put” runs on the cluster.

If each individual task is very short (e.g. finish in a few minutes), users should pack several tasks into one run to increase throughput efficiency. See this FAQ for more details. Running a large number of short jobs is a severe waste of the cluster’s computing power.

For more information about job array, see this page.


What is “fork: retry: Resource temporarily unavailable”?¶

On the login nodes each user account is limited to a certain number of running processes. The message “… fork: retry: Resource temporarily unavailable” means you have reached this limit. To work around it, you can get an interactive session on a compute node (with the desired resources in terms of memory and/or processors), which is not subject to this limit, and do your work there. See also role of the login nodes.

When submitting a job, I get “Unable to run job: got no response from JSV script…”.¶

This could happen when the scheduler (software) is too busy handling jobs. One way to overcome this problem is to increase the default timeout limit by adding a line at the bottom of your shell initialization file.

bash shell: Add the following line to your ~/.bashrc file:

export SGE_JSV_TIMEOUT=60

Then run:

$ source ~/.bashrc

(or just log out and log in), and try to submit your job again.

tcsh shell: Add the following line to your ~/.cshrc file:

setenv SGE_JSV_TIMEOUT 60

Then run:

$ source ~/.cshrc

(or just log out and log in), and try to submit your job again.


Storage and File systems¶

What file systems are backed up?¶

The home and project file systems are backed up to disk-based storage, with a target backup window that runs once every 24 hours. See Backups.

Protecting data from accidental loss¶

Here are several ways to protect your files from accidental loss.

Backup your files to another place, e.g. the hard drive on another computer. See File transfer. Make backup copies of files and directories in a compressed tar file. For example, to create a compressed tar file (.tgz) of all files under directory “myproject”:

$ tar -czf myproject.tgz myproject/

Enter man tar at the shell prompt for more information.

Modify your personal Linux environment to change the cp (copy), mv (move), and rm (remove) commands so that you are prompted for confirmation before any existing file is deleted or overwritten. bash shell: Add the following commands to your $HOME/.bashrc file:

$ alias cp='cp -i'
$ alias mv='mv -i'
$ alias rm='rm -i'

tcsh shell: Add the following commands to your $HOME/.cshrc file:

$ alias cp 'cp -i'
$ alias mv 'mv -i'
$ alias rm 'rm -i'

Modify your personal Linux environment to prevent any existing file from being overwritten by the output redirection (>) symbol. bash shell: Add the following command to your $HOME/.bashrc file:

$ set -o noclobber

tcsh shell: Add the following command to your $HOME/.cshrc file:

$ set noclobber
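To see what noclobber does in practice, the sketch below (written for bash or any POSIX shell) tries to overwrite an existing file with a plain > and then overwrites it deliberately with the >| operator. It runs in a temporary directory, and results.txt is a placeholder file name.

```shell
#!/bin/sh
# Demo in a temporary directory; "results.txt" is a placeholder file name.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
echo "original" > results.txt

set -o noclobber

# With noclobber set, a plain > refuses to overwrite an existing file.
# The redirection runs in a subshell so that its failure cannot abort this script.
if (echo "replacement" > results.txt) 2>/dev/null; then
    first_attempt="overwritten"
else
    first_attempt="refused"
fi
after_refusal=$(cat results.txt)
echo "plain > was $first_attempt; file still contains: $after_refusal"

# The >| operator overrides noclobber for a deliberate overwrite.
echo "replacement" >| results.txt
after_override=$(cat results.txt)
echo "after >| the file contains: $after_override"

set +o noclobber
```

In tcsh the corresponding override operator is >! rather than >|.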

Use the chmod command to remove your own write access to files you do not intend to change or delete. Example:

$ chmod -w myfile

You will be unable to accidentally modify such a file in the future. If you try to delete a file for which you have removed your own write access without specifying the -f (force) flag on the rm command, you will be prompted and have to reply affirmatively before the file will be removed. Enter man chmod at the shell prompt for more information.
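The effect of chmod -w can be checked with ls -l, and write access can be restored later with chmod u+w. The sketch below is illustrative only: it works in a temporary directory on a placeholder file named myfile and prints the permission string before and after restoring write access.

```shell
#!/bin/sh
# Demo in a temporary directory; "myfile" is a placeholder file name.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
echo "final dataset" > myfile

# Remove write access, then show the first ten characters of the ls -l line
# (the file type and permission bits).
chmod -w myfile
protected_mode=$(ls -l myfile | cut -c1-10)
echo "after chmod -w:  $protected_mode"

# Restore your own write access when the file needs to change again.
chmod u+w myfile
restored_mode=$(ls -l myfile | cut -c1-10)
echo "after chmod u+w: $restored_mode"
```

After chmod -w the permission string should contain no w; after chmod u+w the owner's w is back.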

My program writes a lot of scratch files in my home directory. This results in exceeding my disk space quota. What is the solution?¶

There are several things you can do:

If you are a member of a research group which has contributed nodes to the Hoffman2 Cluster, your PI can purchase additional disk space for use by the members of your group.

Each process in your parallel program can write to the local /work on the node it is running on. When the program finishes, you can copy the files off to a place where you have more space. Since /work is local to the nodes, using it is very efficient.

You can write to /u/scratch and you have 7 days after the job completes to copy the files somewhere else.
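The write-to-scratch-then-copy pattern described above could look roughly like the sketch below. It is only an outline under stated assumptions: SCRATCH_BASE and RESULTS_DIR are hypothetical variables introduced for this example (you would point SCRATCH_BASE at /work or /u/scratch on the cluster; it falls back to /tmp so the sketch can be tried anywhere), and the two echo commands stand in for your real program.

```shell
#!/bin/sh
# Assumption: SCRATCH_BASE and RESULTS_DIR are hypothetical variables for this sketch.
scratch_base="${SCRATCH_BASE:-/tmp}"
results_dir="${RESULTS_DIR:-$PWD/myjob_results}"

# Create a private working directory in the scratch area.
work_dir=$(mktemp -d "$scratch_base/myjob.XXXXXX") || exit 1

# Stand-in for the real program: it writes scratch and output files in work_dir.
(
    cd "$work_dir" || exit 1
    echo "intermediate data" > scratch.tmp
    echo "final answer" > output.txt
)

# Keep only the files you need, then clean up the scratch space.
mkdir -p "$results_dir"
cp "$work_dir/output.txt" "$results_dir/"
rm -rf "$work_dir"
echo "results saved in $results_dir"
```

Only the final output is copied back, so the scratch files never count against the home directory quota.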

How do I transfer my files from the Hoffman2 Cluster to my machine?¶

For a file of any size, you can use the scp command to transfer a file or directory from one machine or system to another. For safety reasons, as outlined in the Security Policy for IDRE-Hosted Clusters, always issue the scp command from your local machine, whichever direction the files are moving. NEVER issue scp from the command line of the IDRE-Hosted cluster to reach back to your local machine.
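As an illustration, a download started from your local machine might look like the sketch below. The user name joebruin and the path myproject/results.txt are placeholders, and the script only prints the command so that you can inspect it before running anything.

```shell
#!/bin/sh
# Run this on your LOCAL machine, not on the cluster.
# Placeholders: replace login_id and remote_path with your own values.
login_id="joebruin"
remote_path="myproject/results.txt"   # relative to your cluster home directory

# Build the command first so it can be inspected before anything is transferred.
# -r also allows remote_path to be a directory; the final "." is the local destination.
scp_command="scp -r ${login_id}@hoffman2.idre.ucla.edu:${remote_path} ."
echo "$scp_command"

# Uncomment to run the transfer (assumes the path contains no spaces):
# $scp_command
```

To upload instead, swap the two arguments so that the local path comes first and the login_id@hoffman2.idre.ucla.edu: destination comes last; the command is still typed on your local machine.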

Is there a simpler way to copy all my files to my new Hoffman2 account?¶

Once you have been notified that your login ID has been added to the Hoffman2 Cluster, log in to your local machine and, from your local machine’s home directory, enter the command:

$ tar -clpzf - * | ssh loginid@hoffman2.idre.ucla.edu tar -xpzf -

Replace loginid with your Hoffman2 Cluster loginid.

Note that this transfer will not copy any of the hidden (dot) files from your local home directory to your new home directory on the Hoffman2 Cluster. Since many of the dot files in your home directory are operating system version specific, it would not be appropriate or useful to transfer these files.
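The reason dot files are skipped is that the shell expands * to visible names only, so tar never receives the hidden files. The sketch below shows this locally under simplifying assumptions: the ssh step is replaced by a subshell that unpacks into a second directory, and old_home, new_home, notes.txt and .hidden_settings are placeholder names.

```shell
#!/bin/sh
# Local stand-in for the transfer; all directory and file names are placeholders.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/old_home" "$demo_dir/new_home"
echo "thesis draft" > "$demo_dir/old_home/notes.txt"
echo "old shell settings" > "$demo_dir/old_home/.hidden_settings"

# The shell expands * to visible names only, so tar never sees the dot file.
cd "$demo_dir/old_home" || exit 1
tar -cpzf - * | (cd "$demo_dir/new_home" && tar -xpzf -)

# List everything that arrived, including any hidden files.
copied_files=$(ls -A "$demo_dir/new_home")
echo "$copied_files"
```

Only notes.txt reaches new_home; a dot file you do want to move has to be named explicitly on the tar command line.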

What is my disk storage quota and usage?¶

From the Hoffman2 Cluster login nodes, at the shell prompt, enter:

$ myquota

The myquota command will report the usage and quota for filesystems where your userid has saved files, including /u/scratch as well as your home directory. Use the myquota command instead of the quota command.


Other¶

How do I print my output?¶

There is no printer directly associated with the Hoffman2 Cluster. If you have a printer attached to your local desktop machine, you can copy your file to your local machine and print your file locally. Recall that for security reasons you should issue the scp command from your local machine, and not from the Hoffman2 command line.

Here is a little script that you could save on a Unix/Linux machine that might make printing a text file easier. You might name this script h2print.

#!/bin/sh
# Copy the named file from your Hoffman2 home directory, then print the local copy.
scp "login_id@hoffman2.idre.ucla.edu:$1" .
lpr "$1"

where login_id is your Hoffman2 Cluster user name (i.e., login ID). You can omit login_id@ if your user name on your local machine is the same as your Hoffman2 Cluster login ID. Note the period (.) at the end of the scp command line. Mark the script as executable with the chmod command:

$ chmod +x h2print

To print a Hoffman2 text file in your home directory, from your local machine’s command prompt, enter:

$ h2print hoffman2_filename

where hoffman2_filename is the name of your text file on the Hoffman2 Cluster that you want to print.

The scp command will prompt you for your Hoffman2 Cluster password, unless you have previously set up an RSA key pair on your local machine with the ssh-keygen -t rsa command and appended a copy of the public key (id_rsa.pub) to ~/.ssh/authorized_keys on your Hoffman2 Cluster account.
