Job scheduling policy

Univa Grid Engine

All jobs must be started through the queuing system

The Univa Grid Engine (UGE) is the job management system used on the Hoffman2 Cluster to ensure balanced use of resources by matching job needs to available compute resources. UGE serves as the job scheduler and enforces the queuing policies described on this page. It will always send you mail in case your batch job fails.

All jobs run on Hoffman2 Cluster compute nodes, whether batch or interactive, must be started through the queuing system.

Any processes on compute nodes not started by Grid Engine will be terminated without prior notice.

Queues

The Hoffman2 Cluster has the following types of queues:

  • Queues with time limit of 14 days (highp)

  • Queues with time limit of 24 hours

  • Queues which use the interactive compute nodes

Queues with time limit of 14 days (highp)

These queues allow users who belong to a resource group that has contributed nodes to the Hoffman2 Cluster to use their group’s nodes for batch jobs that need to run for an extended period of time. Users who belong to more than one resource group can direct a job to a particular group’s nodes.

  • Jobs may run for as long as 14 days (336 hours).

  • Available only to members of resource groups which have contributed nodes to the Hoffman2 Cluster.

  • All members of a resource group are limited to using the nodes that their group has contributed.

  • Nodes contributed by a resource group as part of the Shared Cluster Program will be made available within 24 hours of being requested. Note that 24 hours is the maximum wait period; nodes may be available sooner, depending on currently running and pending jobs.

Important

The 24-hour availability does not mean that every job submitted to these queues will start within 24 hours, because the nodes may be in use by other members of the same resource group.

Queues with time limit of 24 hours

The purpose of these queues is to allow users to run batch jobs on the extended shared Hoffman2 Cluster and to utilize free cycles on nodes contributed by resource groups. The 24-hour queues have access to IDRE-contributed nodes from the Base Shared Cluster and to resource group processors that are not currently running jobs.

  • Jobs may run for as long as 24 hours.

  • Available to all users on the Hoffman2 Cluster including general campus users.

  • All members of a resource group that has contributed nodes to the cluster may use more resources than their own group’s contribution.

  • There is no guaranteed start time in these queues. Start time is subject to overall cluster utilization, and the availability of nodes that can satisfy the amount of memory and number of processors requested by the job.
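
The run-time limits of the two batch queue types described above can be summarized in a small decision helper. This is only an illustrative sketch: the function name and the returned labels are hypothetical and are not UGE queue names or options. The only facts it encodes are the 24-hour limit, the 14-day (336-hour) limit, and the rule that the 14-day queues are available only to members of contributing resource groups.

```python
HIGHP_LIMIT_HOURS = 14 * 24   # 336 hours, the highp time limit
SHARED_LIMIT_HOURS = 24       # time limit of the 24-hour queues


def choose_queue_type(runtime_hours: float, group_contributed_nodes: bool) -> str:
    """Suggest a queue type for a batch job (labels are hypothetical)."""
    if runtime_hours <= 0:
        raise ValueError("runtime_hours must be positive")
    if runtime_hours <= SHARED_LIMIT_HOURS:
        # Fits in the queues that are open to all users.
        return "24-hour"
    if runtime_hours <= HIGHP_LIMIT_HOURS and group_contributed_nodes:
        # Longer jobs need the group's own contributed nodes.
        return "highp"
    # Too long for any queue the user can access as-is.
    return "checkpoint-or-special-request"


print(choose_queue_type(72, group_contributed_nodes=True))
```

A job that fits within 24 hours is steered to the 24-hour queues here because those queues can also use free cycles on other nodes; a member of a contributing group could equally choose the 14-day queues for such a job.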

Queues which use the interactive compute nodes

The interactive queues are intended for interactive sessions, including licensed applications which IDRE has purchased for general use. These queues include IDRE-contributed nodes which are not included in the 14-day queues. See How to Get an Interactive Session through UGE for more information.

  • Sessions have a 24-hour time limit.

  • Available to all users on the Hoffman2 Cluster.

  • Limited to 8 processors per user.

  • Sessions should start immediately in these queues, depending on the number of processors requested. Immediate startup of interactive sessions is guaranteed for sessions requesting a single processor. Please contact user support if all requirements are met and an interactive session has not started within one or two minutes.

Queues with time limit of 2 hours

  • Jobs have a 2-hour time limit.

  • Available to all users on the Hoffman2 Cluster.

  • Each job or array-job task must run on a single node.

  • There is no guaranteed start time. Jobs usually start running within 5-10 minutes.

Reserve Adequate Memory and Processors

Because more than one job may run at the same time on the same multi-processor node, a job or interactive session which uses more memory or more processors than it has requested will adversely impact other jobs which are also running on that node. If you are not sure how much memory your program will use, you should reserve an entire node for your job.

You need to request at least as many processors as your program uses threads. Your job or interactive session does not get the exclusive use of more than one processor on a node unless you tell the Univa Grid Engine to use a parallel environment. See Running a Batch Job or How to Get an Interactive Session through UGE for information about using UGE parallel environments to reserve additional memory or processors.
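
One way to keep a program's thread count within its reservation is to read the slot count from the job's environment rather than hard-coding it. The sketch below assumes the job runs under Grid Engine with the `NSLOTS` environment variable set to the number of granted slots; the helper name `allowed_threads` is hypothetical, and the fallback to a single thread mirrors the single-processor default described above.

```python
import os


def allowed_threads(env=None) -> int:
    """Return the number of slots granted to this job, defaulting to 1."""
    env = os.environ if env is None else env
    raw = env.get("NSLOTS", "1")  # assumption: set by the scheduler for the job
    try:
        slots = int(raw)
    except ValueError:
        slots = 1  # unparsable value: fall back to one thread
    return max(1, slots)


if __name__ == "__main__":
    n = allowed_threads()
    # OMP_NUM_THREADS is the standard OpenMP variable controlling thread count.
    os.environ["OMP_NUM_THREADS"] = str(n)
    print(f"using {n} thread(s)")
```

Programs that manage their own worker pools can pass the same value to the pool size instead of setting `OMP_NUM_THREADS`.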

Jobs or interactive sessions using more processors or significantly more memory than was reserved with UGE may be terminated without prior notice by the System Administrator.

It is important to specify your resource requirements accurately so that UGE can pick the correct resources for your job’s execution. Always specify the amount of time that your job requires (the h_rt or time parameter); otherwise the job scheduler enforces its default, which is currently two hours. Do not request more time, memory, or processors than your job or interactive session requires, because doing so will delay its start. It may also defeat the job scheduler’s back-filling capability and waste cluster resources.
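
As a rough illustration of an explicit resource request, the sketch below assembles a `qsub` argument list. Only `h_rt` comes from the text above. The `h_data` memory resource, the parallel environment name `shared`, and the script name `job.sh` are assumptions made for the example and should be checked against Running a Batch Job. The helper names are hypothetical, and the code only builds the argument list; it does not submit anything.

```python
def format_h_rt(hours: int, minutes: int = 0) -> str:
    """Format a run-time request as H:MM:SS."""
    if hours < 0 or not 0 <= minutes < 60:
        raise ValueError("invalid run time")
    return f"{hours}:{minutes:02d}:00"


def build_qsub_args(script, hours, mem_per_slot_gb, slots=1, pe_name="shared"):
    """Build a qsub command line as a list of arguments.

    Assumptions: memory is requested with h_data (per slot) and multiple
    processors on one node are requested through a PE named 'shared'.
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    resources = f"h_rt={format_h_rt(hours)},h_data={mem_per_slot_gb}G"
    args = ["qsub", "-l", resources]
    if slots > 1:
        # A parallel environment is needed to reserve more than one processor.
        args += ["-pe", pe_name, str(slots)]
    args.append(script)
    return args


print(" ".join(build_qsub_args("job.sh", hours=8, mem_per_slot_gb=4, slots=4)))
```

Keeping the requested time and memory close to what the job actually uses gives the scheduler more room to back-fill it into idle slots.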

Checkpointing

Programs that require more than 24 hours to complete and which need to run in queues limited to 24 hours should checkpoint before 24 hours is up so that they can be continued in another job. See How to Checkpoint for more information.
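
The general idea of checkpointing can be sketched as a loop that saves its state regularly and stops once a time budget is used up, so that a later job can resume from the saved state. This is a minimal, application-level illustration and not the method described in How to Checkpoint; the function name, the JSON state layout, and the file path are all hypothetical.

```python
import json
import os
import time


def run_with_checkpoints(total_steps, checkpoint_path, time_budget_s,
                         step_fn, clock=time.monotonic):
    """Run step_fn repeatedly, resuming from and saving to checkpoint_path.

    Returns True if all steps finished, False if the time budget ran out.
    """
    state = {"step": 0, "value": 0}
    if os.path.exists(checkpoint_path):
        # Resume from the state left by a previous job.
        with open(checkpoint_path) as f:
            state = json.load(f)

    start = clock()
    while state["step"] < total_steps:
        if clock() - start >= time_budget_s:
            break  # stop before the queue's time limit is reached
        state["value"] = step_fn(state["value"])
        state["step"] += 1
        # Write to a temporary file first so that a job killed mid-write
        # never leaves a corrupt checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state["step"] >= total_steps
```

In a 24-hour queue the time budget would be set somewhat below 24 hours to leave time for the final save. A real program would usually save less often than every step.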

Special Requests

If your account is in the general campus group and your program absolutely requires more than 24 hours to run and cannot be stopped and restarted within the 24-hour time frame, you can make a special request to have it run for a longer period of time. Submit a ticket to technical support that includes the following information:

  • Your name

  • Your sponsor or Principal Investigator’s name

  • Your Hoffman2 login ID

  • An explanation of why access to a longer duration queue is critical for your work.

  • How long your job needs to run (e.g., 3 days).

  • The duration of your request (i.e., how many days or weeks you will need access to this queue).