*******
Storage
*******

Sensitive Data
==============
.. important::
 Personal information and other sensitive data, including statutory, regulatory, and contractually protected data - for example, human subjects research, restricted research, student and educational data, and personal health information (PHI) — are prohibited on the Hoffman2 Cluster.


Storage Overview
================

The Hoffman2 Cluster home directories, temporary scratch directories, and group-purchased project directories are supported by 6PB of NetApp and 2PB of all-flash VAST network-based storage systems.  A fixed amount of home and scratch directory space is provided free to all users.  Research groups whose members need to share data or applications or whose users need storage beyond their 60GB home directory allocation can :ref:`purchase additional HPC storage in increments of 1TB per year <How to purchase storage>`.

A fourth filesystem, locally mounted to each of the :term:`compute nodes` (and *not* exported between them), is also available as local scratch via the :term:`scheduler` set ``$TMPDIR`` variable. 


.. _user_filesystems:
.. list-table:: Filesystems available for user data
   :widths: auto
   :header-rows: 1
   :class: tight-table

   * - Name
     - Path
     - Type
     - Quota (disk/file)
     - Backups
     - Temporary
   * - :ref:`$HOME <Your home directory>`
     - ``/u/home/``
     - NFS
     - 60 GB / 1,000,000
     - Yes
     - No
   * - :ref:`project <Your project directory>`
     - ``/u/project/``
     - NFS
     - varies
     - Yes
     - No
   * - :ref:`$TMPDIR <$TMPDIR>`
     - ``/work``
     - local disk
     - varies
     - No
     - Yes
   * - :ref:`$SCRATCH <Your scratch directory>`
     - ``/u/scratch/``
     - NFS
     - 2 TB / 5,000,000
     - No
     - Yes

| **Table description**
| *Name* - the general storage name, or the :ref:`environmental variable <Unix environment & environmental variables>` that points to it
| *Path* - the path to the storage directory
| *Type* - remote or local file system
| *Quota* - limits placed on your data, both in terms of disk usage and number of files
| *Backups* - your data is backed up for disaster recovery purposes 
| *Temporary* - indicates whether the space is routinely reclaimed or not


Your home directory
-------------------

Each user account is given a 60GB home directory (``$HOME``), which is physically located on our fast, highly responsive NetApp storage. Files in your home directory have no expiration date.  The name of your home directory follows the patters:

.. code-block:: bash

  /u/scratch/${USER:0:1}/$USER


where ``${USER:0:1}`` is the first character of your username and ``$USER`` is the environmental variable corresponding to you account name. For example, if your username is ``joebruin``, your home directory would be: ``/u/home/j/joebruin``.

Your home directories, mounted on every node in the cluster, is a place to store and manage your scripts and source code.  The ``$HOME`` environment variable refers to your home directory. It is better to use ``$HOME`` than to hard code its path in your scripts and source code.

To check the path of your home directory you can use the following command:

.. code-block:: bash

  echo $HOME


Your home directory is periodically backed up. In case of catastrophic hardware failure, files in home directories will be restored from the last backup that was taken. The purpose of the backup is not to be able to restore files that you accidentally delete or overwrite. See :ref:`Protecting from accidental loss <Protecting data from accidental loss>`.


Your project directory
----------------------

.. tip:: If members of your research group require more than 60GB of storage, need to share data or need a common application area, backed-up project storage space is available for purchase in 1TB increments.  Please talk to your faculty sponsor to see :ref:`Purchasing additional resources <Purchasing additional resources>`.

If your resource group has purchased additional storage, you may have a directory on that filesystem. If that is the case, in your home directory there will be convenient symlinks to your project subdirectory(s). To see what project storage space is available for your user account, issue the command:

.. code-block:: console
 
 $ ls -l $HOME | grep project-

For example, user ``joebruin``, part of the ``bruins`` group but also part of the ``bears`` group, with a project directory in ``/u/project/bruins/joebruin`` and a project directory in ``/u/project/bears/joebruin``, would see the following symlinks in the ``$HOME`` directory:

.. code-block:: console

   $HOME/project-bruins 

is the same as:

.. code-block:: console

   /u/project/bruins/joebruin


.. note:: Data written to a project directory do not count over the 60GB :ref:`quota <Storage quotas>` of you ``$HOME`` as they are written under the ``/u/project`` filesystem and not the ``/u/home`` one.


therefore for user ``joebruin`` the directories:


.. code-block:: console

   lrwxrwxrwx. 1 joebruin bruins 27 May 21  2019 /u/home/j/joebruin/project-bruins -> /u/project/bruins/joebruin
   lrwxrwxrwx. 1 joebruin bruins 27 May 21  2019 /u/home/j/joebruin/project-bears -> /u/project/bears/joebruin


Your scratch directory
----------------------

Each user has a scratch directory (``$SCRATCH``), with an individual quota of 2TB that is physically located on all-flash VAST network-based storage.  The name of your scratch directory is like:

.. code-block:: bash

  /u/scratch/${USER:0:1}/$USER

where ``${USER:0:1}`` is the first character of your username. For example, if your user name is ``joebruin``, your scratch directory is ``/u/scratch/j/joebruin``. On the Hoffman2 Cluster the ``$SCRATCH`` environment variable refers to your scratch directory.

To check the path of your scratch directory you can use the following command:

.. code-block:: bash

  echo $SCRATCH


.. note:: Your scratch directory is a temporary space that is accessible from any node in the cluster and is intended to support your output from large jobs for later retrieval and analysis. Any important file should not be stored there, as files in your scratch directory are eligible for removal after 14 days. This retention policy guarantees that enough space exists for the creation of new files. 


Storage quotas
==============
To see your quota and current disk space usage for your home, storage and any project directories, at the shell prompt, enter:

.. code-block:: bash

 myquota

.. include:: Backups.inc

File systems for temporary use
==============================
There are two kinds of file systems for temporary files that you may use:

* :ref:`$TMPDIR` - local to each :term:`compute node`, deleted at the end of a job 
* :ref:`$SCRATCH` - mounted everywhere on each :term:`node` on the cluster, files within it are deleted after 14 days 

The purpose of these file systems is to accommodate data used by jobs while they are executing. 

.. warning:: Do not write files in any /tmp directory. 


$TMPDIR
-------

The :term:`scheduler` sets up for each :ref:`interactive session <Requesting interactive sessions>` or :ref:`batch job <Submitting batch jobs>` a temporary directory, which can be referenced to with the ``$TMPDIR`` environmental variable. This directory is created on the local disk of the master node where the job/interactive session is running. Each node has local storage that is used as local scratch, your job can use up to its upper limit (which will vary depending on the generation of the node) only if you request :ref:`exclusive node access <exclusive>`.  Access to this directory may be faster than to access your home, project and scratch directories.  The files in this directory are not visible from all nodes in a parallel job; each node has its own directory (therefore the use of this temporary directory is not compatible with parallel I/O).  

.. note:: The batch system creates this directory when your job starts and deletes it when your job ends.  

.. tip:: Files in $TMPDIR will be deleted by the job scheduler at the end of your job or interactive session. If you want to keep files written to ``$TMPDIR``, tell your submission script to copy them to permanent space (e.g., ``$HOME``) before the end of your job or session. Files written to ``$TMPDIR`` are not backed up.

  For example, in your job script, you could have something like:

  .. code-block:: console

      # enter the temporary directory:
      cd $TMPDIR                  
      # ... execute your code
      # ... write some files to $TMPDIR
      # copy files from $TMPDIR to home directory:
      cp -r $TMPDIR/*  $HOME/     
      # job exits; $TMPDIR and its files are deleted automatically


Use $TMPDIR for life-of-the-job and high-activity files to avoid the overhead of network traffic associated with the network file systems and improve your job’s throughput. For your convenience examples of how to modify you C/Fortan/Java/Perl/Python code to perform I/O on ``$TMPDIR`` are given here:


.. toctree::
   :maxdepth: 1

   ./sgeenviron.c.rst
   ./sgeenviron.f90.rst
   ./sgeenviron.java.rst
   ./sgeenviron.pl.rst
   ./sgeenviron.py.rst


Using ``$TMPDIR`` may not be suitable for MPI-style jobs because it is not visible by other compute nodes within the same MPI run.

.. warning:: Files stored on a compute node's local file system, which are not related to a job running on that node will be deleted without notice.


$SCRATCH
--------

.. code-block:: console

 $SCRATCH

The global scratch file system is mounted on all nodes of the Hoffman2 Cluster. There is a 2TB per user limit. The system provides an environment variable ``$SCRATCH`` which corresponds to a unique directory for your login ID. 

.. tip:: To submit a job that will write large and frequent output to ``$SCRATCH`` you could try any of the following:

  1. make sure that any variable for temporary files storage that your software may use is pointed to ``$SCRATCH``
  2. make sure that your job submission script copies relevant files to ``$SCRATCH`` and starts the computation there, for example:

   .. code-block:: console

       # create a variable with the local directory name:
       RUN_DIR=`pwd`                      
       # create a directory in $SCRATCH after the unique $JOB_ID:
       mkdir $SCRATCH/$JOB_ID             
       # copy files needed for the run:
       cp -rp ./* $SCRATCH/$JOB_ID        
       cd $SCRATCH/$JOB_ID                
       # enter the temporary directory:
       # ... do some computations
       # ... write some files to $SCRATCH
       # copy files from $TMPDIR to home directory:
       cp $SCRATCH/$JOB_ID/*  $HOME/      
       # job exits; $SCRATCH/$JOB_ID files will be deleted in 14 days

  3. modify your code so that it will perform the intensive part of the I/O on ``$SCRATCH``


As the ``$SCRATCH`` file system resides on fast flash-based storage, writing to ``$SCRATCH`` is especially recommended,  for performance reasons, to parallel jobs, especially those with high I/O requirements.

Under normal circumstances, files you store in ``$SCRATCH`` are allowed to remain there for 14 days. Any files older than 14 days may be automatically deleted by the system to guarantee that enough space exists for the creation of new files. However, there may be occasions when even after all files older than 14 days have been deleted, there is still insufficient free space remaining. Under that circumstance, files belonging to those users who are using the preponderance of the disk space in ``$SCRATCH`` will be deleted even though they have not been there for 14 days. Files written to $SCRATCH are not backed up.

.. warning:: Files on the global scratch file system which are outside of ``$SCRATCH`` directories will be deleted without further notice.