General information
The supported Hadoop version is 2.6.0. It is installed at /u/local/apps/hadoop/hadoop-2.6.0. We recommend following the sections below to run Hadoop on the Hoffman2 cluster, but if you are an expert user, you can load Hadoop into your environment with:
$ module load hadoop
This will appropriately set the environment variables HADOOP_HOME, HADOOP_INSTALL, HADOOP_MAPRED_HOME, HADOOP_COMMON_HOME, HADOOP_HDFS_HOME, YARN_HOME, HADOOP_COMMON_LIB_NATIVE_DIR, and PATH.
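For example, after loading the module you can verify that the variables point at the installation. This is a quick sketch; the bin/ subdirectory shown for which is an assumption based on the standard Hadoop layout:
$ module load hadoop
$ echo $HADOOP_HOME
/u/local/apps/hadoop/hadoop-2.6.0
$ which hadoop
/u/local/apps/hadoop/hadoop-2.6.0/bin/hadoop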
You must not run Hadoop on the login nodes; either use qrsh to get an interactive session or use qsub to submit a batch job.
The following sections provide simple examples of using Hadoop on the Hoffman2 cluster.
Submitting a Hadoop batch job
Here is a simple word-count example that illustrates how to submit a Hadoop batch job.
Suppose you have a data file $SCRATCH/in/file (download):
$ cat $SCRATCH/in/file
This is one line.
This is another line.
This is the last line.
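If you prefer to create this file yourself rather than downloading it, one way is with a here-document:
$ mkdir -p $SCRATCH/in
$ cat > $SCRATCH/in/file <<'EOF'
This is one line.
This is another line.
This is the last line.
EOF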
We want to use the Hadoop Map-Reduce function to count the words in the file. Here is an example of the job script (download):
#!/bin/sh
#$ -cwd
#$ -pe shared 8
#$ -l h_rt=8:00:00,h_data=1G,hadoop
# configure the hadoop cluster based on a group of nodes
source /u/local/bin/hdfsstart.sh
# your map-reduce work is here
hadoop fs -copyFromLocal $SCRATCH/in /in
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /in /out
hadoop fs -cat /out/*
# cleanup the tmp files and hadoop services on the nodes
source /u/local/bin/hdfsstop.sh
# end of job script
To submit the job script (saved as $SCRATCH/wordcount.sh):
$ cd $SCRATCH
$ qsub wordcount.sh
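As with any other batch job, you can monitor its status with the standard Grid Engine qstat command:
$ qstat -u $USER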
The scripts hdfsstart.sh and hdfsstop.sh automatically set up and clean up the Hadoop cluster, so you do not need to run module load hadoop in this case. To adapt the script, replace the part between hdfsstart.sh and hdfsstop.sh with your own map/reduce work.
Using Hadoop interactively
Running Hadoop in an interactive session follows the same idea as batch job submission, except that you use the qrsh command instead of qsub. For example, to get an interactive session for 4 hours to use Hadoop, the qrsh command would be:
$ qrsh -l h_rt=4:00:00,hadoop
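If your interactive work needs more memory or cores, you can add the same resource requests used in the batch script above. This is a sketch; adjust the values to your needs:
$ qrsh -l h_rt=4:00:00,h_data=1G,hadoop -pe shared 8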
You may have to wait a few minutes. When the prompt returns, you are placed on a compute node. Start the Hadoop cluster using:
$ source /u/local/bin/hdfsstart.sh
Use the jps command to check whether the Hadoop services started successfully. If everything works, the output should look similar to the following:
$ jps
909 ResourceManager
1148 NodeManager
731 SecondaryNameNode
32614 NameNode
383 DataNode
1691 Jps
Make sure ResourceManager, NodeManager, SecondaryNameNode, NameNode, and DataNode are all present in the output. The numbers (process IDs) shown in the first column may be different.
At this point, you can do map-reduce work interactively until the time limit of the qrsh session is reached.
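For example, you can run the same word-count job from the batch example by typing its commands at the prompt:
$ hadoop fs -copyFromLocal $SCRATCH/in /in
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /in /out
$ hadoop fs -cat /out/*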
When you are done, and before you end your qrsh session, run the following command to clean up the Hadoop services:
$ source /u/local/bin/hdfsstop.sh
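Assuming hdfsstop.sh stops all of the daemons it started, running jps again should show only the Jps process itself (the process ID will differ):
$ jps
2041 Jps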