
Using Hadoop

This page is under construction

General information

The supported Hadoop version is 2.6.0. It is installed at /u/local/apps/hadoop/hadoop-2.6.0. We recommend following the sections below to run Hadoop on the Hoffman2 cluster, but expert users can load Hadoop into their environment directly with:

$ module load hadoop

This sets the environment variables HADOOP_HOME, HADOOP_INSTALL, HADOOP_MAPRED_HOME, HADOOP_COMMON_HOME, HADOOP_HDFS_HOME, YARN_HOME, HADOOP_COMMON_LIB_NATIVE_DIR, and PATH appropriately.
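To confirm that the module loaded correctly, you can check one of these variables and ask Hadoop for its version (the output below is illustrative; the version command also prints build details):

$ echo $HADOOP_HOME
/u/local/apps/hadoop/hadoop-2.6.0
$ hadoop version
Hadoop 2.6.0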

You must not run Hadoop on the login nodes; either use qrsh to get an interactive session or use qsub to submit a batch job.

The following sections provide simple examples of using Hadoop on the Hoffman2 cluster.

Submitting a Hadoop batch job

Here is a simple word-count example that illustrates how to submit a Hadoop batch job.

Suppose you have a data file $SCRATCH/in/file:

$ cat $SCRATCH/in/file
This is one line.
This is another line. 
This is the last line.
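If you want to recreate this input file yourself, something like the following will do (a minimal sketch; any small text file works as input):

$ mkdir -p $SCRATCH/in
$ cat > $SCRATCH/in/file << 'EOF'
This is one line.
This is another line.
This is the last line.
EOF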

We want to use Hadoop MapReduce to count the words in this file. Here is an example job script:

#!/bin/sh
#$ -cwd
#$ -pe shared 8
#$ -l h_rt=8:00:00,h_data=1G,hadoop

# configure the hadoop cluster based on a group of nodes
source /u/local/bin/hdfsstart.sh

# your map-reduce work is here
hadoop fs -copyFromLocal $SCRATCH/in /in
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /in /out
hadoop fs -cat /out/*

# clean up the temporary files and stop the Hadoop services on the nodes
source /u/local/bin/hdfsstop.sh
# end of job script

To submit the job script (saved as $SCRATCH/wordcount.sh):

$ cd $SCRATCH
$ qsub wordcount.sh
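While the job is queued or running, you can check its status with qstat. When it completes, everything the script printed, including the word counts from hadoop fs -cat /out/*, appears in the job's standard output file in the submission directory (1234567 below stands for your actual job ID). For the sample input above, the counts should look similar to:

$ qstat -u $USER
$ cat wordcount.sh.o1234567
...
This	3
another	1
is	3
last	1
line.	3
one	1
the	1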

The scripts hdfsstart.sh and hdfsstop.sh automatically set up and tear down the Hadoop cluster, so you do not need to run module load hadoop in this case. To adapt this script, replace the commands between hdfsstart.sh and hdfsstop.sh with your own map/reduce work.
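For example, the middle section might run your own application jar instead of the bundled examples (a sketch; myjob.jar, MyJobClass, and the HDFS paths are placeholder names for your own application):

# copy your input data into HDFS
hadoop fs -copyFromLocal $SCRATCH/mydata /mydata
# run your own MapReduce application (placeholder jar and class names)
hadoop jar $SCRATCH/myjob.jar MyJobClass /mydata /myresults
# copy the results back out of HDFS
hadoop fs -copyToLocal /myresults $SCRATCH/myresults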

Using Hadoop interactively

Running Hadoop in an interactive session follows the same idea as batch job submission, except that you use the qrsh command instead of qsub. For example, to get a 4-hour interactive session for Hadoop:

$ qrsh -l h_rt=4:00:00,hadoop
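As with batch jobs, you can request additional resources on the qrsh command line. For example, to mirror the batch script above with 8 shared cores and 1 GB of memory per core (an illustrative combination; adjust to your needs):

$ qrsh -l h_rt=4:00:00,h_data=1G,hadoop -pe shared 8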

You may have to wait a few minutes. When the prompt returns, you are placed on a compute node. Start the Hadoop cluster using:

$ source /u/local/bin/hdfsstart.sh

Use the jps command to check whether the Hadoop services are successfully started. If everything works, the output should look similar to the following:

$ jps
909 ResourceManager
1148 NodeManager
731 SecondaryNameNode
32614 NameNode
383 DataNode
1691 Jps

Make sure that ResourceManager, NodeManager, SecondaryNameNode, NameNode, and DataNode are all present in the output. The numbers (process IDs) shown in the first column will likely differ.

At this point, you can do map-reduce work interactively until the time limit of the qrsh session is reached.
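For example, you can run the same word-count job from the batch section by hand (assuming the sample input file from above exists in $SCRATCH/in):

$ hadoop fs -copyFromLocal $SCRATCH/in /in
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /in /out
$ hadoop fs -cat /out/*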

When you are done, before ending your qrsh session, run the following command to clean up the Hadoop services:

$ source /u/local/bin/hdfsstop.sh

 
