This chapter explains how to set up Hadoop to run on a cluster of machines. Running HDFS and MapReduce on a single machine is great for learning about these systems, but to do useful work they need to run on multiple nodes.
Hadoop 2, also known as YARN, is the new version of Hadoop. It adds the YARN resource manager alongside the existing HDFS and MapReduce components. Hadoop MapReduce is a programming model and software framework for writing applications; it is an open-source variant of MapReduce, which was originally designed and implemented at Google for processing and generating large data sets. HDFS is Hadoop’s underlying data-persistence layer, loosely modeled after the Google File System (GFS). Many cloud computing services, such as Amazon Elastic MapReduce (EMR), provide hosted MapReduce functions. Although MapReduce has its limitations, it is an important framework for processing large data sets.
This tutorial introduces how to set up a Hadoop 2.x (YARN) environment on a cluster: one node runs as the NameNode and the ResourceManager, and the other nodes run as NodeManagers and DataNodes (the slaves).
Enable password-less SSH login from the “hadoop” user to the slaves
For convenience, make sure the “hadoop” user on the NameNode and ResourceManager machine can SSH to the slaves without a password, so that we need not type the password every time.
Details about password-less SSH login can be found in Enabling Password-less ssh Login.
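As a minimal sketch (assuming the slave hostnames are slave1 and slave2; substitute your own), generate a key pair on the NameNode/ResourceManager machine and copy the public key to each slave:
hadoop@namenode:~$ ssh-keygen -t rsa -P ""
hadoop@namenode:~$ ssh-copy-id hadoop@slave1
hadoop@namenode:~$ ssh-copy-id hadoop@slave2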
Install software needed by Hadoop
Besides Hadoop itself, the only software needed to install Hadoop is Java (we use the JDK here).
Install the Java JDK on Ubuntu
The Oracle Java JDK can be downloaded from the JDK webpage, or OpenJDK can be installed from the Ubuntu repositories, as we do here. You need to install (actually, just copying the JDK directory is enough) the Java JDK on all nodes of the Hadoop cluster.
user@ubuntuvm:~$ sudo apt-get install openjdk-7-jdk
As an example in this tutorial, the JDK is installed into
/usr/lib/jvm/java-7-openjdk-i386
You may need to make a soft link at /usr/java/default pointing to the actual location where you installed the JDK.
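For example, with the install location above:
user@ubuntuvm:~$ sudo mkdir -p /usr/java
user@ubuntuvm:~$ sudo ln -s /usr/lib/jvm/java-7-openjdk-i386 /usr/java/default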
Add these 2 lines to the “hadoop” user’s ~/.bashrc on all nodes:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
export PATH=$JAVA_HOME/bin:$PATH
Hadoop 2.x.x Configuration
Step 1. The Hadoop software can be downloaded from the Apache Hadoop website. In this tutorial, we use Hadoop 2.5.2.
You can unpack the tarball to a directory. In this example, we unpack it to
/home/user/hadoop-2.5.2
which is a directory under the hadoop Linux user’s home directory.
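For example (the archive URL below is one possible source; mirrors may vary):
user@ubuntuvm:~$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.5.2/hadoop-2.5.2.tar.gz
user@ubuntuvm:~$ tar -xzf hadoop-2.5.2.tar.gz -C $HOME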
The Hadoop directory needs to be duplicated to all nodes, but only after the configuration below is complete; remember to do it then.
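One way to do this, sketched here for a hypothetical slave named slave1 (repeat for each slave):
user@ubuntuvm:~$ rsync -a ~/hadoop-2.5.2 hadoop@slave1:~/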
Step 2. Configure environment variables for the “hadoop” user
We assume the “hadoop” user uses bash as its shell.
Add these lines at the bottom of ~/.bashrc on all nodes:
In a terminal, open the file in an editor, for example:
user@ubuntuvm:~$ gedit ~/.bashrc
Step 3. Put the following lines into the ~/.bashrc file:
export HADOOP_HOME=$HOME/hadoop-2.5.2
export HADOOP_CONF_DIR=$HOME/hadoop-2.5.2/etc/hadoop
export HADOOP_MAPRED_HOME=$HOME/hadoop-2.5.2
export HADOOP_COMMON_HOME=$HOME/hadoop-2.5.2
export HADOOP_HDFS_HOME=$HOME/hadoop-2.5.2
export YARN_HOME=$HOME/hadoop-2.5.2
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
export PATH=$PATH:$HOME/hadoop-2.5.2/bin:$HOME/hadoop-2.5.2/sbin
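Reload ~/.bashrc so the variables take effect in the current shell, and check one of them:
user@ubuntuvm:~$ source ~/.bashrc
user@ubuntuvm:~$ echo $HADOOP_HOME
/home/user/hadoop-2.5.2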
Step 4. Configure the important Hadoop 2.5.2 files
For our installation, the configuration files for Hadoop are under /home/user/hadoop-2.5.2/etc/hadoop. In each .xml file below, the content shown is added between the <configuration> and </configuration> tags.
i- core-site.xml
Here the NameNode runs on localhost. (fs.default.name is the old name of this property; Hadoop 2 prefers fs.defaultFS, but the old name still works.)
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
ii- yarn-site.xml
The YARN ResourceManager runs on localhost, and the NodeManagers are configured with the MapReduce shuffle handler as an auxiliary service.
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
iii- hdfs-site.xml
The configuration here is optional; add the following settings if you need them. The descriptions explain the purpose of each property.
1. The first property, dfs.replication, sets the number of replicas kept of each block.
2. The second property, dfs.namenode.name.dir, names the NameNode directory; you have to create the directory “/home/user/hadoop-2.5.2/hadoop2_data/hdfs/namenode” yourself (see the commands after the configuration below).
3. The third property, dfs.datanode.data.dir, names the DataNode directory; you have to create “/home/user/hadoop-2.5.2/hadoop2_data/hdfs/datanode” as well.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/user/hadoop-2.5.2/hadoop2_data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/user/hadoop-2.5.2/hadoop2_data/hdfs/datanode</value>
  </property>
</configuration>
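Create the two directories referred to above before formatting the NameNode, if you have not already:
user@ubuntuvm:~$ mkdir -p /home/user/hadoop-2.5.2/hadoop2_data/hdfs/namenode
user@ubuntuvm:~$ mkdir -p /home/user/hadoop-2.5.2/hadoop2_data/hdfs/datanode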
iv- mapred-site.xml
First copy mapred-site.xml.template to mapred-site.xml and add the following content.
user@ubuntuvm:~/hadoop-2.5.2/etc/hadoop$ cp mapred-site.xml.template mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
v- hadoop-env.sh. Here we set the Java environment variable.
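In /home/user/hadoop-2.5.2/etc/hadoop/hadoop-env.sh, point JAVA_HOME at the same JDK location used above:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386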
Step 5. After configuring all five important files, you have to format the NameNode with the following command in the Ubuntu terminal.
user@ubuntuvm:~$ cd hadoop-2.5.2/bin/
user@ubuntuvm:~/hadoop-2.5.2/bin$ ./hadoop namenode -format
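If formatting succeeds, the output should include a line saying that the storage directory (the dfs.namenode.name.dir configured above) has been successfully formatted.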
Step 6. After formatting the NameNode, we have to start all the daemon services of Hadoop 2.5.2.
First move to /home/user/hadoop-2.5.2/sbin:
user@ubuntuvm:~$ cd hadoop-2.5.2/sbin/
i- datanode daemon service
user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./hadoop-daemon.sh start datanode
ii- namenode daemon service
user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./hadoop-daemon.sh start namenode
iii- resourcemanager daemon service
user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./yarn-daemon.sh start resourcemanager
iv- nodemanager daemon service
user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./yarn-daemon.sh start nodemanager
v- jobhistoryserver daemon service
user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./mr-jobhistory-daemon.sh start historyserver
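Alternatively, the wrapper scripts in the same sbin directory start the HDFS and YARN daemons in one step each (the history server still needs the command above):
user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./start-dfs.sh
user@ubuntuvm:~/hadoop-2.5.2/sbin$ ./start-yarn.sh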
Step 7. To verify that all the daemon services are running, enter the following command at the terminal:
user@ubuntuvm:~/hadoop-2.5.2/sbin$ jps
3869 DataNode
4067 ResourceManager
4318 NodeManager
4449 JobHistoryServer
4934 NameNode
5389 Jps
If some of the services have not started, check the logs in the logs folder of hadoop-2.5.2 at the following location:
“/home/user/hadoop-2.5.2/logs/”
Here you can check the log files for each daemon:
hadoop-user-namenode-ubuntuvm.log
hadoop-user-namenode-ubuntuvm.out
hadoop-user-datanode-ubuntuvm.log
hadoop-user-datanode-ubuntuvm.out
etc.
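To see the most recent errors quickly, you can tail the relevant log file, for example:
user@ubuntuvm:~$ tail -n 50 /home/user/hadoop-2.5.2/logs/hadoop-user-namenode-ubuntuvm.log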
Step 8. To verify all the services in a browser, go to Firefox in the virtual machine and open the following URL (the NameNode web UI):
“http://localhost:50070/”
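The YARN ResourceManager has its own web UI as well, which by default listens on port 8088:
“http://localhost:8088/”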
We can check two more files as well, both in the same /home/user/hadoop-2.5.2/etc/hadoop directory:
1. slaves-
localhost
For a real cluster, delete localhost and add the names of all the slave nodes (the machines running a DataNode and NodeManager; the TaskTrackers of Hadoop 1), each on its own line. For example:
hofstadter
snell
2. masters-
localhost