
Hadoop Installation Tutorial (Hadoop 1.x)

Software Required
Set Up a Virtual Machine

Step 1. Go to traffictool.net, open the Ubuntu section (Ubuntu1404), download the virtual machine image, and extract it.

Step 2. Suppose the directory after extraction is
"D:\personal data\hadoop\Ubuntu1404"

Step 3. Go to Google, search for VMware Player, select the DESKTOP & END USER result, download it, and install it.

Step 4. After installing the virtual machine player, go to "D:\personal data\hadoop\Ubuntu1404".

Step 5. Double-click "Ubuntu.vmx"; the virtual machine starts and the Ubuntu desktop opens.

Step 6. From inside the virtual machine, download the Hadoop release hadoop-1.2.1 (about 61 MB) and extract "hadoop-1.2.1.tar.gz".

Step 7. In this tutorial, Hadoop is installed in the following location:

/home/user/hadoop-1.2.1
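
A minimal sketch of the download and extraction in Steps 6 and 7, assuming the Apache archive URL for the 1.2.1 release (any Apache mirror also works) and the /home/user target directory used in this tutorial:

$ cd /home/user
$ # Download the release from the Apache archive (URL is an assumption; use any mirror you prefer)
$ wget https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
$ # Unpack into /home/user/hadoop-1.2.1
$ tar -xzf hadoop-1.2.1.tar.gz
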
Step 8. Install Java on Linux:
sudo apt-get install openjdk-7-jdk

Step 9. In this tutorial, the JDK is installed in the following location:
/usr/lib/jvm/java-7-openjdk-i386
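
To confirm where the JDK actually landed before configuring Hadoop, a quick check such as the following can help (resolving the java binary's real path is just one way to find the install directory):

$ java -version
$ # Resolve the real path of the java binary; the JDK directory is its parent
$ readlink -f $(which java)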

Hadoop mainly consists of two parts: Hadoop MapReduce and HDFS. Hadoop MapReduce is a programming model and software framework for writing applications; it is an open-source variant of MapReduce, which was originally designed and implemented by Google for processing and generating large data sets. HDFS is Hadoop's underlying data persistence layer, loosely modelled after the Google File System (GFS). Hadoop has seen active development and increasing adoption. Many cloud computing services, such as Amazon EC2, provide MapReduce functions, and the research community uses MapReduce and Hadoop to solve data-intensive problems in bioinformatics, computational finance, chemistry, and environmental science. Although MapReduce has its limitations, it is an important framework for processing large data sets.

This tutorial introduces how to set up a Hadoop environment on a cluster. We set up a Hadoop cluster in which one node runs as the NameNode, one node runs as the JobTracker, and many nodes run as TaskTrackers (slaves).

Step 10. Enable password-less SSH login from the "hadoop" user to the slaves.
For convenience, make sure the "hadoop" user on the NameNode and the JobTracker can ssh to the slaves without a password, so that we do not need to type the password every time.

Details about password-less SSH login can be found in Enabling Password-less ssh Login.
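
A minimal sketch of the usual key-based setup, assuming the "hadoop" user and a slave host named slave1 (the hostname is a placeholder):

$ # Run as the "hadoop" user on the NameNode/JobTracker: generate a key pair with no passphrase
$ ssh-keygen -t rsa -P ""
$ # Copy the public key to each slave so ssh stops prompting for a password
$ ssh-copy-id hadoop@slave1
$ # Verify: this should log in without asking for a password
$ ssh hadoop@slave1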

Step 11. Hadoop Configuration
Configure the environment variables of the "hadoop" user.
Open a terminal and set the environment variables as follows:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
export HADOOP_INSTALL=/home/user/hadoop-1.2.1
and assign the Hadoop path as follows:
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export PATH=$HADOOP_COMMON_HOME/bin:$PATH

The HADOOP_COMMON_HOME environment variable is used by Hadoop's utility scripts and must be set; otherwise the scripts may report the error message "Hadoop common not found".

The second line adds Hadoop's bin directory to the PATH so that we can run Hadoop commands directly without specifying the full path to them.
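
To make these settings survive new shell sessions, one option (an assumption about your shell setup, not a requirement of the tutorial) is to append them to the "hadoop" user's ~/.bashrc:

$ echo 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386' >> ~/.bashrc
$ echo 'export HADOOP_COMMON_HOME=/home/user/hadoop-1.2.1' >> ~/.bashrc
$ echo 'export PATH=$HADOOP_COMMON_HOME/bin:$PATH' >> ~/.bashrc
$ source ~/.bashrc
$ # Verify that the hadoop command is now on the PATH
$ which hadoop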

Step 12. Configure the important Hadoop files
A. /home/user/hadoop-1.2.1/conf/hadoop-env.sh
Add or change these lines to specify JAVA_HOME and the directory where the logs are stored:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
export HADOOP_LOG_DIR=/home/user/hadoop-1.2.1/logs

B. /home/user/hadoop-1.2.1/conf/core-site.xml (configuring NameNode)
Here the NameNode runs on localhost (127.0.0.1).

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

C. /home/user/hadoop-1.2.1/conf/hdfs-site.xml (Configuring DataNode)
dfs.replication is the number of replicas of each block. dfs.name.dir is the path on the local filesystem where the NameNode stores the namespace and transaction logs persistently. dfs.data.dir is a comma-separated list of paths on the local filesystem of a DataNode where it stores its blocks. (A sketch for creating these local directories follows the configuration block.)

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.http.address</name>
    <value>localhost:50070</value>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>

  <property>
    <name>dfs.name.dir</name>
    <value>/lhome/hadoop/data/dfs/name/</value>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/lhome/hadoop/data/dfs/data/</value>
  </property>
</configuration>
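
Before starting HDFS, it helps to make sure the local directories referenced above exist and are writable by the user that runs the daemons. A minimal sketch, assuming a "hadoop" user and the /lhome paths used in this example:

$ # On the NameNode
$ sudo mkdir -p /lhome/hadoop/data/dfs/name
$ # On each DataNode
$ sudo mkdir -p /lhome/hadoop/data/dfs/data
$ # Give ownership to the user that runs the Hadoop daemons
$ sudo chown -R hadoop:hadoop /lhome/hadoop/data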

D. /home/user/hadoop-1.2.1/conf/mapred-site.xml (Configuring JobTracker)
In this example, the JobTracker address is set to 10.1.1.2:9001; on a single-node setup you can use localhost instead.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>localhost:10020</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>10.1.1.2:9001</value>
  </property>

  <property>
    <name>mapred.system.dir</name>
    <value>/hadoop/data/mapred/system/</value>
  </property>

  <property>
    <name>mapred.local.dir</name>
    <value>/lhome/hadoop/data/mapred/local/</value>
  </property>
</configuration>

mapred.job.tracker is the host (or IP) and port of the JobTracker. mapred.system.dir is the path on HDFS where the Map/Reduce framework stores system files. mapred.local.dir is a comma-separated list of paths on the local filesystem where temporary MapReduce data is written.

E. /home/user/hadoop-1.2.1/conf/slaves

Delete localhost and add the hostnames of all the TaskTrackers, each on its own line; the names must be resolvable from the master nodes (see the sketch after the list). For example:
jobtrackname1
jobtrackname2
jobtrackname3
jobtrackname4
jobtrackname5
jobtrackname6
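
If the slave names are not already in DNS, one way to make them resolvable (a sketch with placeholder IP addresses, not values from this tutorial) is to add entries to /etc/hosts on every node:

# /etc/hosts entries (IP addresses are placeholders)
10.1.1.11  jobtrackname1
10.1.1.12  jobtrackname2
10.1.1.13  jobtrackname3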

F. Start Hadoop
We need to start both HDFS and MapReduce to start Hadoop.

1. Format a new HDFS
On the NameNode:
$ hadoop namenode -format
Remember to delete HDFS's local files on all nodes before re-formatting it:
$ rm /home/hadoop/data /tmp/hadoop-hadoop -rf

2. Start HDFS
On the NameNode:

$ start-dfs.sh

3. Check the HDFS status:
On the NameNode:

$ hadoop dfsadmin -report
Fewer nodes may be listed in the report than we actually have, because DataNodes can take a while to register with the NameNode; wait a moment and run the command again.
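
If nodes are still missing, one quick check (assuming the jps tool that ships with the JDK) is to verify that the expected daemons are running on each node:

$ # On the NameNode: expect a NameNode process (and SecondaryNameNode, if configured)
$ jps
$ # On each slave: expect a DataNode process
$ jps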

4. Start MapReduce:
On the JobTracker:

$ start-mapred.sh

5. Check the job status:

$ hadoop job -list
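
To confirm that MapReduce works end to end, you can run one of the examples bundled with the release. This is a sketch assuming the examples jar shipped with Hadoop 1.2.1 and placeholder input/output paths on HDFS:

$ # Put a local file on HDFS as input (paths are placeholders)
$ hadoop fs -mkdir /user/hadoop/input
$ hadoop fs -put /home/user/hadoop-1.2.1/README.txt /user/hadoop/input/
$ # Run the bundled wordcount example
$ hadoop jar /home/user/hadoop-1.2.1/hadoop-examples-1.2.1.jar wordcount /user/hadoop/input /user/hadoop/output
$ # Inspect the result
$ hadoop fs -cat /user/hadoop/output/part-r-00000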

Shut down the Hadoop cluster

We can stop Hadoop when we no longer use it.

Stop HDFS on NameNode:

$ stop-dfs.sh

Stop JobTracker and TaskTrackers on JobTracker:

$ stop-mapred.sh
