Installing a Hadoop 3.2.0 multi-node cluster on CentOS

Submitted by admin on Sat, 02/09/2019 - 12:46

Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework.

The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code for nodes to process in parallel based on the data that needs to be processed.

Use the steps below to install a Hadoop 3.2.0 multi-node cluster on CentOS/RHEL.

Create Host File on Each Node

[root@hadoop3 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
<hadoop1 IP address> hadoop1
<hadoop2 IP address> hadoop2
<hadoop3 IP address> hadoop3
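Every node has to resolve all three hostnames, so it can help to check the hosts file before going further. The sketch below scans a hosts-format file and prints any cluster name that has no entry. The function name missing_hosts is made up for this illustration and is not part of any tool.

```shell
# missing_hosts FILE NAME...
# Prints every NAME that has no entry in the hosts-format FILE.
missing_hosts() (
    file=$1; shift
    for name in "$@"; do
        # Drop comments, then look for the name among the hostname columns.
        awk -v n="$name" '
            { sub(/#.*/, "") }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }
        ' "$file" || echo "$name"
    done
)

# Example, run on each node:
# missing_hosts /etc/hosts hadoop1 hadoop2 hadoop3
```

No output means every name was found.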


The corresponding roles are:

hadoop1 (nn1):  HDFS NameNode and YARN ResourceManager
hadoop2 (dn1):  HDFS DataNode and YARN NodeManager
hadoop3 (dn2):  HDFS DataNode and YARN NodeManager

Here hadoop1 is the master node, and hadoop2 and hadoop3 are the slave nodes. Later sections also refer to hadoop1 as node-master.

Install Java:-

yum install java

Create the user and group below:

useradd hadoop
groupadd hadoopgrp
gpasswd -a hadoop hadoopgrp

Set a password for the hadoop user:

passwd hadoop

Log in to the master node (hadoop1) as the hadoop user and generate an SSH key:

[hadoop@hadoop1 ~]$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/
The key fingerprint is:
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|                 |
|  . o            |
|   + . +         |
|    E + S        |
|   . * = .       |
|    + = =        |
|   . o O .       |
|    . +o=.       |

Distribute Authentication Key-pairs for the user hadoop to all the nodes:-

The master node will use an ssh-connection to connect to other nodes with key-pair authentication, to manage the cluster.

[hadoop@hadoop1 ~]$ ssh-copy-id -i $HOME/.ssh/
[hadoop@hadoop1 ~]$ ssh-copy-id -i $HOME/.ssh/
[hadoop@hadoop1 ~]$ ssh-copy-id -i $HOME/.ssh/
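The three copies can also be written as one loop. This is a sketch under two assumptions that are not confirmed above: that the public key sits in the default file id_rsa.pub, and that the remote user is hadoop on every node. The function name and the DRY_RUN switch are illustrative; by default the commands are only printed.

```shell
# copy_key_to_nodes HOST...
# Assumes the default key file name id_rsa.pub and the user name hadoop.
# With DRY_RUN=1 (the default here) the commands are printed, not executed.
copy_key_to_nodes() {
    for node in "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@$node"
        else
            ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "hadoop@$node"
        fi
    done
}

copy_key_to_nodes hadoop1 hadoop2 hadoop3
```

Once the printed commands look right, run it again with DRY_RUN=0 to actually copy the key.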


Check the path of the installed Java binary:

[root@hadoop3 ~]#  alternatives --config java

There are 2 programs which provide 'java'.

  Selection    Command
   1           /usr/java/jdk1.8.0_201-amd64/jre/bin/java
*+ 2           /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java

[hadoop@hadoop1 ~]$ update-alternatives --display java
java - status is auto.
 link currently points to /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java
/usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java - priority 1800191
 slave jre: /usr/lib/jvm/java-1.8.0-openjdk-
 slave jre_exports: /usr/lib/jvm-exports/jre-1.8.0-openjdk-
 slave jjs: /usr/lib/jvm/java-1.8.0-openjdk-
 slave keytool: /usr/lib/jvm/java-1.8.0-openjdk-
 slave orbd: /usr/lib/jvm/java-1.8.0-openjdk-
 slave pack200: /usr/lib/jvm/java-1.8.0-openjdk-
 slave rmid: /usr/lib/jvm/java-1.8.0-openjdk-
 slave rmiregistry: /usr/lib/jvm/java-1.8.0-openjdk-
 slave servertool: /usr/lib/jvm/java-1.8.0-openjdk-
 slave tnameserv: /usr/lib/jvm/java-1.8.0-openjdk-
 slave policytool: /usr/lib/jvm/java-1.8.0-openjdk-
 slave unpack200: /usr/lib/jvm/java-1.8.0-openjdk-
 slave java.1.gz: /usr/share/man/man1/java-java-1.8.0-openjdk-
 slave jjs.1.gz: /usr/share/man/man1/jjs-java-1.8.0-openjdk-
 slave keytool.1.gz: /usr/share/man/man1/keytool-java-1.8.0-openjdk-
 slave orbd.1.gz: /usr/share/man/man1/orbd-java-1.8.0-openjdk-
 slave pack200.1.gz: /usr/share/man/man1/pack200-java-1.8.0-openjdk-
 slave rmid.1.gz: /usr/share/man/man1/rmid-java-1.8.0-openjdk-
 slave rmiregistry.1.gz: /usr/share/man/man1/rmiregistry-java-1.8.0-openjdk-
 slave servertool.1.gz: /usr/share/man/man1/servertool-java-1.8.0-openjdk-
 slave tnameserv.1.gz: /usr/share/man/man1/tnameserv-java-1.8.0-openjdk-
 slave policytool.1.gz: /usr/share/man/man1/policytool-java-1.8.0-openjdk-
 slave unpack200.1.gz: /usr/share/man/man1/unpack200-java-1.8.0-openjdk-
Current `best' version is /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java.

Take the value of the current link and remove the trailing /bin/java.

The result, /usr/lib/jvm/jre-1.8.0-openjdk.x86_64, is the value to use for JAVA_HOME.
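Stripping the suffix can be done with shell parameter expansion instead of by hand. A minimal sketch follows; the function name java_home_from_link is hypothetical.

```shell
# java_home_from_link PATH
# Prints PATH with a trailing /bin/java removed.
java_home_from_link() {
    printf '%s\n' "${1%/bin/java}"
}

java_home_from_link /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java

# On a live node the binary can be resolved first. readlink -f follows every
# symlink, so the result may be a longer, version-specific directory:
# java_home_from_link "$(readlink -f "$(command -v java)")"
```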

The jps tool, used later to check which Hadoop daemons are running, ships with the JDK development package rather than the JRE.

Install the package below.

[root@hadoop1 ~]# yum list java*devel*
Loaded plugins: fastestmirror, security
Loading mirror speeds from cached hostfile
 * base:
 * epel:
 * extras:
 * rpmforge:
 * updates:
Available Packages
java-1.5.0-gcj-devel.x86_64                                                                                                            base
java-1.6.0-openjdk-devel.x86_64                                                      1:                                                  base
java-1.7.0-openjdk-devel.x86_64                                                      1:                                                 updates
java-1.8.0-openjdk-devel.x86_64                                                      1:                                                    updates
java-1.8.0-openjdk-devel-debug.x86_64                                                1:                                                    updates
[root@hadoop1 ~]# yum install java-1.8.0-openjdk-devel.x86_64 -y

This walkthrough was run on the following release:

[root@hadoop3 ~]# cat /etc/redhat-release
CentOS release 6.10 (Final)

Switch to the hadoop user and download Hadoop:

[root@hadoop1 ~]# su - hadoop
[hadoop@hadoop1 ~]$ wget
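Old Hadoop releases stay available on the Apache archive server. The sketch below builds the tarball URL for a given version, assuming that server's usual directory layout; it is not necessarily the mirror used in this walkthrough, and the function name hadoop_url is illustrative.

```shell
# hadoop_url VERSION
# Prints the archive.apache.org URL of the binary tarball for VERSION.
hadoop_url() {
    printf 'https://archive.apache.org/dist/hadoop/common/hadoop-%s/hadoop-%s.tar.gz\n' "$1" "$1"
}

hadoop_url 3.2.0

# To download and unpack on hadoop1, giving the same ~/hadoop layout that the
# slave nodes get later:
# wget "$(hadoop_url 3.2.0)"
# tar -xzf hadoop-3.2.0.tar.gz
# mv hadoop-3.2.0 hadoop
```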

Set Environment Variables

[hadoop@hadoop3 ~]$ cat /home/hadoop/.bashrc
# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi

# User specific aliases and functions
export HADOOP_HOME=/home/hadoop/hadoop

[hadoop@hadoop3 ~]$ source  /home/hadoop/.bashrc
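Later steps call hdfs, yarn and the start/stop scripts by bare name, so the Hadoop bin and sbin directories need to be on the PATH. The PATH line below is an assumed addition to ~/.bashrc and is not part of the listing above.

```shell
# Assumed additions to ~/.bashrc for the hadoop user.
export HADOOP_HOME=/home/hadoop/hadoop
# bin holds hdfs and yarn; sbin holds the cluster start/stop scripts.
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
```

After sourcing the file again, command -v hdfs should print a path under /home/hadoop/hadoop/bin.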

Set JAVA_HOME on all the nodes:-

[hadoop@hadoop3 ~]$ cat ~/hadoop/etc/hadoop/ | grep -i export
export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64

export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop

Set the NameNode location on all the nodes, using the master server's IP address:

[hadoop@hadoop3 ~]$ cat ~/hadoop/etc/hadoop/core-site.xml

<!-- Put site-specific property overrides in this file. -->
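The NameNode location is set through the fs.defaultFS property. As a sketch, the function below writes a minimal core-site.xml to a scratch path so it can be reviewed before being merged into the real file. Port 9000 is an assumption (any free port works as long as every node uses the same one), and the function name and output path are illustrative.

```shell
# write_core_site NAMENODE_HOST OUTPUT_FILE
# NAMENODE_HOST may be a hostname or an IP address.
# Port 9000 is an assumption of this sketch.
write_core_site() {
    cat > "$2" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://$1:9000</value>
  </property>
</configuration>
EOF
}

write_core_site hadoop1 "${TMPDIR:-/tmp}/core-site.xml.example"
cat "${TMPDIR:-/tmp}/core-site.xml.example"
```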


Set path for HDFS

[hadoop@hadoop3 ~]$ cat ~/hadoop/etc/hadoop/hdfs-site.xml





The last property, dfs.replication, sets how many copies of each block are kept in the cluster.
Set it to 2 to have all the data duplicated on the two slave nodes.
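A sketch of an hdfs-site.xml with dfs.replication as its last property follows. The two storage directories are placeholders chosen for this example, not paths confirmed by this walkthrough, and the function name and output path are illustrative.

```shell
# write_hdfs_site REPLICATION OUTPUT_FILE
# The name and data directory paths are placeholders for this sketch.
write_hdfs_site() {
    cat > "$2" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/data/nameNode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/data/dataNode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>$1</value>
  </property>
</configuration>
EOF
}

write_hdfs_site 2 "${TMPDIR:-/tmp}/hdfs-site.xml.example"
```

A replication factor higher than the number of DataNodes cannot be satisfied, which is why 2 fits a cluster with two slave nodes.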

Set YARN as Job Scheduler
[hadoop@hadoop3 ~]$ cat  ~/hadoop/etc/hadoop/mapred-site.xml
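Telling MapReduce to run on YARN comes down to one property, mapreduce.framework.name. The sketch below writes a minimal mapred-site.xml to a scratch path. The function name and output path are illustrative, and a real cluster may need further properties.

```shell
# write_mapred_site OUTPUT_FILE
# Minimal file: only selects YARN as the MapReduce execution framework.
write_mapred_site() {
    cat > "$1" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF
}

write_mapred_site "${TMPDIR:-/tmp}/mapred-site.xml.example"
```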





Configure YARN

[hadoop@hadoop3 ~]$ cat  ~/hadoop/etc/hadoop/yarn-site.xml




  <value> $HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,
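Besides a classpath setting, a yarn-site.xml for this layout usually names the ResourceManager host and enables the MapReduce shuffle service on the NodeManagers. The sketch below writes just those two properties to a scratch path. It is an assumption about what a minimal file looks like, not a reconstruction of the file used here, and the function name and output path are illustrative.

```shell
# write_yarn_site RESOURCEMANAGER_HOST OUTPUT_FILE
# Writes only two commonly needed properties; merge by hand with any
# existing settings such as the application classpath.
write_yarn_site() {
    cat > "$2" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>$1</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF
}

write_yarn_site hadoop1 "${TMPDIR:-/tmp}/yarn-site.xml.example"
```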






Configure Slaves

Make entries for the slave nodes in the following two configuration files:

[hadoop@hadoop1 hadoop]$ cat /home/hadoop/hadoop/etc/hadoop/slaves
[hadoop@hadoop1 hadoop]$ cat /home/hadoop/hadoop/etc/hadoop/workers
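Both files list one worker hostname per line. Hadoop 3.x reads the workers file; slaves was the name used by Hadoop 2.x. As a sketch, the helper below writes such a list to a scratch path. The function name and output path are illustrative.

```shell
# write_workers OUTPUT_FILE HOST...
# Writes one worker hostname per line.
write_workers() (
    out=$1; shift
    printf '%s\n' "$@" > "$out"
)

write_workers "${TMPDIR:-/tmp}/workers.example" hadoop2 hadoop3
cat "${TMPDIR:-/tmp}/workers.example"
```

On the master node the real target would be /home/hadoop/hadoop/etc/hadoop/workers.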

Copy the Hadoop binaries and config files to all the nodes:

cd /home/hadoop/
scp hadoop-*.tar.gz hadoop2:/home/hadoop
scp hadoop-*.tar.gz hadoop3:/home/hadoop

Connect to hadoop2 and hadoop3 in turn and run the commands below on each. They unpack the binaries and rename the directory; afterwards, exit the slave node to get back to hadoop1:

tar -xzf hadoop-3.2.0.tar.gz
mv hadoop-3.2.0 hadoop

Copy the Hadoop configuration files to the slave nodes from hadoop1 node:

for node in hadoop2 hadoop3; do
    scp ~/hadoop/etc/hadoop/* $node:/home/hadoop/hadoop/etc/hadoop/
done

Format HDFS
HDFS needs to be formatted like any classical file system. On node-master, run the following command:

hdfs namenode -format
Your Hadoop installation is now configured and ready to run.

Run and monitor HDFS
This section will walk through starting HDFS on NameNode and DataNodes, and monitoring that everything is properly working and interacting with HDFS data.

Start and Stop HDFS
Start HDFS by running the following script from node-master (hadoop1):

start-dfs.sh
It’ll start NameNode and SecondaryNameNode on hadoop1, and DataNode on hadoop2 and hadoop3, according to the configuration in the workers file.

Check that every process is running with the jps command on each node. You should get on node-master (PID will be different):

21922 Jps
21603 NameNode
21787 SecondaryNameNode
and on hadoop2 and hadoop3:

19728 DataNode
19819 Jps
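Checking the jps output by eye gets tedious across several nodes. The sketch below tests jps-style output for one daemon name. It compares the whole second column, because a plain substring search for NameNode would also match SecondaryNameNode. The function name has_daemon is hypothetical.

```shell
# has_daemon NAME
# Reads jps-style lines ("PID Name") on stdin; succeeds if NAME is listed.
has_daemon() {
    awk -v d="$1" '$2 == d { found = 1 } END { exit !found }'
}

printf '21922 Jps\n21603 NameNode\n21787 SecondaryNameNode\n' |
    has_daemon NameNode && echo "NameNode is up"

# On a live node:
# jps | has_daemon DataNode || echo "DataNode is not running"
```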
To stop HDFS on master and slave nodes, run the following command from node-master hadoop1:

stop-dfs.sh

Monitor your HDFS Cluster

You can get useful information about running your HDFS cluster with the hdfs dfsadmin command. Try for example:

hdfs dfsadmin -report
This will print information (e.g., capacity and usage) for all running DataNodes. To get the description of all available commands, type:

hdfs dfsadmin -help

The NameNode web UI is accessible on port 9870 of the master node, which is the default in Hadoop 3.x.

Put and Get Data to HDFS
Reading from and writing to HDFS is done with the hdfs dfs command. First, manually create your home directory. All other commands will use a path relative to this default home directory:

hdfs dfs -mkdir -p /user/hadoop
Let’s use some textbooks from the Gutenberg project as an example.

Create a books directory in HDFS. The following command will create it in the home directory, /user/hadoop/books:

hdfs dfs -mkdir books
Grab a few books from the Gutenberg project:

cd /home/hadoop
wget -O alice.txt
wget -O holmes.txt
wget -O frankenstein.txt

Put the three books into HDFS, in the books directory:

hdfs dfs -put alice.txt holmes.txt frankenstein.txt books
List the contents of the books directory:

hdfs dfs -ls books
Copy one of the books to the local filesystem:

hdfs dfs -get books/alice.txt
You can also directly print the books from HDFS:

hdfs dfs -cat books/alice.txt
There are many commands to manage your HDFS. For a complete list, you can look at the Apache HDFS shell documentation, or print help with:

hdfs dfs -help
HDFS is a distributed storage system; it doesn’t provide any services for running and scheduling tasks in the cluster. This is the role of the YARN framework. The following section is about starting, monitoring, and submitting jobs to YARN.

Start and Stop YARN
Start YARN with the script:

start-yarn.sh
Check that everything is running with the jps command. In addition to the previous HDFS daemons, you should see a ResourceManager on node-master, and a NodeManager on hadoop2 and hadoop3.

To stop YARN, run the following command on node-master:

stop-yarn.sh
Monitor YARN
The yarn command provides utilities to manage your YARN cluster. You can also print a report of running nodes with the command:

yarn node -list
Similarly, you can get a list of running applications with command:

yarn application -list
To get all available parameters of the yarn command, see Apache YARN documentation.

As with HDFS, YARN provides a friendlier web UI, started by default on port 8088 of the ResourceManager.
Point your browser to port 8088 on the ResourceManager host (hadoop1 here) to browse the UI.

Submit MapReduce Jobs to YARN
YARN jobs are packaged into jar files and submitted to YARN for execution with the command yarn jar. The Hadoop installation package provides sample applications that can be run to test your cluster. You’ll use them to run a word count on the three books previously uploaded to HDFS.

Submit a job with the sample jar to YARN. On node-master, run:

yarn jar /home/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar wordcount "books/*" output

The last argument is the HDFS directory where the job's output will be saved.

After the job is finished, you can get the result by querying HDFS with hdfs dfs -ls output. In case of a success, the output will resemble:

Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2017-10-11 14:09 output/_SUCCESS
-rw-r--r--   1 hadoop supergroup     269158 2017-10-11 14:09 output/part-r-00000
Print the result with:

hdfs dfs -cat output/part-r-00000
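The raw result is one long list of words, each followed by a tab and its count. To see only the most frequent words, the output can be piped through sort and head. A minimal sketch, with a hypothetical helper name top_words:

```shell
# top_words [N]
# Reads "word<TAB>count" lines on stdin and prints the N lines with the
# highest counts (10 by default).
top_words() {
    sort -t "$(printf '\t')" -k2,2nr | head -n "${1:-10}"
}

# Small self-contained demonstration:
printf 'rabbit\t5\nthe\t30\nalice\t12\n' | top_words 2

# Against the job output:
# hdfs dfs -cat output/part-r-00000 | top_words 20
```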

That's it!

