In Part 1 we successfully created, launched, and connected to Amazon Ubuntu instances. In Part 2 I will show you how to install and set up a Hadoop cluster. If you are seeing this page for the first time, I strongly advise you to go over Part 1 first.
In this article:
- HadoopNameNode will be referred to as master,
- HadoopSecondaryNameNode will be referred to as SecondaryNameNode or SNN,
- HadoopSlave1 and HadoopSlave2 will be referred to as slaves (where the data nodes will reside).
So, let’s begin.
1. Apache Hadoop Installation and Cluster Setup
1.1 Update the packages and dependencies.
Let’s update the packages. I will start with the master; repeat this for SNN and the 2 slaves.
$ sudo apt-get update
Once it’s complete, let’s install Java.
1.2 Install Java
Add the following PPA and install the latest Oracle Java (JDK) 7 on Ubuntu:
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update && sudo apt-get install oracle-java7-installer
Check if Ubuntu uses JDK 7
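If the installation went through, the standard version check should report a 1.7.x version:
$ java -version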
Repeat this for SNN and 2 slaves.
1.3 Download Hadoop
I am going to use the Hadoop 1.2.1 stable version from the Apache download page, and here is the 1.2.1 mirror.
Issue the wget command from the shell:
$ wget http://apache.mirror.gtcomm.net/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
Unzip the files and review the package content and configuration files.
$ tar -xzvf hadoop-1.2.1.tar.gz
For ease of operation and maintenance, rename the ‘hadoop-1.2.1’ directory to ‘hadoop’.
$ mv hadoop-1.2.1 hadoop
1.4 Setup Environment Variables
Set up environment variables for the ‘ubuntu’ user.
Update the .bashrc file to add important Hadoop paths and directories.
Navigate to the home directory
$ cd
Open the .bashrc file in the vi editor
$ vi .bashrc
Add the following at the end of the file
export HADOOP_CONF=/home/ubuntu/hadoop/conf
export HADOOP_PREFIX=/home/ubuntu/hadoop
#Set JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
# Add Hadoop bin/ directory to path
export PATH=$PATH:$HADOOP_PREFIX/bin
Save and Exit.
To check whether it has been updated correctly, reload the bash profile.
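For example (standard bash; $HADOOP_PREFIX is one of the variables we just exported):
$ source ~/.bashrc
$ echo $HADOOP_PREFIX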
1.5 Setup Password-less SSH on Servers
- ‘ssh-agent’ is a background program that handles passwords for SSH private keys.
- ‘ssh-add’ command prompts the user for a private key password and adds it to the list maintained by ssh-agent. Once you add a password to ssh-agent, you will not be asked to provide the key when using SSH or SCP to connect to hosts with your public key.
Amazon EC2 has already taken care of ‘authorized_keys’ on the master server; execute the following commands to allow password-less SSH access to the slave servers.
First of all, we need to protect our keypair files. If the file permissions are too open, you will get an error (typically a “WARNING: UNPROTECTED PRIVATE KEY FILE!” message from SSH).
To fix this problem, we need to issue the following commands:
$ chmod 644 authorized_keys
Quick Tip: If you set the permissions to ‘chmod 644’, you get a file that can be written by you, but can only be read by the rest of the world.
$ chmod 400 hadoopec2cluster.pem
Quick Tip: chmod 400 is a very restrictive setting, giving the file owner read-only access: no write/execute capabilities for the owner, and no permissions whatsoever for anyone else.
To use ssh-agent and ssh-add, follow the steps below:
- At the Unix prompt, enter: eval `ssh-agent`
  Note: Make sure you use the backquote (`), located under the tilde (~), rather than the single quote (').
- Enter the command: ssh-add hadoopec2cluster.pem
If you notice, the .pem file now has “read-only” permission, and this time it works for us.
Remote SSH
Let’s verify that we can connect to the SNN and slave nodes from the master.
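For example, from the master’s shell (these are the SNN and HadoopSlave1 public DNS names used elsewhere in this article; substitute your own, and type exit to return to the master):
$ ssh ubuntu@ec2-54-209-221-47.compute-1.amazonaws.com
$ ssh ubuntu@ec2-54-209-223-7.compute-1.amazonaws.com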
1.6 Hadoop Cluster Setup
This section covers the Hadoop cluster configuration. We will have to modify:
- hadoop-env.sh – This file contains some environment variable settings used by Hadoop. You can use these to affect some aspects of Hadoop daemon behavior, such as where log files are stored, the maximum amount of heap used, etc. The only variable you should need to change at this point in this file is JAVA_HOME, which specifies the path to the Java 1.7.x installation used by Hadoop.
- core-site.xml – key property fs.default.name for NameNode configuration, e.g. hdfs://namenode/
- hdfs-site.xml – key property dfs.replication, by default 3
- mapred-site.xml – key property mapred.job.tracker for JobTracker configuration, e.g. jobtracker:8021
We will first start with the master (NameNode) and then copy the above XML changes to the remaining 3 nodes (SNN and slaves).
Finally, in section 1.6.2 we will configure conf/masters and conf/slaves.
- masters – defines on which machines Hadoop will start secondary NameNodes in our multi-node cluster.
- slaves – defines the list of hosts, one per line, where the Hadoop slave daemons (DataNodes and TaskTrackers) will run.
Let’s go over them one by one, starting with the master (NameNode).
hadoop-env.sh
$ vi $HADOOP_CONF/hadoop-env.sh
Add JAVA_HOME as shown below and save the changes.
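The line to add is the same JAVA_HOME path we exported in .bashrc:
export JAVA_HOME=/usr/lib/jvm/java-7-oracle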
core-site.xml
This file contains configuration settings for Hadoop Core (e.g. I/O) that are common to HDFS and MapReduce. The default file system configuration property, fs.default.name, goes here; it could be e.g. an hdfs or s3 URI, and it will be used by clients.
$ vi $HADOOP_CONF/core-site.xml
We are going to add two properties:
- fs.default.name will point to the NameNode URL and port (usually 8020)
- hadoop.tmp.dir – a base for other temporary directories. It is important to note that every node needs a hadoop tmp directory. I am going to create a new directory “hdfstmp” as below in all 4 nodes. Ideally you can write a shell script to do this for you (see the sketch after the commands below), but for now I am going the manual way.
$ cd
$ mkdir hdfstmp
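A minimal sketch of the scripted route for the other three nodes, run from the master (assuming the key is loaded in ssh-agent per section 1.5; the angle-bracket hostnames are placeholders for your nodes’ public DNS names):
$ for host in <snn-public-dns> <slave1-public-dns> <slave2-public-dns>; do ssh ubuntu@$host mkdir -p /home/ubuntu/hdfstmp; done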
Quick Tip: Some of the important directories are dfs.name.dir and dfs.data.dir in hdfs-site.xml. The default value for dfs.name.dir is ${hadoop.tmp.dir}/dfs/name and for dfs.data.dir it is ${hadoop.tmp.dir}/dfs/data. It is critical that you choose your directory locations wisely in a production environment.
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ec2-54-209-221-112.compute-1.amazonaws.com:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/ubuntu/hdfstmp</value>
  </property>
</configuration>
hdfs-site.xml
This file contains the configuration for the HDFS daemons: the NameNode, the SecondaryNameNode, and the DataNodes.
We are going to add 2 properties
- dfs.permissions with value false. This means that any user, not just the “hdfs” user, can do anything they want to HDFS, so do not do this in production unless you have a very good reason. If “true”, permission checking in HDFS is enabled; if “false”, permission checking is turned off, but all other behavior is unchanged. Switching from one value to the other does not change the mode, owner, or group of files or directories. Be very careful before you set this.
- dfs.replication – default block replication is 3. The actual number of replications can be specified when the file is created; the default is used if replication is not specified at create time. Since we have 2 slave nodes, we will set this value to 2.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
mapred-site.xml
This file contains the configuration settings for the MapReduce daemons: the JobTracker and the TaskTrackers.
The mapred.job.tracker parameter is a hostname (or IP address) and port pair on which the JobTracker listens for RPC communication. This parameter specifies the location of the JobTracker for TaskTrackers and MapReduce clients.
The JobTracker will be running on the master (NameNode).
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hdfs://ec2-54-209-221-112.compute-1.amazonaws.com:8021</value>
  </property>
</configuration>
1.6.1 Move configuration files to Slaves
Now that we are done with the Hadoop XML configuration on the master, let’s copy the files to the remaining 3 nodes using secure copy (scp).
Start with SNN; if you are starting a new session, follow ssh-add as per section 1.5.
From the master’s Unix shell, issue the command below:
$ scp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml ubuntu@ec2-54-209-221-47.compute-1.amazonaws.com:/home/ubuntu/hadoop/conf
Repeat this for the slave nodes, as sketched below.
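For example (HadoopSlave1’s public DNS appears later in this article; HadoopSlave2’s is a placeholder, so substitute your own):
$ scp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml ubuntu@ec2-54-209-223-7.compute-1.amazonaws.com:/home/ubuntu/hadoop/conf
$ scp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml ubuntu@<slave2-public-dns>:/home/ubuntu/hadoop/conf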
1.6.2 Configure Master and Slaves
Every Hadoop distribution comes with masters and slaves files. By default they contain one entry for localhost. We have to modify these 2 files on both the “masters” (HadoopNameNode) and “slaves” (HadoopSlave1 and HadoopSlave2) machines – we have a dedicated machine for HadoopSecondaryNameNode.
1.6.3 Modify masters file on Master machine
The conf/masters file defines on which machines Hadoop will start Secondary NameNodes in our multi-node cluster. In our case, these are two machines: HadoopNameNode and HadoopSecondaryNameNode.
From the Hadoop HDFS user guide: “The secondary NameNode merges the fsimage and the edits log files periodically and keeps edits log size within a limit. It is usually run on a different machine than the primary NameNode since its memory requirements are on the same order as the primary NameNode. The secondary NameNode is started by bin/start-dfs.sh on the nodes specified in the conf/masters file.”
$ vi $HADOOP_CONF/masters
Provide an entry for each hostname where you want to run a SecondaryNameNode daemon; in our case, HadoopNameNode and HadoopSecondaryNameNode.
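A sketch of what the file would contain with this article’s hostnames (the master and SNN public DNS names used earlier, one per line; substitute your own):
ec2-54-209-221-112.compute-1.amazonaws.com
ec2-54-209-221-47.compute-1.amazonaws.com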
1.6.4 Modify the slaves file on master machine
The slaves file is used for starting DataNodes and TaskTrackers.
$ vi $HADOOP_CONF/slaves
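With this article’s hostnames, the file would list the two slave nodes, one per line (HadoopSlave1’s public DNS appears later in this article; HadoopSlave2’s is a placeholder):
ec2-54-209-223-7.compute-1.amazonaws.com
<slave2-public-dns>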
1.6.5 Copy masters and slaves to SecondaryNameNode
Since the SecondaryNameNode configuration will be the same as the NameNode’s, we need to copy the masters and slaves files to HadoopSecondaryNameNode.
1.6.6 Configure masters and slaves on the slave nodes
Since we are configuring the slaves (HadoopSlave1 & HadoopSlave2), the masters file on a slave machine is going to be empty.
$ vi $HADOOP_CONF/masters
Next, update the ‘slaves’ file on the slave server (HadoopSlave1) with the IP address of the slave node. Notice that the ‘slaves’ file on a slave node contains only its own IP address and not that of any other data node in the cluster.
$ vi $HADOOP_CONF/slaves
Similarly, update the masters and slaves files for HadoopSlave2.
1.7 Hadoop Daemon Startup
The first step in starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystems of your cluster. You need to do this the first time you set up a Hadoop installation. Do not format a running Hadoop filesystem; this will cause all your data to be erased.
To format the NameNode:
$ hadoop namenode -format
Let’s start all Hadoop daemons from HadoopNameNode:
$ cd $HADOOP_CONF
$ start-all.sh
This will start:
- NameNode, JobTracker and SecondaryNameNode daemons on HadoopNameNode
- a SecondaryNameNode daemon on HadoopSecondaryNameNode
- DataNode and TaskTracker daemons on the slave nodes HadoopSlave1 and HadoopSlave2
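To confirm which daemons are actually running on a node, you can use the standard jps tool that ships with the JDK; on the master you should see the daemons listed above (plus Jps itself):
$ jps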
We can check the NameNode status at http://ec2-54-209-221-112.compute-1.amazonaws.com:50070/dfshealth.jsp
Check the JobTracker status: http://ec2-54-209-221-112.compute-1.amazonaws.com:50030/jobtracker.jsp
Slave node status for HadoopSlave1: http://ec2-54-209-223-7.compute-1.amazonaws.com:50060/tasktracker.jsp
To quickly verify our setup, run the Hadoop pi example:
ubuntu@ec2-54-209-221-112:~/hadoop$ hadoop jar hadoop-examples-1.2.1.jar pi 10 1000000
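If everything is wired up correctly, the job should schedule map and reduce tasks on the slave nodes and finish by printing an estimated value of Pi to the console.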