Setting up Hadoop multi-node cluster on Amazon EC2 Part 2

In Part-1 we have successfully created, launched and connected to Amazon Ubuntu Instances. In Part-2 I will show how to install and setup Hadoop cluster. If you are seeing this page first time, I would strongly advise you to go over Part-1.

In this article

  • HadoopNameNode will be referred as master,
  • HadoopSecondaryNameNode will be referred as SecondaryNameNode or SNN
  • HadoopSlave1 and HadoopSlave2 will be referred as slaves (where data nodes will reside)

So, let’s begin.

1. Apache Hadoop Installation and Cluster Setup

1.1 Update the packages and dependencies.

Let’s update the packages , I will start with master , repeat this for SNN and 2 slaves.

$ sudo apt-get update

Once its complete, let’s install java

1.2 Install Java

Add following PPA and install the latest Oracle Java (JDK) 7 in Ubuntu

$ sudo add-apt-repository ppa:webupd8team/java

$ sudo apt-get update && sudo apt-get install oracle-jdk7-installer

Check if Ubuntu uses JDK 7

java_installaationRepeat this for SNN and 2 slaves.

 1.3 Download Hadoop

I am going to use haddop 1.2.1 stable version from apache download page and here is the 1.2.1 mirror

issue wget command from shell

$ wget


Unzip the files and review the package content and configuration files.

$ tar -xzvf hadoop-1.2.1.tar.gz


For simplicity, rename the ‘hadoop-1.2.1’ directory to ‘hadoop’ for ease of operation and maintenance.

$ mv hadoop-1.2.1 hadoop


1.4 Setup Environment Variable

Setup Environment Variable for ‘ubuntu’ user

Update the .bashrc file to add important Hadoop paths and directories.

Navigate to home directory


Open .bashrc file in vi edito

$ vi .bashrc

Add following at the end of file

export HADOOP_CONF=/home/ubuntu/hadoop/conf

export HADOOP_PREFIX=/home/ubuntu/hadoop


export JAVA_HOME=/usr/lib/jvm/java-7-oracle

# Add Hadoop bin/ directory to path

Save and Exit.

To check whether its been updated correctly or not, reload bash profile, use following commands

source ~/.bashrc
Repeat 1.3 and 1.4  for remaining 3 machines (SNN and 2 slaves).

1.5 Setup Password-less SSH on Servers

Master server remotely starts services on salve nodes, whichrequires password-less access to Slave Servers. AWS Ubuntu server comes with pre-installed OpenSSh server.
Quick Note:
The public part of the key loaded into the agent must be put on the target system in ~/.ssh/authorized_keys. This has been taken care of by the AWS Server creation process
Now we need to add the AWS EC2 Key Pair identity haddopec2cluster.pem to SSH profile. In order to do that we will need to use following ssh utilities
  • ‘ssh-agent’ is a background program that handles passwords for SSH private keys.
  •  ‘ssh-add’ command prompts the user for a private key password and adds it to the list maintained by ssh-agent. Once you add a password to ssh-agent, you will not be asked to provide the key when using SSH or SCP to connect to hosts with your public key.

Amazon EC2 Instance  has already taken care of ‘authorized_keys’ on master server, execute following commands to allow password-less SSH access to slave servers.

First of all we need to protect our keypair files, if the file permissions are too open (see below) you will get an error


To fix this problem, we need to issue following commands

$ chmod 644 authorized_keys

Quick Tip: If you set the permissions to ‘chmod 644’, you get a file that can be written by you, but can only be read by the rest of the world.

$ chmod 400 haddoec2cluster.pem

Quick Tip: chmod 400 is a very restrictive setting giving only the file onwer read-only access. No write / execute capabilities for the owner, and no permissions what-so-ever for anyone else.

To use ssh-agent and ssh-add, follow the steps below:

  1. At the Unix prompt, enter: eval `ssh-agent`Note: Make sure you use the backquote ( ` ), located under the tilde ( ~ ), rather than the single quote ( ' ).
  2. Enter the command: ssh-add hadoopec2cluster.pem

if you notice .pem file has “read-only” permission now and this time it works for us.

Keep in mind ssh session will be lost upon shell exit and you have repeat ssh-agent and ssh-add commands.

Remote SSH

Let’s verify that we can connect into SNN and slave nodes from master


$ ssh ubuntu@<your-amazon-ec2-public URL>
On successful login the IP address on the shell will change.

1.6 Hadoop Cluster Setup

This section will cover the hadoop cluster configuration.  We will have to modify

  • – This file contains some environment variable settings used by Hadoop. You can use these to affect some aspects of Hadoop daemon behavior, such as where log files are stored, the maximum amount of heap used etc. The only variable you should need to change at this point is in this file is JAVA_HOME, which specifies the path to the Java 1.7.x installation used by Hadoop.
  • core-site.xml –  key property – for namenode configuration for e.g hdfs://namenode/
  • hdfs-site.xml – key property – dfs.replication – by default 3
  • mapred-site.xml  – key property  mapred.job.tracker for jobtracker configuration for e.g jobtracker:8021

We will first start with master (NameNode) and then copy above xml changes to remaining 3 nodes (SNN and slaves)

Finally, in section 1.6.2 we will have to configure conf/masters and conf/slaves.

  • masters – defines on which machines Hadoop will start secondary NameNodes in our multi-node cluster.
  • slaves –  defines the lists of hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will run.

Lets go over one by one. Start with masters (namenode).

$ vi $HADOOP_CONF/  and add JAVA_HOME shown below and save changes.


This file contains configuration settings for Hadoop Core (for e.g I/O) that are common to HDFS and MapReduce Default file system configuration property –  goes here it could for e.g hdfs / s3 which will be used by clients.

$ vi $HADOOP_CONF/core-site.xml

We are going t0 add two properties

  •  will point to NameNode URL and port (usually 8020)
  • hadoop.tmp.dir  – A base for other temporary directories. Its important to note that every node needs hadoop tmp directory.  I am going to create a new directory “hdfstmp”  as below in all 4 nodes. Ideally you can write a shell script to do this for you, but for now going the manual way.

$ cd

$ mkdir hdfstmp

Quick Tip:  Some of the important directories are, in hdfs-site.xml. The default value for the is ${hadoop.tmp.dir}/dfs/data and is${hadoop.tmp.dir}/dfs/data. It is critical that you choose your directory location wisely in production environment.






This file contains the configuration for HDFS daemons, the NameNode, SecondaryNameNode  and data nodes.

We are going to add 2 properties

  • dfs.permissions.enabled  with value false,  This means that any user, not just the “hdfs” user, can do anything they want to HDFS so do not do this in production unless you have a very good reason. if “true”, enable permission checking in HDFS. If “false”, permission checking is turned off, but all other behavior is unchanged. Switching from one parameter value to the other does not change the mode, owner or group of files or directories. Be very careful before you set this
  • dfs.replication  – Default block replication is 3. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time. Since we have 2 slave nodes we will set this value to 2.




This file contains the configuration settings for MapReduce daemons; the job tracker and the task-trackers.
The mapred.job.tracker parameter is a hostname (or IP address) and port pair on which the Job Tracker listens for RPC communication. This parameter specify the location of the Job Tracker for Task Trackers and MapReduce clients.

JobTracker will be running on master (NameNode)


1.6.1 Move configuration files to Slaves

Now, we are done with hadoop xml files configuration master, lets copy the files to remaining 3 nodes using secure copy (scp)

start with SNN, if you are starting a new session, follow ssh-add as per section 1.5

from master’s unix shell issue below command

$ scp core-site.xml hdfs-site.xml mapred-site.xml

repeat this for slave nodes


1.6.2 Configure Master and Slaves

Every hadoop distribution comes with master and slaves files. By default it contains one entry for localhost, we have to modify these 2 files on both “masters” (HadoopNameNode) and “slaves” (HadoopSlave1 and HadoopSlave2) machines – we have a dedicated machine for HadoopSecondaryNamdeNode.



1.6.3 Modify masters file on Master machine

conf/masters file defines on which machines Hadoop will start Secondary NameNodes in our multi-node cluster. In our case, there will be two machines HadoopNameNode and HadoopSecondaryNameNode

Hadoop HDFS user guide : “The secondary NameNode merges the fsimage and the edits log files periodically and keeps edits log size within a limit. It is usually run on a different machine than the primary NameNode since its memory requirements are on the same order as the primary NameNode. The secondary NameNode is started by “bin/“ on the nodes specified in “conf/masters“ file.

$ vi $HADOOP_CONF/masters and provide an entry for the hostename where you want to run SecondaryNameNode daemon. In our case HadoopNameNode and HadoopSecondaryNameNode


1.6.4 Modify the slaves file on master machine

The slaves file is used for starting DataNodes and TaskTrackers

$ vi $HADOOP_CONF/slaves


1.6.5 Copy masters and slaves to SecondaryNameNode

Since SecondayNameNode configuration will be same as NameNode, we need to copy master and slaves to HadoopSecondaryNameNode.


1.6.7 Configure master and slaves on “Slaves” node

Since we are configuring slaves (HadoopSlave1 & HadoopSlave2) , masters file on slave machine is going to be empty

$ vi $HADOOP_CONF/masters


Next, update the ‘slaves’ file on Slave server (HadoopSlave1) with the IP address of the slave node. Notice that the ‘slaves’ file at Slave node contains only its own IP address and not of any other Data Node in the cluster.

$ vi $HADOOP_CONF/slaves


Similarly update masters and slaves for HadoopSlave2

1.7 Hadoop Daemon Startup

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which runs on top of your , which is implemented on top of the local filesystems of your cluster. You need to do this the first time you set up a Hadoop installation. Do not format a running Hadoop filesystem, this will cause all your data to be erased.

To format the namenode

$ hadoop namenode -format


Lets start all hadoop daemons from HadoopNameNode



This will start

  • NameNode,JobTracker and SecondaryNameNode daemons on HadoopNameNode


  • SecondaryNameNode daemons on HadoopSecondaryNameNode


  • and DataNode and TaskTracker daemons on slave nodes HadoopSlave1 and HadoopSlave2

Screen shot 2014-01-13 at 9.59.38 AM

Screen shot 2014-01-13 at 9.59.56 AM

We can check the namenode status from


Check Jobtracker status :


Slave Node Status for HadoopSlave1 :


Slave Node Status for HadoopSlave2 :


To quickly verify our setup, run the hadoop pi example

ubuntu@ec2-54-209-221-112:~/hadoop$ hadoop jar hadoop-examples-1.2.1.jar pi 10 1000000

Number of Maps  = 10
Samples per Map = 1000000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
14/01/13 15:44:12 INFO mapred.FileInputFormat: Total input paths to process : 10
14/01/13 15:44:13 INFO mapred.JobClient: Running job: job_201401131425_0001
14/01/13 15:44:14 INFO mapred.JobClient:  map 0% reduce 0%
14/01/13 15:44:32 INFO mapred.JobClient:  map 20% reduce 0%
14/01/13 15:44:33 INFO mapred.JobClient:  map 40% reduce 0%
14/01/13 15:44:46 INFO mapred.JobClient:  map 60% reduce 0%
14/01/13 15:44:56 INFO mapred.JobClient:  map 80% reduce 0%
14/01/13 15:44:58 INFO mapred.JobClient:  map 100% reduce 20%
14/01/13 15:45:03 INFO mapred.JobClient:  map 100% reduce 33%
14/01/13 15:45:06 INFO mapred.JobClient:  map 100% reduce 100%
14/01/13 15:45:09 INFO mapred.JobClient: Job complete: job_201401131425_0001
14/01/13 15:45:09 INFO mapred.JobClient: Counters: 30
14/01/13 15:45:09 INFO mapred.JobClient:   Job Counters
14/01/13 15:45:09 INFO mapred.JobClient:     Launched reduce tasks=1
14/01/13 15:45:09 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=145601
14/01/13 15:45:09 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/01/13 15:45:09 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/01/13 15:45:09 INFO mapred.JobClient:     Launched map tasks=10
14/01/13 15:45:09 INFO mapred.JobClient:     Data-local map tasks=10
14/01/13 15:45:09 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=33926
14/01/13 15:45:09 INFO mapred.JobClient:   File Input Format Counters
14/01/13 15:45:09 INFO mapred.JobClient:     Bytes Read=1180
14/01/13 15:45:09 INFO mapred.JobClient:   File Output Format Counters
14/01/13 15:45:09 INFO mapred.JobClient:     Bytes Written=97
14/01/13 15:45:09 INFO mapred.JobClient:   FileSystemCounters
14/01/13 15:45:09 INFO mapred.JobClient:     FILE_BYTES_READ=226
14/01/13 15:45:09 INFO mapred.JobClient:     HDFS_BYTES_READ=2740
14/01/13 15:45:09 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=622606
14/01/13 15:45:09 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=215
14/01/13 15:45:09 INFO mapred.JobClient:   Map-Reduce Framework
14/01/13 15:45:09 INFO mapred.JobClient:     Map output materialized bytes=280
14/01/13 15:45:09 INFO mapred.JobClient:     Map input records=10
14/01/13 15:45:09 INFO mapred.JobClient:     Reduce shuffle bytes=280
14/01/13 15:45:09 INFO mapred.JobClient:     Spilled Records=40
14/01/13 15:45:09 INFO mapred.JobClient:     Map output bytes=180
14/01/13 15:45:09 INFO mapred.JobClient:     Total committed heap usage (bytes)=2039111680
14/01/13 15:45:09 INFO mapred.JobClient:     CPU time spent (ms)=9110
14/01/13 15:45:09 INFO mapred.JobClient:     Map input bytes=240
14/01/13 15:45:09 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1560
14/01/13 15:45:09 INFO mapred.JobClient:     Combine input records=0
14/01/13 15:45:09 INFO mapred.JobClient:     Reduce input records=20
14/01/13 15:45:09 INFO mapred.JobClient:     Reduce input groups=20
14/01/13 15:45:09 INFO mapred.JobClient:     Combine output records=0
14/01/13 15:45:09 INFO mapred.JobClient:     Physical memory (bytes) snapshot=1788379136
14/01/13 15:45:09 INFO mapred.JobClient:     Reduce output records=0
14/01/13 15:45:09 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=10679681024
14/01/13 15:45:09 INFO mapred.JobClient:     Map output records=20
Job Finished in 57.825 seconds
Estimated value of Pi is 3.14158440000000000000
You can check the job tracker status page to look at complete job status
Drill down into completed job and you can see more details on Map Reduce tasks.
At last do not forget to terminate your amazon ec2 instances or you will be continued to get charged
That’s it for this article, hope you find it useful
Happy Hadoop Year!

Setting up Hadoop multi-node cluster on Amazon EC2 Part 1

After spending some time playing around on Single-Node pseudo-distributed cluster its time to get into real world hadoop. Depending on what works best – Its important to note that there are multiple ways to achieve this and I am going to cover how to setup multi-node hadoop cluster on Amazon EC2. We are going to setup 4 node hadoop cluster as below.

  • NameNode (Master)
  • SecondaryNameNode
  • DataNode (Slave1)
  • DataNode (Slave2)

Here’s what you will need

  1. Amazon AWS Account
  2. PuTTy Windows Client (to connect to Amazon EC2 instance)
  3. PuTTYgen (to generate private key – this will be used in putty to connect to EC2 instance)
  4. WinSCP (secury copy)

This will be a two part series

In Part-1 I will cover infrastructure side as below

  1. Setting up Amazon EC2 Instances
  2. Setting up client access to Amazon Instances (using Putty.)
  3. Setup WinSCP access to EC2 instances

In Part-2 I will cover the hadoop multi node cluster installation

  1. Hadoop Multi-Node Installation and setup

1. Setting up Amazon EC2 Instances

With 4 node clusters and minimum volume size of 8GB there would be an average $2 of charge per day with all 4 running instances. You can stop the instance anytime to avoid the charge, but you will loose the public IP and host and restarting the instance will create new ones,. You can also terminate your Amazon EC2 instance anytime and by default it will delete your instance upon termination, so just be careful what you are doing.

1.1 Get Amazon AWS Account

If you do not already have a account, please create a new one. I already have AWS account and going to skip the sign-up process. Amazon EC2 comes with eligible free-tier instances.

Screen shot 2014-01-10 at 1.33.28 PM

1.2 Launch Instance

Once you have signed up for Amazon account. Login to Amazon Web Services, click on My Account and navigate to Amazon EC2 Console


1.3 Select AMI

I am picking Ubuntu Server 12.04.3  Server 64-bit OS

Step_1_choose_AMI1.4 Select Instance Type

Select the micro instance


1.5 Configure Number of Instances

As mentioned we are setting up 4 node hadoop cluster, so please enter 4 as number of instances. Please check Amazon EC2 free-tier requirements, you may setup 3 node cluster with < 30GB storage size to avoid any charges.  In production environment you want to have SecondayNameNode as separate machine


1.6 Add Storage

Minimum volume size is 8GB


1.7 Instance Description

Give your instance name and description

Step_5_Instance_description1.8 Define a Security Group

Create a new security group, later on we are going to modify the security group with security rules.


1.9 Launch Instance and Create Security Pair

Review and Launch Instance.

Amazon EC2 uses public–key cryptography to encrypt and decrypt login information. Public–key cryptography uses a public key to encrypt a piece of data, such as a password, then the recipient uses the private key to decrypt the data. The public and private keys are known as a key pair.

Create a new keypair and give it a name “hadoopec2cluster” and download the keypair (.pem) file to your local machine. Click Launch Instance


1.10 Launching Instances

Once you click “Launch Instance” 4 instance should be launched with “pending” state


Once in “running” state we are now going to rename the instance name as below.

  1. HadoopNameNode (Master)
  2. HadoopSecondaryNameNode
  3. HadoopSlave1 (data node will reside here)
  4. HaddopSlave2  (data node will reside here)


Please note down the Instance ID, Public DNS/URL (  and Public IP for each instance for your reference.. We will need it later on to connect from Putty client.  Also notice we are using “HadoopEC2SecurityGroup”.


You can use the existing group or create a new one. When you create a group with default options it add a rule for SSH at port 22.In order to have TCP and ICMP access we need to add 2 additional security rules. Add ‘All TCP’, ‘All ICMP’ and ‘SSH (22)’ under the inbound rules to “HadoopEC2SecurityGroup”. This will allow ping, SSH, and other similar commands among servers and from any other machine on internet. Make sure to “Apply Rule changes” to save your changes.

These protocols and ports are also required to enable communication among cluster servers. As this is a test setup we are allowing access to all for TCP, ICMP and SSH and not bothering about the details of individual server port and security.


2.Setting up client access to Amazon Instances

Now, lets make sure we can connect to all 4 instances.For that we are going to use Putty client We are going setup password-less SSH access among servers to setup the cluster. This allows remote access from Master Server to Slave Servers so Master Server can remotely start the Data Node and Task Tracker services on Slave servers.

We are going to use downloaded hadoopec2cluster.pem file to generate the private key (.ppk). In order to generate the private key we need Puttygen client. You can download the putty and puttygen and various utilities in zip from here.

2.1 Generating Private Key

Let’s launch PUTTYGEN client and import the key pair we created during launch instance step – “hadoopec2cluster.pem”

Navigate to Conversions and “Import Key”


load_private_keyOnce you import the key You can enter passphrase to protect your private key or leave the passphrase fields blank to use the private key without any passphrase. Passphrase protects the private key from any unauthorized access to servers using your machine and your private key.

Any access to server using passphrase protected private key will require the user to enter the passphrase to enable the private key enabled access to AWS EC2 server.

2.2 Save Private Key

Now save the private key by clicking on “Save Private Key” and click “Yes” as  we are going to leave passphrase empty.


Save the .ppk file and give it a meaningful name


Now we are ready to connect to our Amazon Instance Machine for the first time.

2.3 Connect to Amazon Instance

Let’s connect to HadoopNameNode first. Launch Putty client, grab the public URL , import the .ppk private key that we just created for password-less SSH access. As per amazon documentation, for Ubuntu machines username is “ubuntu”

2.3.1 Provide private key for authentication


2.3.2 Hostname and Port and Connection Type

and “Open” to launch putty session


when you launch the session first time, you will see below message, click “Yes”connect_to_HadoopNameNode

and will prompt you for the username, enter ubuntu, if everything goes well you will be presented welcome message with Unix shell at the end.


If there is a problem with your key, you may receive below error messageputty_error_message

Similarly connect to remaining 3 machines HadoopSecondaryNameNode, HaddopSlave1,HadoopSlave2 respectively to make sure you can connect successfully.


2.4 Enable Public Access

Issue ifconfig command and note down the ip address. Next, we are going to update the hostname with ec2 public URL and finally we are going to update /etc/hosts file to map  the ec2 public URL with ip address. This will help us to configure master ans slaves nodes with hostname instead of ip address.

Following is the output on HadoopNameNode ifconfig



now, issue the hostname command, it will display the ip address same as inet address from ifconfig command.


We need to modify the hostname to ec2 public URL with below command

sudo hostname


2.5 Modify /etc/hosts

Lets change the host to EC2 public IP and hostname.

Open the /etc/hosts in vi, in a very first line it will show localhost, we need to replace that with amazon ec2 hostname and ip address we just collected.


Modify the file and save your changes


Repeat 2.3 and 2.4 sections for remaining 3 machines.

3. Setup WinSCP access to EC2 instances

In order to securely transfer files from your windows machine to Amazon EC2 WinSCP is a handy utility.

Provide hostname, username and private key file and save your configuration and Login



If you see above error, just ignore and you upon successful login you will see unix file system of a logged in user /home/ubuntu your Amazon EC2 Ubuntu machine.


Upload the .pem file to master machine (HadoopNameNode). It will be used while connecting to slave nodes during hadoop startup daemons.

If you have made this far, good work! Things will start to get more interesting in Part-2.

How to install Pig & Hive on Linux Mint VM

First lets start with Hive

Running Hive

Hive uses Hadoop, so:

  • you must have Hadoop in your path OR
  • export HADOOP_HOME=<hadoop-install-dir>

In addition, you must create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set them chmod g+w in HDFS before you can create a table in Hive.

Commands to perform this setup:

  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse

You may find it useful, though it’s not necessary, to set HIVE_HOME:

  $ export HIVE_HOME=<hive-install-dir>

Installing Hive

  • Download a recent stable release from one of the Apache Download Mirror

$ wget

  • extract the downloaded file

$ tar -xzvf hive-0.12.0-bin.tar.gz

  • move the extracted files into new folder (optional)

$ mv hive-0.12.0-bin hive

  • navigate to hive  directory and add the hive installation directory into classpath

$ cd hive

$ pwd
$ export HIVE_HOME=/home/hduser/hive

$ export PATH=$HIVE_HOME/bin:$PATH

  • start hive by typing , run show tables; command to make sure hive is running

$ hive

Now, lets install Pig 

1. Download a recent stable release from one of the Apache Download Mirrors

$ wget

2.Unpack the downloaded Pig distribution, a

$ tar -xzvf pig-0.12.0.tar.gz

3.move directory (optional)

$ mv pig-0.12.0 pig

4.go to pig installation directory

$ cd pig

5. add pig installation directory to classpath

$ pwd


$ export PIG_HOME=/home/hduser/pig

$ export PATH=$PIG_HOME/bin:$PATH

6. Test the pig installation with simple command
$pig -help

Running Hadoop MapReduce Application from Eclipse Kepler

Its very important to learn hadoop by pracitce.

One of the learning curve is how to write first map reduce app and debug in favorite IDE Eclipse? Do we need any Eclipse plugins? No, we do not. We can do haooop development without map reduce plugins

This tutorial will show you how to setup eclipse and run you map reduce project and MapReduce job right from IDE. Before you read further, you should have setup Hadoop single node cluster and your machine.

You can download the eclipse project from GitHub

Use Case:

We will explore the weather data to find maximum temperature from Tom White’s book Hadoop: Definitive Guide (3rd edition) Chapter 2 and run it using ToolRunner

I am using linux mint 15 on VirtualBox VM instance.

In addition,you should have

  1. Hadoop (MRV1 am using 1.2.1) Single Node Cluster Installed and Running, If you have not done so, would strongly recommend you do it from here 
  2. Download Eclipse IDE, as of writing this, latest version of Eclipse is Kepler

1.Create New Java Project


2.Add Dependencies JARs

Right click on project properties and select Java build path

add all jars from $HADOOP_HOME/lib and $HADOOP_HOME (where hadoop core and tools jar lives)



3. Create Mapper

package com.letsdobigdata;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxTemperatureMapper extends
 Mapper<LongWritable, Text, Text, IntWritable> {
private static final int MISSING = 9999;
 public void map(LongWritable key, Text value, Context context)
 throws IOException, InterruptedException {
String line = value.toString();
 String year = line.substring(15, 19);
 int airTemperature;
 if (line.charAt(87) == '+') { // parseInt doesn't like leading plus
 // signs
 airTemperature = Integer.parseInt(line.substring(88, 92));
 } else {
 airTemperature = Integer.parseInt(line.substring(87, 92));
 String quality = line.substring(92, 93);
 if (airTemperature != MISSING && quality.matches("[01459]")) {
 context.write(new Text(year), new IntWritable(airTemperature));

4. Create Reducer

package com.letsdobigdata;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTemperatureReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values,
 Context context)
 throws IOException, InterruptedException {

 int maxValue = Integer.MIN_VALUE;
 for (IntWritable value : values) {
 maxValue = Math.max(maxValue, value.get());
 context.write(key, new IntWritable(maxValue));

5. Create Driver for MapReduce Job

Map Reduce job is executed by useful hadoop utility class ToolRunner

package com.letsdobigdata;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
/*This class is responsible for running map reduce job*/
public class MaxTemperatureDriver extends Configured implements Tool{
public int run(String[] args) throws Exception

 if(args.length !=2) {
 System.err.println("Usage: MaxTemperatureDriver <input path> <outputpath>");

 Job job = new Job();
 job.setJobName("Max Temperature");

 FileInputFormat.addInputPath(job, new Path(args[0]));
 FileOutputFormat.setOutputPath(job,new Path(args[1]));



 System.exit(job.waitForCompletion(true) ? 0:1); 
 boolean success = job.waitForCompletion(true);
 return success ? 0 : 1;
public static void main(String[] args) throws Exception {
 MaxTemperatureDriver driver = new MaxTemperatureDriver();
 int exitCode =, args);

6. Supply Input and Output

We need to supply input file that will be used during Map phase and the final output will be generated in output directory by Reduct task. Edit Run Configuration and supply command line arguments. sample.txt reside in the project root.  Your project explorer should contain following



7.Map Reduce Job Execution


8. Final Output

If you managed to come this far, Once the job is complete, it will create output directory with _SUCCESS and part_nnnnn , double click to view it in eclipse editor and you will see we have supplied 5 rows of weather data (downloaded from NCDC  weather) and we wanted to find out the maximum temperature in a given year from input file and the output will contain 2 rows with max temperature in (Centigrade) for each supplied year

1949 111 (11.1 C)
1950 22 (2.2 C)


Make sure you delete the output directory next time running your application else you will get an error from Hadoop saying directory already exists.

Happy Hadooping!

Part 2: D3.js – Playing with Data

First of all, some basics As I promised in Part 1.  D3.js is more beautiful with data, after all it makes sense to make cool visualizations with data , ultimately real value comes ouf of working with data. Data visualizations that makes data sensible to us. 

Little bit on D3 Selectors

One of the things that I love about D3 is its W3C standard-compliant (HTML,SVG, CSS). You can control the SVG elements styles thru external stylesheets. Modern browsers Chrome,Firefox,Opera are bringing new features from W3C specs and you get that out of the box with D3.js. It’s not reinventing the wheel and that’s what I like about D3.js

D3 Selectos uses W3C DOM Selectors API and it says:

The Selectors API specification defines methods for retrieving Element nodes from the DOM by matching against a group of selectors.”

You can select the elements by 
  • tag (“tag”) , 
  • class (“.class”),
  • element id (“#id”), attribute
  • (“[name=value]”), 
  • containment (“parent child”), 
  • adjacency (“before after”),
and various other facets. Predicates can be intersected (“.a.b”)
or unioned (“.a, .b”), resulting in a rich but concise selection method.

Let’s get to some real stuff now. When you use d3,  its not required to build data visualizations with SVG only, you can work with normal HTML elements as well, however when you are talking hundreds and thousand of millions of records for data visualizations with filters and animations,  SVG will blow you with results

Bring in some data,  dude!!!

<!DOCTYPE html>
<title>Hello, Bar Charts!!!</title>
var dataset = [];                        
for (var i = 0; i <50; i++) {
//Generate random number (0-35)  
 var randomNumber = Math.random() * 35;  
    .attr(“class”, “bar”)
    .style(“height”, function(d) {
    var barHeight = d * 5;  //its all about height , calculation matters 
    return barHeight + “px”;
    display: inline-block;
    width: 20px;
    height: 75px;   /*Initial height, */
    background-color: teal;
    margin-right: 2px;
Let’s understand the code now – we are selecting body element and selecting all “div” elements and binding with .data (dataset).  Notice here d3 is making use of method chaining and data() is smart enough to loop thru all array elements and applying CSS style at the same time. This is the real power. Your visualization can be complex,  but power of data() will amaze you.
Its very important to understand how visualization work in web browser. D3 does this very cleanly, 
Using D3’s enter and exit selections, you can create new nodes for incoming data and remove outgoing nodes that are no longer needed.
d3.enter it creates placeholder nodes for incoming data.

Its rather important to understand d3.exit() at the same time

d3.exitit removes existing DOM elements in the current selection for which no new data element was found




Notice the dataset variable in your Javascript console

Ok, in Part3 I am going cover how to draw with SVG with some animations and will explore more on data()
Have fun!!!