Setting up a Hadoop multi-node cluster on Amazon EC2, Part 1


After spending some time playing around on a single-node pseudo-distributed cluster, it's time to get into real-world Hadoop. It is important to note that there are multiple ways to achieve this depending on what works best for you; I am going to cover how to set up a multi-node Hadoop cluster on Amazon EC2. We are going to set up a 4-node Hadoop cluster as below.

  • NameNode (Master)
  • SecondaryNameNode
  • DataNode (Slave1)
  • DataNode (Slave2)

Here's what you will need:

  1. Amazon AWS Account
  2. PuTTY Windows client (to connect to the Amazon EC2 instance)
  3. PuTTYgen (to generate the private key; this will be used in PuTTY to connect to the EC2 instance)
  4. WinSCP (secure copy)

This will be a two-part series.

In Part 1 I will cover the infrastructure side as below:

  1. Setting up Amazon EC2 Instances
  2. Setting up client access to Amazon Instances (using PuTTY)
  3. Setup WinSCP access to EC2 instances

In Part 2 I will cover the Hadoop multi-node cluster installation:

  1. Hadoop Multi-Node Installation and setup

1. Setting up Amazon EC2 Instances

With a 4-node cluster and the minimum volume size of 8GB, there would be an average charge of about $2 per day with all 4 instances running. You can stop the instances anytime to avoid the charge, but you will lose the public IP and hostname, and restarting an instance will create new ones. You can also terminate your Amazon EC2 instance anytime; by default the instance is deleted upon termination, so just be careful what you are doing.

1.1 Get Amazon AWS Account

If you do not already have an account, please create a new one. I already have an AWS account and am going to skip the sign-up process. Amazon EC2 comes with eligible free-tier instances.


1.2 Launch Instance

Once you have signed up for an Amazon account, log in to Amazon Web Services, click on My Account, and navigate to the Amazon EC2 Console.

Launch_Instance

1.3 Select AMI

I am picking the Ubuntu Server 12.04.3 LTS 64-bit OS.

Step_1_choose_AMI

1.4 Select Instance Type

Select the micro instance.

Step_2_Instance_Type

1.5 Configure Number of Instances

As mentioned, we are setting up a 4-node Hadoop cluster, so please enter 4 as the number of instances. Please check the Amazon EC2 free-tier requirements; you may set up a 3-node cluster with less than 30GB of total storage to avoid any charges. In a production environment you want the SecondaryNameNode on a separate machine.

Step_3_Instance_Details

1.6 Add Storage

The minimum volume size is 8GB.

Step_4_Add_Storage

1.7 Instance Description

Give your instance a name and description.

Step_5_Instance_description

1.8 Define a Security Group

Create a new security group; later on we are going to modify it with additional security rules.

Step_6_Security_Group

1.9 Launch Instance and Create Key Pair

Review and Launch Instance.

Amazon EC2 uses public-key cryptography to encrypt and decrypt login information. Public-key cryptography uses a public key to encrypt a piece of data, such as a password; the recipient then uses the private key to decrypt the data. The public and private keys are known as a key pair.

Create a new key pair, give it the name "hadoopec2cluster", and download the key pair (.pem) file to your local machine. Then click Launch Instance.

hadoopec2cluster_keypair

1.10 Launching Instances

Once you click "Launch Instance", 4 instances should be launched in the "pending" state.

launching_instance

Once they are in the "running" state, we are going to rename the instances as below:

  1. HadoopNameNode (Master)
  2. HadoopSecondaryNameNode
  3. HadoopSlave1 (data node will reside here)
  4. HadoopSlave2 (data node will reside here)

running_instances

Please note down the Instance ID, Public DNS/URL (ec2-54-209-221-112.compute-1.amazonaws.com), and Public IP for each instance for your reference. We will need them later on to connect from the PuTTY client. Also notice we are using "HadoopEC2SecurityGroup".

Public_DNS_IP_instance_id

You can use the existing group or create a new one. When you create a group with default options, it adds a rule for SSH at port 22. In order to have TCP and ICMP access we need to add 2 additional security rules. Add 'All TCP', 'All ICMP' and 'SSH (22)' under the inbound rules of "HadoopEC2SecurityGroup". This will allow ping, SSH, and similar commands among the servers and from any other machine on the internet. Make sure to "Apply Rule changes" to save your changes.

These protocols and ports are also required to enable communication among the cluster servers. As this is a test setup, we are allowing open access for TCP, ICMP, and SSH rather than locking down individual server ports.
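If you prefer the command line over the console, the same inbound rules can be added with the AWS CLI. This is a minimal sketch, assuming the CLI is installed and configured with your credentials and that the group is named "HadoopEC2SecurityGroup" as above:

aws ec2 authorize-security-group-ingress --group-name HadoopEC2SecurityGroup --protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name HadoopEC2SecurityGroup --protocol tcp --port 0-65535 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name HadoopEC2SecurityGroup --protocol icmp --port -1 --cidr 0.0.0.0/0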

Security_Group_Rule

2. Setting up client access to Amazon Instances

Now, let's make sure we can connect to all 4 instances. For that we are going to use the PuTTY client. We are also going to set up password-less SSH access among the servers to set up the cluster. This allows remote access from the Master server to the Slave servers, so the Master server can remotely start the DataNode and TaskTracker services on the Slave servers.

We are going to use the downloaded hadoopec2cluster.pem file to generate the private key (.ppk). In order to generate the private key we need the PuTTYgen client. You can download PuTTY, PuTTYgen, and various other utilities in a zip from here.

2.1 Generating Private Key

Let's launch the PuTTYgen client and import the key pair we created during the launch instance step: "hadoopec2cluster.pem".

Navigate to Conversions and select "Import key".

import_key

load_private_key

Once you import the key, you can enter a passphrase to protect your private key, or leave the passphrase fields blank to use the private key without any passphrase. A passphrase protects the private key from unauthorized use by anyone who gets hold of your machine and your key file.

Any access to a server using a passphrase-protected private key will require the user to enter the passphrase before the key can be used to access the AWS EC2 server.

2.2 Save Private Key

Now save the private key by clicking "Save private key", and click "Yes" when warned, as we are going to leave the passphrase empty.

Save_PrivateKey

Save the .ppk file and give it a meaningful name.

save_ppk_file
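As an aside, if you have access to a Linux box with the putty-tools package installed, the same .pem-to-.ppk conversion can be scripted instead of done through the GUI. A quick sketch (the output file name is arbitrary):

puttygen hadoopec2cluster.pem -o hadoopec2cluster.ppk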

Now we are ready to connect to our Amazon Instance Machine for the first time.

2.3 Connect to Amazon Instance

Let's connect to HadoopNameNode first. Launch the PuTTY client, grab the public URL, and import the .ppk private key that we just created for password-less SSH access. As per the Amazon documentation, for Ubuntu machines the username is "ubuntu".
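If you are following along from Linux or Mac rather than Windows, you do not need PuTTY or the .ppk file at all; the stock OpenSSH client can use the .pem key directly. A sketch, reusing the example public DNS from above:

chmod 400 hadoopec2cluster.pem
ssh -i hadoopec2cluster.pem ubuntu@ec2-54-209-221-112.compute-1.amazonaws.com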

2.3.1 Provide private key for authentication

connect_to_HadoopNameNode3

2.3.2 Hostname and Port and Connection Type

Provide the hostname and port (22), then click "Open" to launch the PuTTY session.

connect_to_HadoopNameNode4

When you launch the session for the first time, you will see the message below; click "Yes".

connect_to_HadoopNameNode

PuTTY will then prompt you for the username; enter ubuntu. If everything goes well, you will be presented with a welcome message and a Unix shell prompt at the end.

connect_to_HadoopNameNode2

If there is a problem with your key, you may receive the error message below.

putty_error_message

Similarly, connect to the remaining 3 machines (HadoopSecondaryNameNode, HadoopSlave1, and HadoopSlave2) to make sure you can connect successfully.

4_connnected_amazon_instances

2.4 Enable Public Access

Issue the ifconfig command and note down the IP address. Next, we are going to update the hostname with the EC2 public URL, and finally we are going to update the /etc/hosts file to map the EC2 public URL to that IP address. This will let us configure the master and slave nodes with hostnames instead of IP addresses.

Following is the ifconfig output on HadoopNameNode.

ifconfig

host_name_ip_address_mapping
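If you only want the address itself, a quick filter over the ifconfig output works too. A sketch, assuming the primary interface is eth0 as on a stock Ubuntu 12.04 instance:

ifconfig eth0 | grep 'inet addr'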

Now issue the hostname command; it will display the same IP address as the inet address from the ifconfig command.

hostname_by_ip

We need to change the hostname to the EC2 public URL with the command below:

sudo hostname ec2-54-209-221-112.compute-1.amazonaws.com

update_hostname
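Note that a hostname set this way does not survive a reboot. To make it permanent on Ubuntu, you would also write it to /etc/hostname; a sketch using the same example URL:

echo "ec2-54-209-221-112.compute-1.amazonaws.com" | sudo tee /etc/hostname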

2.5 Modify /etc/hosts

Let's change the hosts entry to the EC2 public hostname and IP address.

Open /etc/hosts in vi. The very first line will show 127.0.0.1 localhost; we need to replace that with the Amazon EC2 hostname and the IP address we just collected.
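The file is owned by root, so open it with sudo:

sudo vi /etc/hosts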

open_etc_hosts_in_vi

Modify the file and save your changes.

modify_vi_hosts
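For reference, the edited first line should end up looking something like this. The address below is made up for illustration; use the inet address your own ifconfig reported:

10.252.140.33 ec2-54-209-221-112.compute-1.amazonaws.com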

Repeat sections 2.3 through 2.5 for the remaining 3 machines.

3. Setup WinSCP access to EC2 instances

WinSCP is a handy utility for securely transferring files from your Windows machine to Amazon EC2.

Provide the hostname, username, and private key file, save your configuration, and log in.

winscp

winscp_error

If you see the above error, just ignore it; upon successful login you will see the Unix file system of the logged-in user (/home/ubuntu) on your Amazon EC2 Ubuntu machine.

winscp_view

Upload the .pem file to the master machine (HadoopNameNode). It will be used to connect to the slave nodes while starting the Hadoop daemons.
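If you'd rather skip the WinSCP GUI for this step, PuTTY's pscp command-line client can do the same upload. A sketch, reusing the .ppk key and the example public DNS from earlier:

pscp -i hadoopec2cluster.ppk hadoopec2cluster.pem ubuntu@ec2-54-209-221-112.compute-1.amazonaws.com:/home/ubuntu/

Once the file is on the master, it is good practice to tighten its permissions with chmod 400 hadoopec2cluster.pem before using it for SSH.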

If you have made it this far, good work! Things will start to get more interesting in Part 2.

27 comments

  1. Good article.
    Just a question: do you really need to map the IP addresses to the hosts, given that every time the cluster restarts new public IPs are issued and the mapping would need to be changed?

  2. Excellent Tutorial!
    However, I am stuck at the step to connect to an instance using PuTTY. I am getting a "Server refused our key" error. Could you please point me to a solution for this? Thanks!

  3. Hello, I have followed your tutorial, but unfortunately I cannot connect to any instances; the connection is always refused. When I try to ping the instance's public IP, it always returns "request timed out". Do you have any solution for this problem?

    1. You should try to reach your AWS instance from PuTTY or WinSCP with the public IP. AWS provides a .pem key file; you need to create a .ppk out of it using PuTTYgen and try connecting to the machine first.

    1. Are you issuing the scp command correctly?

      It should be:

      scp sourceuser@sourcehost:/path/to/source/file destinationuser@destinationhost:/path/to/destination/

      Hope it helps.

      1. OMG, this is fantastic. You provided me all the details and snapshots. I could not expect a better answer than this.

        This is absolutely the best of all the Hadoop/EC2/Ubuntu tutorials. Let me follow it up and get it done.

        Your support is greatly appreciated. Thanks a million :)

        Cheers,

        Robin

  4. This isn't a robust way of setting up a Hadoop cluster. The IP you got from ifconfig is dynamic and can change at any time. You should really be using a VPC and associating the VPC security group with your cluster.

    1. Thanks Saurabh, really appreciate your comments. I am always looking to improve on how I do things and will certainly be looking into the VPC solution. Thanks again!

  5. This is an excellent tutorial; you made it easy to follow through.
    I have a concern here: it says Hadoop is not good with IPv6, so do we need to disable IPv6 for the Hadoop installation?
    I was able to upload my .pem and .ppk files to the servers, but I wonder, can I just copy the contents and nano/paste them into new files on the server? They all seem to open as text files.

    Thanks again for your wonderful video.

    Robin

  6. Robin again.
    I have uploaded the .pem keys and am able to ssh-add keyname.pem. However, ssh to any of the other nodes just doesn't work and I get the error 'permission denied (publickey)'. It seems I am never able to ssh to any other nodes.
    Also, in Part 2 it says 'Once you add a password to ssh-agent, you will not be asked to provide the key when using SSH or SCP to connect to hosts with your public key.' I didn't see any prompt to enter a password for ssh-agent.

    So what did I miss? please help me here.

    thank you so much.

  7. Hello,
    I am getting a problem starting the NameNode. On starting the cluster it shows the error "namenode: ssh connection: port 22: connection timed out". Can you please help me fix the problem?

  8. Thanks for the valuable learning; I feel grateful for it.
    It works very well.
    I have completed my infrastructure setup as mentioned in this tutorial.

    I have setup the VPC security group.
    And given the inbound rules as TCP and ICMP from all IP addresses.
    The outbound rules is all TCP by default.

    So I don't have to do the /etc/hosts file setup as mentioned here, since the instances will be able to talk to each other. Is this correct?

  9. Hi,

    I am getting a connection timeout error while connecting via PuTTY.
    Could you please help me?

    Thank you,
    Prajakta Shinde.
