After spending some time playing around on Single-Node pseudo-distributed cluster its time to get into real world hadoop. Depending on what works best – Its important to note that there are multiple ways to achieve this and I am going to cover how to setup multi-node hadoop cluster on Amazon EC2. We are going to setup 4 node hadoop cluster as below.
- NameNode (Master)
- DataNode (Slave1)
- DataNode (Slave2)
Here’s what you will need
- Amazon AWS Account
- PuTTy Windows Client (to connect to Amazon EC2 instance)
- PuTTYgen (to generate private key – this will be used in putty to connect to EC2 instance)
- WinSCP (secury copy)
This will be a two part series
In Part-1 I will cover infrastructure side as below
- Setting up Amazon EC2 Instances
- Setting up client access to Amazon Instances (using Putty.)
- Setup WinSCP access to EC2 instances
In Part-2 I will cover the hadoop multi node cluster installation
- Hadoop Multi-Node Installation and setup
1. Setting up Amazon EC2 Instances
With 4 node clusters and minimum volume size of 8GB there would be an average $2 of charge per day with all 4 running instances. You can stop the instance anytime to avoid the charge, but you will loose the public IP and host and restarting the instance will create new ones,. You can also terminate your Amazon EC2 instance anytime and by default it will delete your instance upon termination, so just be careful what you are doing.
1.1 Get Amazon AWS Account
If you do not already have a account, please create a new one. I already have AWS account and going to skip the sign-up process. Amazon EC2 comes with eligible free-tier instances.
1.2 Launch Instance
Once you have signed up for Amazon account. Login to Amazon Web Services, click on My Account and navigate to Amazon EC2 Console
1.3 Select AMI
I am picking Ubuntu Server 12.04.3 Server 64-bit OS
Select the micro instance
1.5 Configure Number of Instances
As mentioned we are setting up 4 node hadoop cluster, so please enter 4 as number of instances. Please check Amazon EC2 free-tier requirements, you may setup 3 node cluster with < 30GB storage size to avoid any charges. In production environment you want to have SecondayNameNode as separate machine
1.6 Add Storage
Minimum volume size is 8GB
1.7 Instance Description
Give your instance name and description
Create a new security group, later on we are going to modify the security group with security rules.
1.9 Launch Instance and Create Security Pair
Review and Launch Instance.
Amazon EC2 uses public–key cryptography to encrypt and decrypt login information. Public–key cryptography uses a public key to encrypt a piece of data, such as a password, then the recipient uses the private key to decrypt the data. The public and private keys are known as a key pair.
Create a new keypair and give it a name “hadoopec2cluster” and download the keypair (.pem) file to your local machine. Click Launch Instance
1.10 Launching Instances
Once you click “Launch Instance” 4 instance should be launched with “pending” state
Once in “running” state we are now going to rename the instance name as below.
- HadoopNameNode (Master)
- HadoopSlave1 (data node will reside here)
- HaddopSlave2 (data node will reside here)
Please note down the Instance ID, Public DNS/URL (ec2-54-209-221-112.compute-1.amazonaws.com) and Public IP for each instance for your reference.. We will need it later on to connect from Putty client. Also notice we are using “HadoopEC2SecurityGroup”.
You can use the existing group or create a new one. When you create a group with default options it add a rule for SSH at port 22.In order to have TCP and ICMP access we need to add 2 additional security rules. Add ‘All TCP’, ‘All ICMP’ and ‘SSH (22)’ under the inbound rules to “HadoopEC2SecurityGroup”. This will allow ping, SSH, and other similar commands among servers and from any other machine on internet. Make sure to “Apply Rule changes” to save your changes.
These protocols and ports are also required to enable communication among cluster servers. As this is a test setup we are allowing access to all for TCP, ICMP and SSH and not bothering about the details of individual server port and security.
2.Setting up client access to Amazon Instances
Now, lets make sure we can connect to all 4 instances.For that we are going to use Putty client We are going setup password-less SSH access among servers to setup the cluster. This allows remote access from Master Server to Slave Servers so Master Server can remotely start the Data Node and Task Tracker services on Slave servers.
We are going to use downloaded hadoopec2cluster.pem file to generate the private key (.ppk). In order to generate the private key we need Puttygen client. You can download the putty and puttygen and various utilities in zip from here.
2.1 Generating Private Key
Let’s launch PUTTYGEN client and import the key pair we created during launch instance step – “hadoopec2cluster.pem”
Navigate to Conversions and “Import Key”
Once you import the key You can enter passphrase to protect your private key or leave the passphrase fields blank to use the private key without any passphrase. Passphrase protects the private key from any unauthorized access to servers using your machine and your private key.
Any access to server using passphrase protected private key will require the user to enter the passphrase to enable the private key enabled access to AWS EC2 server.
2.2 Save Private Key
Now save the private key by clicking on “Save Private Key” and click “Yes” as we are going to leave passphrase empty.
Save the .ppk file and give it a meaningful name
Now we are ready to connect to our Amazon Instance Machine for the first time.
2.3 Connect to Amazon Instance
Let’s connect to HadoopNameNode first. Launch Putty client, grab the public URL , import the .ppk private key that we just created for password-less SSH access. As per amazon documentation, for Ubuntu machines username is “ubuntu”
2.3.1 Provide private key for authentication
2.3.2 Hostname and Port and Connection Type
and “Open” to launch putty session
and will prompt you for the username, enter ubuntu, if everything goes well you will be presented welcome message with Unix shell at the end.
Similarly connect to remaining 3 machines HadoopSecondaryNameNode, HaddopSlave1,HadoopSlave2 respectively to make sure you can connect successfully.
2.4 Enable Public Access
Issue ifconfig command and note down the ip address. Next, we are going to update the hostname with ec2 public URL and finally we are going to update /etc/hosts file to map the ec2 public URL with ip address. This will help us to configure master ans slaves nodes with hostname instead of ip address.
Following is the output on HadoopNameNode ifconfig
now, issue the hostname command, it will display the ip address same as inet address from ifconfig command.
We need to modify the hostname to ec2 public URL with below command
$ sudo hostname ec2-54-209-221-112.compute-1.amazonaws.com
2.5 Modify /etc/hosts
Lets change the host to EC2 public IP and hostname.
Open the /etc/hosts in vi, in a very first line it will show 127.0.0.1 localhost, we need to replace that with amazon ec2 hostname and ip address we just collected.
Modify the file and save your changes
Repeat 2.3 and 2.4 sections for remaining 3 machines.
3. Setup WinSCP access to EC2 instances
In order to securely transfer files from your windows machine to Amazon EC2 WinSCP is a handy utility.
Provide hostname, username and private key file and save your configuration and Login
If you see above error, just ignore and you upon successful login you will see unix file system of a logged in user /home/ubuntu your Amazon EC2 Ubuntu machine.
Upload the .pem file to master machine (HadoopNameNode). It will be used while connecting to slave nodes during hadoop startup daemons.
If you have made this far, good work! Things will start to get more interesting in Part-2.