NOTE: This post deals with only a minimal single-node cluster setup. Other posts will deal with various issues related to resource allocation on a multi-node cluster.
While setting up Hadoop 2.2.0 on Ubuntu 12.04.3 LTS 64-bit (a VM on Hyper-V), I had to refer to multiple resources and overcome some roadblocks. The procedure that worked for me is shared here in three posts:
- This post describes software setup and configuration.
- Part 2 describes starting up processes and running an example.
- Part 3 describes building native libraries for the 64-bit system, which gives a noticeable performance boost. The downloaded distribution contains 32-bit native binaries, so a 64-bit system falls back to the pure-Java implementations, which can’t match native performance.
Prerequisites
Before performing the setup steps below, I had to ensure that SSH and JDK6 were installed.
sudo apt-get install ssh
sudo apt-get install openjdk-6-jdk
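To confirm the prerequisites are in place before moving on, a quick check along these lines can help (a sketch; the exact version strings in the output will vary):
java -version                 # should report an OpenJDK 1.6 runtime
dpkg -l ssh openjdk-6-jdk     # both packages should show status "ii" (installed)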
Create Hadoop User
It’s recommended that all Hadoop-related work be performed while logged in as a designated user for this purpose. I named the user hadoop.
- Change to the root user (alternatively, type sudo in front of each command in steps 2-5 below).
sudo -s
Provide the password when prompted.
- Create user.
useradd -d /home/hadoop -m hadoop
- Set the password for hadoop.
passwd hadoop
Provide the desired password for hadoop.
- Add hadoop to the sudo group so it can run commands with sudo (a quick membership check appears after this list).
usermod -a -G sudo hadoop
- Set bash as the default shell.
usermod -s /bin/bash hadoop
- Connect as hadoop for the remaining steps.
su hadoop
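Before moving on, it can be worth confirming that the new account and its group membership look right; a minimal check (output details will vary with the system):
id hadoop          # should list the hadoop user and show sudo among its groups
groups hadoop      # should include both hadoop and sudo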
Configure SSH for hadoop
- Generate key and add to authorized keys.
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
- Set permissions and ownership on key files/folders.
sudo chmod go-w $HOME $HOME/.ssh
sudo chmod 600 $HOME/.ssh/authorized_keys
sudo chown `whoami` $HOME/.ssh/authorized_keys
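To confirm the permissions ended up as intended, a quick listing can help (a sketch; ownership should be the hadoop user and authorized_keys should end up with mode 600):
ls -ld $HOME $HOME/.ssh            # neither should be writable by group or other
ls -l $HOME/.ssh/authorized_keys   # should show -rw------- and be owned by hadoop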
Test SSH Setup
- Connect.
ssh localhost
Answer yes if prompted to trust the host. An SSH connection should open; log in with the hadoop user/password.
- Close test session.
exit
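Since the Hadoop start-up scripts use SSH non-interactively, it can also be worth confirming that key-based login works without a password prompt; a minimal check (BatchMode disables password prompting, so it fails loudly if the key is not being picked up):
ssh -o BatchMode=yes localhost 'echo key-based login OK'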
Configure Environment
- Create .bash_profile with the following contents. Note that HADOOP_PREFIX specifies where Hadoop lives. This can be a different path depending upon the desired setup.
export HADOOP_PREFIX="/home/hadoop/product/hadoop-2.2.0"
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
export HADOOP_HDFS_HOME=${HADOOP_PREFIX}
export YARN_HOME=${HADOOP_PREFIX}
- Source .bash_profile to make the environment variables effective.
source .bash_profile
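A quick way to confirm the variables took effect (note that the hadoop command itself will only resolve after the distribution is extracted in the next section):
echo $HADOOP_PREFIX    # should print /home/hadoop/product/hadoop-2.2.0
hadoop version         # should report Hadoop 2.2.0 once the distribution is in place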
Configure Hadoop
- Download the distribution from one of the mirrors.
- Unzip and extract the distribution to the $HADOOP_PREFIX path as configured above.
- Edit $HADOOP_PREFIX/etc/hadoop/core-site.xml to have the following contents. Note that another port may be specified.
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
    <description>The name of the default file system. Either the literal string "local" or a host:port for HDFS.</description>
    <final>true</final>
  </property>
</configuration>
- Edit $HADOOP_PREFIX/etc/hadoop/hdfs-site.xml to have the following contents. Note that other paths may be used in the configuration below. However, if a different path is used, it must be used consistently in all the steps throughout the setup.
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/hadoop/workspace/hadoop_space/hadoop2/dfs/name</value>
    <description>Determines where on the local filesystem the DFS name node should store the name table. If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.</description>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/hadoop/workspace/hadoop_space/hadoop2/dfs/data</value>
    <description>Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
- Create the workspace paths used in the configuration earlier. This is where HDFS lives.
mkdir -p /home/hadoop/workspace/hadoop_space/hadoop2/dfs/name
mkdir -p /home/hadoop/workspace/hadoop_space/hadoop2/dfs/data
- If mapred-site.xml doesn’t exist, copy it from the template.
cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
Edit mapred-site.xml to have the following contents.
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>file:/home/hadoop/workspace/hadoop_space/hadoop2/mapred/system</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>file:/home/hadoop/workspace/hadoop_space/hadoop2/mapred/local</value>
    <final>true</final>
  </property>
</configuration>
- Create the MapReduce paths configured earlier.
mkdir -p /home/hadoop/workspace/hadoop_space/hadoop2/mapred/system
mkdir -p /home/hadoop/workspace/hadoop_space/hadoop2/mapred/local
- Edit yarn-site.xml to have the following contents. Note that mapreduce.shuffle from previous versions needs to be mapreduce_shuffle now.
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
- Edit hadoop-env.sh to set the JAVA_HOME correctly. Use the correct path for the system.
# The java implementation to use.
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
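With the configuration files above in place, a rough way to confirm they are being picked up is to query them back through the command-line tools (a sketch, assuming the keys and values configured above; hdfs getconf reads the effective configuration):
hadoop version                           # confirms the binaries resolve via $HADOOP_PREFIX
hdfs getconf -confKey fs.default.name    # should print hdfs://localhost:8020
hdfs getconf -confKey dfs.replication    # should print 1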
Prepare HDFS
- Format HDFS to prepare it for first use.
hdfs namenode -format
Review the output for successful completion.
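Beyond scanning the log output, one hedged way to double-check the format step is to look at the name directory configured earlier; after a successful format it should contain a current subdirectory with VERSION and fsimage files:
ls /home/hadoop/workspace/hadoop_space/hadoop2/dfs/name/current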
In part 2, we start up HDFS and YARN, and run an example.
lol… there are many features you don’t have any clue about… setting up a cluster is much easier than you have written here… and dude, check your env variables and classpaths, you’ll definitely get errors while starting your cluster.
Fenix,
This post sets up a single-node cluster with minimal configuration. These are the precise steps that I have used, and they do work. I plan to write a separate post covering multi-node cluster setup and the decisions about memory and other resources.
Thanks for sharing your opinion though.
Hello,
I am a beginner to Linux, so how can I create .bash_profile in step 1 of Configure Environment?
thanks,
HA
Hello everybody. This is one of the “easiest to follow” tutorials I have found. Very neat and precise. I too have set up a multi-node Hadoop cluster inside Oracle Solaris 11.1 using zones. You can have a look at http://hashprompt.blogspot.in/2014/05/multi-node-hadoop-cluster-on-oracle.html
Thanks so much for the aux-services setting.