Hadoop 2 Setup on 64-bit Ubuntu 12.04 – Part 1

NOTE: This post deals only with a minimal single-node cluster setup. Other posts will deal with issues related to resource allocation on a multi-node cluster.

While setting up Hadoop 2.2.0 on Ubuntu 12.04.3 LTS 64-bit (VM on Hyper-V), I had to refer to multiple resources and had to overcome some roadblocks. The procedure that worked for me is shared here in three posts:

  1. This post describes software setup and configuration.
  2. Part 2 describes starting up processes and running an example.
  3. Part 3 describes building native libraries for the 64-bit system, which gives a noticeable performance boost. The downloaded distribution ships 32-bit native binaries, so on a 64-bit system Hadoop falls back to its pure-Java implementations, which can’t match native performance.

Prerequisites

Before performing the setup steps below, I had to ensure that SSH and JDK6 were installed.

sudo apt-get install ssh
sudo apt-get install openjdk-6-jdk
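
As an optional sanity check (my addition, not part of the original steps), the following confirms that the JDK and the SSH server are in place:

java -version          # should report a 1.6.x OpenJDK runtime
service ssh status     # the OpenSSH server should be reported as running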

Create Hadoop User

It’s recommended that all Hadoop-related work be done while logged in as a user designated for this purpose. I named the user hadoop.

  1. Change to the root user (alternatively, prefix each command in steps 2-5 below with sudo).
    sudo -s

    Provide password when prompted.

  2. Create user.
    useradd -d /home/hadoop -m hadoop
  3. Set password for hadoop.
    passwd hadoop

    Provide the desired password for hadoop.

  4. Add hadoop to the sudo group so it can run commands with sudo.
    usermod -a -G sudo hadoop
  5. Set bash as the default shell for hadoop.
    usermod -s /bin/bash hadoop
  6. Switch to the hadoop user for the remaining steps.
    su hadoop
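
    Optionally, verify the account (this check is my addition): id should list the sudo group, and getent should show /bin/bash as the login shell.

    id hadoop              # the groups list should include sudo
    getent passwd hadoop   # the last field should be /bin/bash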

Configure ssh for hadoop

  1. Generate key and add to authorized keys.
    ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
    cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
  2. Set permissions and ownership on key files/folders.
    sudo chmod go-w $HOME $HOME/.ssh
    sudo chmod 600 $HOME/.ssh/authorized_keys
    sudo chown `whoami` $HOME/.ssh/authorized_keys

Test SSH Setup

  1. Connect.
    ssh localhost

    Say yes if prompted to trust the host key. The connection should open without asking for the hadoop password, since the key generated above is in authorized_keys; if a password prompt appears, revisit the previous section.

  2. Close test session.
    exit

Configure Environment

  1. Create ~/.bash_profile (in the hadoop user’s home directory) with the following contents. Note that HADOOP_PREFIX specifies where Hadoop lives; this can be a different path depending upon the desired setup.
    export HADOOP_PREFIX="/home/hadoop/product/hadoop-2.2.0"
    export PATH=$PATH:$HADOOP_PREFIX/bin
    export PATH=$PATH:$HADOOP_PREFIX/sbin
    export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
    export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
    export HADOOP_HDFS_HOME=${HADOOP_PREFIX}
    export YARN_HOME=${HADOOP_PREFIX}
  2. Source .bash_profile to make the environment variables effective.
    source .bash_profile
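
    As a quick check (my addition), confirm the variables took effect; the hadoop command itself will only resolve after the distribution is extracted in the next section.

    echo $HADOOP_PREFIX                       # should print /home/hadoop/product/hadoop-2.2.0
    echo $PATH | tr ':' '\n' | grep hadoop    # the bin and sbin entries should appear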

Configure Hadoop

  1. Download the distribution from one of the Apache mirrors.
  2. Extract the tar.gz distribution to the $HADOOP_PREFIX path configured above (a sample download-and-extract sketch appears after this list).
  3. Edit $HADOOP_PREFIX/etc/hadoop/core-site.xml to have the following contents. Note that another port may be specified. (Hadoop 2 prefers the name fs.defaultFS, but the older fs.default.name used below is still honored.)
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:8020</value>
        <description>The name of the default file system.  Either the
          literal string "local" or a host:port for HDFS.
        </description>
        <final>true</final>
      </property>
    </configuration>
  4. Edit $HADOOP_PREFIX/etc/hadoop/hdfs-site.xml to have the following contents. Note that other paths may be used in the configuration below. However, if a different path is used, it must be used consistently in all the steps throughout the setup.
    <configuration>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/hadoop/workspace/hadoop_space/hadoop2/dfs/name</value>
        <description>Determines where on the local filesystem the DFS name node
            should store the name table.  If this is a comma-delimited list
            of directories then the name table is replicated in all of the
            directories, for redundancy.
        </description>
        <final>true</final>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/hadoop/workspace/hadoop_space/hadoop2/dfs/data</value>
        <description>Determines where on the local filesystem a DFS data node
            should store its blocks.  If this is a comma-delimited
            list of directories, then data will be stored in all named
            directories, typically on different devices.
            Directories that do not exist are ignored.
        </description>
        <final>true</final>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.permissions</name>
        <value>false</value>
      </property>
    </configuration>
  5. Create the workspace paths used in the configuration above. This is where HDFS will store its data.
    mkdir -p /home/hadoop/workspace/hadoop_space/hadoop2/dfs/name
    mkdir -p /home/hadoop/workspace/hadoop_space/hadoop2/dfs/data
  6. If $HADOOP_PREFIX/etc/hadoop/mapred-site.xml doesn’t exist, copy it from the template (run from $HADOOP_PREFIX).
    cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml

    Edit mapred-site.xml to have the following contents.

    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
      <property>
        <name>mapred.system.dir</name>
        <value>file:/home/hadoop/workspace/hadoop_space/hadoop2/mapred/system</value>
        <final>true</final>
      </property>
      <property>
        <name>mapred.local.dir</name>
        <value>file:/home/hadoop/workspace/hadoop_space/hadoop2/mapred/local</value>
        <final>true</final>
      </property>
    </configuration>
  7. Create the mapreduce paths configured earlier.
    mkdir -p /home/hadoop/workspace/hadoop_space/hadoop2/mapred/system
    mkdir -p /home/hadoop/workspace/hadoop_space/hadoop2/mapred/local
  8. Edit $HADOOP_PREFIX/etc/hadoop/yarn-site.xml to have the following contents. Note that the mapreduce.shuffle value from previous versions must now be mapreduce_shuffle.
    <configuration>
    <!-- Site specific YARN configuration properties -->
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
      </property>
    </configuration>
  9. Edit $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh to set JAVA_HOME correctly. Use the correct path for the system (a small sketch for locating it follows this list).
    # The java implementation to use.
    #export JAVA_HOME=${JAVA_HOME}
    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
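
Two small sketches related to steps 2 and 9 above, for convenience only; they are my assumptions rather than part of the original procedure. The archive.apache.org URL is assumed to still host the 2.2.0 tarball (substitute a current mirror if not), and the readlink pipeline merely derives the JDK path that step 9 hard-codes.

    # Step 2 sketch: download and extract the distribution into the parent of $HADOOP_PREFIX.
    mkdir -p /home/hadoop/product
    cd /home/hadoop/product
    wget https://archive.apache.org/dist/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz
    tar -xzf hadoop-2.2.0.tar.gz    # produces hadoop-2.2.0/, matching HADOOP_PREFIX

    # Step 9 sketch: derive the JDK directory to use as JAVA_HOME.
    readlink -f /usr/bin/java | sed 's:/jre/bin/java::; s:/bin/java::'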

Prepare HDFS

  1. Format HDFS to prepare it for first use.
    hdfs namenode -format

    Review the output for successful completion.
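
    As an optional check (my addition), the name directory configured in hdfs-site.xml should now contain a current/ subdirectory:

    ls /home/hadoop/workspace/hadoop_space/hadoop2/dfs/name/current
    # expect files such as VERSION and fsimage_* if the format succeeded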

In part 2, we start up HDFS and YARN, and run an example.


7 Responses to Hadoop 2 Setup on 64-bit Ubuntu 12.04 – Part 1

  1. Pingback: Hadoop 2 Setup on 64-bit Ubuntu 12.04 – Part 2 | Data Heads

  2. fenix says:

    lol… there are many features you don’t have any clue about… setting up a cluster is much easier than you have written here… and dude, check your env variables and classpaths, you’ll definitely get errors while starting your cluster…

  3. dataheads says:

    Fenix,

    This post sets up a single-node cluster with minimal configuration. These are the precise steps that I have used, and they do work. I plan to write a separate post covering the details of multi-node cluster setup, including decisions about memory and other resources.

    Thanks for sharing your opinion though.

  4. Haifa says:

    Hello,

    I am a beginner to Linux, so how can I create .bash_profile in step 1 of configuring the environment?

    thanks,
    HA

  5. Hello everybody. This is one of the “easiest to follow” tutorials I have found. Very neat and precise. I too have set up a multi-node hadoop cluster inside Oracle Solaris 11.1 using zones. You can have a look at http://hashprompt.blogspot.in/2014/05/multi-node-hadoop-cluster-on-oracle.html

  6. sach says:

    Thanks so much for aux_service setting.
