Hadoop Installation and Configuration on the Pi

If you have followed the various articles in the series on building a Raspberry Pi cluster for Hadoop, you are now at the final hurdle: the actual Hadoop installation and configuration. Since Hadoop is a Java package, the process is relatively simple.

Quick Checklist

  1. Create directory structure and change ownership to hduser on each machine
  2. Download and untar the Hadoop package (we are using 2.7.3)
  3. Modify the environmental variables for the hduser
  4. Create / edit the Hadoop configuration files
  5. Copy the complete installation to each slave machine
  6. Setup HDFS
  7. Start the cluster and check operation

 

Directory Structure

Log into the master machine (data-master). We are using the /opt/ directory in the main filesystem. Using root or sudo, create the following directories and change their ownership:

mkdir /opt/hadoop
mkdir /opt/hdfs
chown -R hduser:hadoop /opt/hadoop
chown -R hduser:hadoop /opt/hdfs

Repeat this on each machine in the cluster (data-slave01, data-slave02 and data-slave03).
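
If hduser can already SSH to each slave and run sudo there (an assumption; adjust to however you administer the slaves), a short loop saves repeating the commands by hand. A minimal sketch (sudo may prompt for a password on each host):

for host in data-slave01 data-slave02 data-slave03; do
    # create the Hadoop and HDFS directories and hand them to hduser on each slave
    ssh hduser@$host "sudo mkdir -p /opt/hadoop /opt/hdfs && sudo chown -R hduser:hadoop /opt/hadoop /opt/hdfs"
done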

Get Hadoop

This article is based on Hadoop release version 2.7.3, the latest stable release at the time of writing. The instructions for Hadoop installation should work for any 2.7.x release, but check the release notes for the version you are installing for any differences. You can grab the latest release and notes from the main Apache Hadoop site (hadoop.apache.org).

Note: Hadoop Release 3.0.0 is currently in alpha. I have not tested this version on the Pi and would recommend sticking with 2.7.x for now. Do get in touch if you have any experience with 3.0.0 on the Pi!

To download the release package directly to the data-master machine, log in as hduser and execute the following:

wget http://www.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz

This will download the tar.gz file to your /home/hduser/ directory.
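
Before unpacking, it is worth checking that the download is intact. A minimal sketch using sha256sum; compare the value printed against the checksum published for hadoop-2.7.3.tar.gz on the Apache download pages:

# print the SHA-256 checksum of the downloaded archive
sha256sum /home/hduser/hadoop-2.7.3.tar.gz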

Note: it is simpler to complete the installation below on data-master and then copy the whole thing (/opt/hadoop/) over to each of the slaves, eliminating the repeated untar, copy and configuration editing exercise on each machine (oh yes, lesson learned)!

You can uncompress the Hadoop package and copy it into place as follows:

tar xvf hadoop-2.7.3.tar.gz
mv hadoop-2.7.3/ /opt/hadoop

Set environment variables

You can quickly set the environment variables permanently for Hadoop (and Java if required) by editing the .bashrc file. Open the file with an editor such as vi and add the following lines at the end of the file:

# -- HADOOP ENVIRONMENT VARIABLES START -- #
export HADOOP_HOME=/opt/hadoop/hadoop-2.7.3
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
# -- HADOOP ENVIRONMENT VARIABLES END -- #

# Manually added for JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre

Save the changes. For simplicity, copy this file to all the slave machines as follows:

scp .bashrc hduser@data-slave01:/home/hduser/
scp .bashrc hduser@data-slave02:/home/hduser/
scp .bashrc hduser@data-slave03:/home/hduser/

You will need to log out and back in again as hduser to activate the bash configuration file changes.
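
Once logged back in, a quick sanity check confirms that the variables have taken effect and that the hadoop command from the unpacked release is on the PATH:

echo $HADOOP_HOME
echo $JAVA_HOME
hadoop version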

Hadoop Configuration Files

There are a few files to create or modify in this section. Use a simple text editor such as vi. (If you are a jEdit user then you can edit in-place from your PC using the sftp extension – make sure you log in as hduser. The prompt should look something like hduser@data-master:~ $ .)

master

Create the master node definition file in /opt/hadoop/hadoop-2.7.3/etc/hadoop/ as follows:

vi /opt/hadoop/hadoop-2.7.3/etc/hadoop/master

Insert a single hostname (data-master) on a single line. Nothing else to add:

data-master

slaves

On the master machine (data-master) only, create a file which lists the slave machines:

vi /opt/hadoop/hadoop-2.7.3/etc/hadoop/slaves

This file contains:

data-slave01
data-slave02
data-slave03

core-site.xml

The core-site.xml file provided in the package contains no configuration. Edit the file:

vi /opt/hadoop/hadoop-2.7.3/etc/hadoop/core-site.xml

and insert the following between the <configuration> tags:

<property>
	<name>fs.default.name</name>
	<value>hdfs://data-master:9000/</value>
</property>
<property>
	<name>fs.defaultFS</name>
	<value>hdfs://data-master:9000/</value>
</property>

Save the file.
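
As a quick check (the cluster does not need to be running for this), the getconf helper should echo back the value you just set:

hdfs getconf -confKey fs.defaultFS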

hdfs-site.xml

The hdfs-site.xml file provided in the package contains no configuration. Edit the file:

vi /opt/hadoop/hadoop-2.7.3/etc/hadoop/hdfs-site.xml

and insert the following between the <configuration> tags:

<property>
	<name>dfs.datanode.data.dir</name>
		<value>/opt/hdfs/datanode</value>
		<final>true</final>
</property>
<property>
	<name>dfs.namenode.name.dir</name>
		<value>/opt/hdfs/namenode</value>
		<final>true</final>
</property>
<property>
	<name>dfs.namenode.http-address</name>
		<value>data-master:50070</value>
</property>
<property>
	<name>dfs.replication</name>
		<value>3</value>
</property>
<property>
	<name>dfs.blocksize</name>
		<value>5242880</value>
	<description>Reduce blocksize for new files to 5MB from default 128MB (smaller files expected)</description>
</property>

Save the file. Note the reduced blocksize, chosen to suit the smaller files expected on the Pi cluster for training purposes.
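
For reference, 5242880 bytes is 5 x 1024 x 1024. Once the cluster is up and you have put a file into HDFS, you can confirm the block size and replication factor actually applied to it; the path below is just a hypothetical example:

# %o = block size, %r = replication factor, %n = file name
hdfs dfs -stat "block size %o, replication %r, name %n" /user/hduser/example.txt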

yarn-site.xml

The yarn-site.xml file provided in the package contains no configuration. Edit the file:

vi /opt/hadoop/hadoop-2.7.3/etc/hadoop/yarn-site.xml

and insert the following between the <configuration> tags:

<property>
	<name>yarn.nodemanager.aux-services</name>
	<value>mapreduce_shuffle</value>
</property>
<property>
	<name>yarn.nodemanager.resource.cpu-vcores</name>
    		<value>4</value>
</property>
<property>
	<name>yarn.nodemanager.resource.memory-mb</name>
    		<value>512</value>
</property>
<property>
	<name>yarn.scheduler.minimum-allocation-mb</name>
    		<value>128</value>
</property>
<property>
	<name>yarn.scheduler.maximum-allocation-mb</name>
    		<value>256</value>
</property>
<property>
	<name>yarn.scheduler.minimum-allocation-vcores</name>
    		<value>1</value>
</property>
<property>
	<name>yarn.scheduler.maximum-allocation-vcores</name>
    		<value>4</value>
</property>
<property>
	<name>yarn.nodemanager.vmem-check-enabled</name>
   		<value>false</value>
   		<description>Whether virtual memory limits will be enforced for containers</description>
</property>
<property>
	<name>yarn.nodemanager.vmem-pmem-ratio</name>
   		<value>4</value>
   		<description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>
<property>
       	<name>yarn.resourcemanager.resource-tracker.address</name>
        	<value>data-master:8025</value>
</property>
<property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>data-master:8035</value>
</property>
<property>
        <name>yarn.resourcemanager.address</name>
        <value>data-master:8050</value>
</property>    

Save the file.
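
Taken together, these values mean each NodeManager advertises 512 MB of memory and 4 vcores, with individual containers sized between 128 MB and 256 MB, so each Pi will run at most four small containers at a time. Once YARN is started (see below), you can check what each node is actually offering:

# lists registered NodeManagers and their advertised resources
yarn node -list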

mapred-site.xml

This file can be built from the template example but the simplest way is just to create the file using vi as follows:

vi /opt/hadoop/hadoop-2.7.3/etc/hadoop/mapred-site.xml

and insert the following into the new file (unlike the files above, this one needs its own <configuration> wrapper as it is created from scratch):

<?xml version="1.0"?>
<configuration>
    <property>
        <name>mapreduce.job.tracker</name>
        <value>data-master:5431</value>
    </property>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

As ever, save the file.

At this point – and assuming you have not yet installed Hadoop into the /opt/ directory on each of the slaves – you can copy the whole package including configuration files to each slave from the master using scp:

scp -r /opt/hadoop/hadoop-2.7.3 hduser@data-slave01:/opt/hadoop/
scp -r /opt/hadoop/hadoop-2.7.3 hduser@data-slave02:/opt/hadoop/
scp -r /opt/hadoop/hadoop-2.7.3 hduser@data-slave03:/opt/hadoop/

NB: you will need to delete the slaves file from each slave as it is not needed (execute rm /opt/hadoop/hadoop-2.7.3/etc/hadoop/slaves on each slave machine).
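
Again, if hduser can SSH to the slaves, a short loop takes care of this. A minimal sketch assuming the hostnames above:

for host in data-slave01 data-slave02 data-slave03; do
    # remove the unneeded slaves file from each slave's configuration directory
    ssh hduser@$host "rm /opt/hadoop/hadoop-2.7.3/etc/hadoop/slaves"
done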

Setup HDFS

At each slave – data-slave01, data-slave02 and data-slave03 – logged in as hduser, execute the following:

mkdir /opt/hdfs/datanode

At data-master, logged in as hduser, execute the following:

mkdir /opt/hdfs/namenode
mkdir /opt/hdfs/datanode
cd $HADOOP_HOME/bin/
./hdfs namenode -format

This should format the hdfs ‘volume’ and make it ready for use.
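
If the format succeeded, the command output should include a line reporting that the storage directory /opt/hdfs/namenode has been successfully formatted, and the namenode directory will now contain a fresh metadata area. A quick check:

# the freshly formatted namenode should contain a current/ directory with a VERSION file
ls /opt/hdfs/namenode/current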

Start the Cluster

The moment of truth. You should now be able to log into the namenode (data-master) which controls the cluster, and start Hadoop. There are two components to the application we have installed – the filesystem (HDFS) and the resource controller (YARN). Start each as follows:

/opt/hadoop/hadoop-2.7.3/sbin/start-dfs.sh

If all is well, the script will display messages similar to the following before returning to the prompt:

<date and time> WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [data-master]
data-master: starting namenode, logging to /opt/hadoop/hadoop-2.7.3/logs/hadoop-hduser-namenode-data-master.out
data-master: starting datanode, logging to /opt/hadoop/hadoop-2.7.3/logs/hadoop-hduser-datanode-data-master.out
data-slave03: starting datanode, logging to /opt/hadoop/hadoop-2.7.3/logs/hadoop-hduser-datanode-data-slave03.out
data-slave01: starting datanode, logging to /opt/hadoop/hadoop-2.7.3/logs/hadoop-hduser-datanode-data-slave01.out
data-slave02: starting datanode, logging to /opt/hadoop/hadoop-2.7.3/logs/hadoop-hduser-datanode-data-slave02.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /opt/hadoop/hadoop-2.7.3/logs/hadoop-hduser-secondarynamenode-data-master.out
<date and time> WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Note that the warnings about native code are normal. Then execute the following to start YARN:

/opt/hadoop/hadoop-2.7.3/sbin/start-yarn.sh

and again, you should see messages like:

starting yarn daemons
starting resourcemanager, logging to /opt/hadoop/hadoop-2.7.3/logs/yarn-hduser-resourcemanager-data-master.out
data-master: starting nodemanager, logging to /opt/hadoop/hadoop-2.7.3/logs/yarn-hduser-nodemanager-data-master.out
data-slave02: starting nodemanager, logging to /opt/hadoop/hadoop-2.7.3/logs/yarn-hduser-nodemanager-data-slave02.out
data-slave03: starting nodemanager, logging to /opt/hadoop/hadoop-2.7.3/logs/yarn-hduser-nodemanager-data-slave03.out
data-slave01: starting nodemanager, logging to /opt/hadoop/hadoop-2.7.3/logs/yarn-hduser-nodemanager-data-slave01.out

Hadoop is now running happily on the cluster. You can verify this further by browsing to the main web interface for the cluster. Open a browser and point it at port 8088 on data-master (192.168.15.130). You should see a screen along the lines of:

[Screenshot: Hadoop web UI, main cluster overview page]

There are a number of useful information sources on this interface so browse around the links on the left.

[Screenshot: Hadoop web UI, cluster nodes page]

You can also check on the individual namenode health and related information by browsing to port 50070 on 192.168.15.130. There are also a few checks worth running from the CLI to get a feel for the amount of operational information available on the filesystem and cluster:

jps – this provides an equivalent to the Linux ps command but for Java processes. Run this on the namenode (the data-master server in our cluster) and you should get a response similar to the following:

1635 DataNode
2035 ResourceManager
2451 Jps
1531 NameNode
1803 SecondaryNameNode
2141 NodeManager
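
Another check worth running from data-master once everything is up is dfsadmin, which summarises HDFS capacity and lists the datanodes; all four (the three slaves plus the datanode on the master) should appear as live:

# summarises HDFS capacity and lists the live datanodes
hdfs dfsadmin -report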

Stop the cluster

All is fine and dandy. To shut down the cluster, simply run the shutdown scripts provided:

/opt/hadoop/hadoop-2.7.3/sbin/stop-yarn.sh
/opt/hadoop/hadoop-2.7.3/sbin/stop-dfs.sh

Check out the related articles as there are a number of quick applications to run against the cluster. A common favourite is the word count exercise which is also a useful introduction to the whole business of using Hadoop with Java. There are other ways to push data and run analytics and these will be explored in other articles but don’t forget that there is a wealth of information on the ‘net to guide your training on the cluster.
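
As a taste of the word count exercise, the examples jar shipped with the release can be run directly against the cluster. A minimal sketch only; the HDFS input and output paths are hypothetical, and the output directory must not already exist:

# create an input directory in HDFS and load some text to count
hdfs dfs -mkdir -p /user/hduser/input
hdfs dfs -put /opt/hadoop/hadoop-2.7.3/etc/hadoop/*.xml /user/hduser/input
# run the bundled wordcount example and print the result
hadoop jar /opt/hadoop/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /user/hduser/input /user/hduser/output
hdfs dfs -cat /user/hduser/output/part-r-*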

Jump back to the main Hadoop article here.
