USB Disks on Pi

6 Apr - by Simon - In SysAdmin

[Image: Hadoop Pi cluster with USB stick storage]

One of the unknowns arising from the Hadoop on Pi project is whether HDFS shortens the life of the SD card on which the operating system runs. As an alternative, USB disks on Pi can be used as the physical devices on which the HDFS volumes are mounted. This article runs through how to add them to the existing Pi cluster. Whilst the article relates to that project, it is also useful if you just want to use USB storage on your Pi (or any similar Linux platform).

Quick checklist

  1. (optionally) format your USB disks
  2. Insert the USB disks and establish the device ID (UUID) of each one (repeat for each Pi in the cluster)
  3. Mount the device (repeat for each Pi in the cluster)
  4. Adjust fstab to automatically mount the device on boot (repeat for each Pi in the cluster)
  5. Adjust the Hadoop configuration to use the USB device(s)
  6. Initialise and use the new storage for HDFS


For the project, I have used SanDisk Cruzer 16GB USB devices. These come already formatted for FAT32 which is usable directly by the Pi. Note that these sticks also contain an encryption programme – provided by SanDisk – which can be deleted. I have not attempted to use this encryption for HDFS although it might be an interesting side project.

1. Format the USB device – my SanDisk sticks are already formatted as FAT32 so I have skipped this. If you want a particular format then you can format it on your PC or Mac as you would any similar device.
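If you would rather format a stick from the Pi itself, the sketch below is one way to do it. It assumes the stick shows up as /dev/sda1, is not currently mounted, and that the dosfstools package (which provides mkfs.vfat) is installed. Double-check the device name with lsblk first, because formatting the wrong device will destroy its contents.

# confirm which block device is the USB stick before formatting
lsblk

# create a FAT32 filesystem on the stick (assumed here to be /dev/sda1)
sudo mkfs.vfat -F 32 /dev/sda1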


2. Insert the USB sticks into the Pi devices in the cluster and power up. You need to find, and note, the unique identifier for each disk. Execute the following command:

ls -l /dev/disk/by-uuid/

This presents a list of devices – look for the entry labelled “../../sda1” – this should refer to the USB stick you inserted:

[Image: output of ls -l /dev/disk/by-uuid/ showing the USB stick as ../../sda1]

The example shows a UUID of 88D6-5AFC – make a note of it, as you will need it to mount the device and to record it permanently for future boots.
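If you prefer, the same information can be read with blkid, which prints the UUID and filesystem type for a given device (assuming, as above, that the stick is /dev/sda1):

sudo blkid /dev/sda1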

3. Mount the device – first you need to create a mount point in the existing filesystem. This is a directory – it can be named anything suitable. As this article is concerned with mounting a USB stick to use as an HDFS volume then you can use – for example – /usb/hdfs

sudo mkdir /usb/

sudo mkdir /usb/hdfs

Then change ownership to ensure the directory can be written to by the hduser (note that the user and group match those created in the main articles when the Hadoop cluster was built):

sudo chown -R hduser:hadoop /usb/

This gives us a mount point for the USB stick within the Raspbian (or other Linux variant, if you are adding to a non-Pi device) file system. At this point you can (temporarily) mount the device, giving ownership to the hduser and the hadoop group, and use it normally as follows:

sudo mount /dev/sda1 /usb/hdfs -o uid=hduser,gid=hadoop
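To confirm the mount worked, check the mount table and try a test write as the hduser (the filename here is just an example):

df -h /usb/hdfs
touch /usb/hdfs/test-file && ls -l /usb/hdfs/
rm /usb/hdfs/test-file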

Under the hduser, you can now read from, and write to, the device as part of the filesystem without using sudo. If you want to un-mount the device – this happens automatically if you power down the Pi – you can execute the following command. This should be done before physically removing the USB device from a running Pi:

sudo umount /usb/hdfs

You can also execute the un-mount command once the device has been added to fstab (see below) and made an otherwise permanent addition to the filesystem. Note that the system will attempt to re-mount the device when the Pi is next booted, unless you remove or comment out the entry in fstab (see the next section):

sudo umount /usb/hdfs/

4. Modify fstab to include the new device – I used vi to add in the following line:

UUID=88D6-5AFC /usb/hdfs vfat auto,nofail,noatime,users,rw,uid=hduser,gid=hadoop 0 0

To give an fstab that looks similar to the following (note the comment lines – beginning with a # – are added for readability):

[Image: /etc/fstab with the new USB device entry added]

Obviously, don't forget to save the file after adding the line. The system will now automatically add the device to the filesystem at the indicated point (/usb/hdfs/) every time the Pi boots. This is good enough for using any USB stick on a Pi. For the Hadoop project, we also need to modify the Hadoop configuration to make use of the new storage, and initialise it.
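You can also test the new fstab entry without rebooting – un-mount the stick if it is still mounted from step 3, then ask the system to mount everything listed in fstab and check the result:

sudo umount /usb/hdfs
sudo mount -a
df -h /usb/hdfs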

NB: If you are adding USB devices to the Hadoop cluster, don’t forget to repeat these steps – identifying the USB device UUID and adding it to fstab – for each computer in the cluster. Make sure you reboot each machine to ensure the new fstab is loaded and the USB stick is accessible as /usb/hdfs/.
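Once each Pi has been rebooted, a quick way to confirm every node is ready is a small SSH loop run from the master. This is only a sketch – it assumes the hostnames used elsewhere in this series (data-slave01 to data-slave03) and that passwordless SSH for the hduser is already set up:

for host in data-slave01 data-slave02 data-slave03; do
    echo "--- $host ---"
    ssh hduser@$host 'df -h /usb/hdfs'
done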

5. Adjust the Hadoop configuration – make sure HDFS and YARN are stopped on the cluster before modifying the setup. Again, it is important that each computer in the cluster is modified and has its USB storage installed before re-starting any part of Hadoop. If you followed the Hadoop installation guide in the article series for Hadoop on Pi, then you will find the configuration files on each computer in the cluster in:

/opt/hadoop/hadoop-2.7.3/etc/hadoop/

Modify (use vi or some other editor) the hdfs-site.xml file. Change both the dfs.datanode.data.dir and the dfs.namenode.name.dir properties to reflect the new storage. In particular, change:

<property>
         <name>dfs.datanode.data.dir</name>
         <value>/opt/hdfs/datanode</value>
         <final>true</final>
</property>

to

<property>
         <name>dfs.datanode.data.dir</name>
         <value>/usb/hdfs/datanode</value>
         <final>true</final>
</property>

AND also change

<property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/hdfs/namenode</value>
        <final>true</final>
</property>

to

<property>
        <name>dfs.namenode.name.dir</name>
        <value>/usb/hdfs/namenode</value>
        <final>true</final>
</property>

Again, you can make the change manually on each computer in the cluster or you can use SSH to copy the amended configuration to each device with the scp command. For example, to copy hdfs-site.xml to the data-slave01 machine:

scp /opt/hadoop/hadoop-2.7.3/etc/hadoop/hdfs-site.xml hduser@data-slave01:/opt/hadoop/hadoop-2.7.3/etc/hadoop/hdfs-site.xml
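To push the same file to every slave in one go, a short loop works (again assuming the hostnames used in this series and passwordless SSH for the hduser):

for host in data-slave01 data-slave02 data-slave03; do
    scp /opt/hadoop/hadoop-2.7.3/etc/hadoop/hdfs-site.xml hduser@$host:/opt/hadoop/hadoop-2.7.3/etc/hadoop/hdfs-site.xml
done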

Whilst logged in as the hduser, create the following directories on the master (namenode) at data-master:

mkdir /usb/hdfs/namenode

mkdir /usb/hdfs/datanode

And finally, at each of the slave nodes (data-slave01 through data-slave03) create the following directory:

mkdir /usb/hdfs/datanode
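If you would rather not log in to each slave in turn, the datanode directories can also be created remotely from data-master (again a sketch, assuming passwordless SSH for the hduser):

for host in data-slave01 data-slave02 data-slave03; do
    ssh hduser@$host 'mkdir -p /usb/hdfs/datanode'
done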

At this point, you can initialise the HDFS volume from the master node (data-master).

6. Initialise the HDFS volume by executing the following commands from the master / namenode computer (data-master) as hduser:

cd $HADOOP_HOME/bin/
./hdfs namenode -format

If successful, you should see a message on data-master which ends with:

INFO util.ExitUtil: Exiting with status 0
INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at data-master/192.168.15.130
************************************************************/

The cluster is now configured to use HDFS on the USB sticks. Test this by bringing up the cluster with the relevant scripts. Job done.
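As a quick sanity check after bringing the cluster up (assuming the standard start-dfs.sh and start-yarn.sh scripts used in the main article series), hdfs dfsadmin -report lists each datanode and its configured capacity, which should now reflect the USB sticks rather than the SD cards:

cd $HADOOP_HOME/sbin/
./start-dfs.sh
./start-yarn.sh

cd $HADOOP_HOME/bin/
./hdfs dfsadmin -report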

Final note: this configuration retains logging to the original /opt/ directory on the SD card. It might be worthwhile also modifying the configuration to log to the new USB location. (Hint: modify log4j.properties to set the hadoop.log.dir property).
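As a sketch of that hint: rather than editing log4j.properties on every node, one option is to point HADOOP_LOG_DIR (which the Hadoop scripts pass through to log4j as hadoop.log.dir) at the USB stick via hadoop-env.sh. The path below is just a suggestion:

# create a log directory on the USB stick (any path on the stick will do)
mkdir -p /usb/hdfs/logs

# then add this line to /opt/hadoop/hadoop-2.7.3/etc/hadoop/hadoop-env.sh on each node
export HADOOP_LOG_DIR=/usb/hdfs/logs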

3 thoughts on “USB Disks on Pi”

  1. Pingback: hadoop pi
  2. Hi, thanks for this great article on adding USB to a Hadoop Pi cluster. I have a question. When you change the XML to point to the USB locations, does this remove the current storage access on each of the SD cards? Second, do I need to add matching sized/make USBs on each node, or can they be a mix of sizes/brands? One of my USB sticks has a really long ID and throws an error, while the other smaller one shows the same naming scheme as your USB UUID example.

    Thank You.

  3. Hi Jack,

    (1) Shifting the hdfs mount point appears to leave the original location intact – I checked the original /opt/hadoop/hdfs/ directory and it still contains the datanode directories on each of the Pi MicroSD cards. I assume (but I have not tried this) that you could switch between this and the USB sticks, stopping and starting DFS after each config change. If you have data on the original volume that is needed on the new then you could export that data under the original config and re-import using the new.

    (2) Excellent question and sorry to say that I am not sure. Everything I have read suggests you should always attempt to match the specs on each machine in the cluster, with the possible exception of the namenode. For our purposes, I suspect that it does not matter if your USB sticks are different. UUIDs can vary in length and are set by formatting the device – are all your sticks formatted the same way?

    Articles to check out:
    Ask Ubuntu on setting UUIDs
    Cloudera blog on selecting the right hardware for Hadoop
    Lester Martin on Hadoop storage and mount points

    I will certainly have a play and see what I can establish and I would recommend you try too. Whilst, er, size might not matter, it is important to make sure that all devices have the same format.

    Hope this helps.

    Kind regards
    Simon
