Hadoop on the Pi – A Big Data Playground
This is the main article in a series of posts on implementing Hadoop on the Raspberry Pi. The aim is to build an HDFS file system that is available for processing on a small Hadoop Pi cluster. This provides a way to get a good grounding in the installation and operation of HDFS and related components such as YARN, using a cheap and readily available platform. It is easy to implement and supported by a wealth of articles across the ‘net – please see the Supporting Articles section. The resulting cluster can be used as a small testbed or training solution.
Hadoop is the name generally applied to a bunch of related open-source technologies which contribute to the over-used moniker of Big Data. Essentially, it is one component in an ecosystem which makes it much easier to collect, store, analyse and transform large amounts of data with little or no risk of data loss.
As a builder of traditional RDBMS platforms over the last 30 years – especially MySQL and PostgreSQL – I wanted to discover whether the technology was useful and usable by an old SQL hand. Given its design as a handler of large amounts of data on clustered architecture, ‘playing’ with Hadoop requires a number of identical servers, ideally with decent storage and a fast network. You can purchase a variety of cloud-hosted SaaS solutions on short-term contracts, but it is also possible to deploy a small cluster using the Pi – enabling a limited Hadoop cluster for less than £200 and a day’s effort.
The basic process is listed below, with each step linking to a more detailed article with general guidance and instructions.
- The basic operating system build for each Pi
- Configuring the cluster network to suit Hadoop operation
- Implementing inter-server communication – SSH configuration
- Java installation on the Pi
- Hadoop installation and configuration
- Testing the cluster
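As a small illustration of the network step, the sketch below generates hosts-file style lines so that every Pi can resolve every other node by name. The node names and addresses are hypothetical and do not come from the detailed articles; treat it as one possible way to keep the entries consistent across nodes, not as the configuration used in this build.

```python
import ipaddress


def hosts_entries(nodes):
    """Return one 'IP<TAB>hostname' line per node, ordered by IP address."""
    # Sort numerically by address so 10.0.0.2 comes before 10.0.0.10.
    ordered = sorted(nodes.items(), key=lambda item: ipaddress.ip_address(item[1]))
    return [f"{ip}\t{name}" for name, ip in ordered]


if __name__ == "__main__":
    # Hypothetical three-node cluster; substitute your own names and addresses.
    cluster = {
        "pi-node1": "192.168.0.101",
        "pi-node2": "192.168.0.102",
        "pi-node3": "192.168.0.103",
    }
    for line in hosts_entries(cluster):
        print(line)
```

The same lines can then be appended to the hosts file on each Pi, which avoids hand-typing slightly different entries on every node.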
Big Data Expertise :
Whilst Hadoop is arguably a collection of tools and technologies, it boils down to the provision of a sophisticated and flexible file storage solution in HDFS. You do need a degree of Linux expertise – at least some confidence working with the Linux command line and a familiarity with scripts and text editors. You do not need to understand the guts of the thing, however. Above all, there is a plethora of amazing articles and assistance available. You can build a small, cheap cluster and use that as your stepping stone to Big Data knowledge.
Many Moving Parts :
It is Java. That brings a degree of simplicity when adding components – you do not need to compile / build large, unwieldy source collections in order to make things work. It is also highly mature and comes with all the necessary scripts and interfaces to start, use and stop the system.
Management Software is Essential :
Software such as Apache Ambari and Cloudera’s CDH provides deployment, monitoring and management solutions which, whilst important for scale / production builds, are not necessary to deploy a basic Hadoop file system with YARN and MapReduce. Again, this project proves that you do not need a huge, complex infrastructure to start gaining an understanding of ‘Big Data’ tech.
Pointless exercise ! : “The Pi is too small to do anything”
True, you can’t do it all on a Pi. Components can be added to the cluster but, whilst you can add Spark and perhaps one or two other useful overlays, the power and space of the Pi will limit the processing you can complete. This is not a production solution. 1GB RAM and SD or USB storage can be limiting, as will be the 32-bit JVM.
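To give a feel for how tight 1GB of RAM is, here is a back-of-envelope sketch of how many fixed-size processing containers might fit on one node once memory has been set aside for the operating system and the Hadoop daemons. The reserve and container sizes are assumptions chosen purely for illustration, not measured values or recommended settings.

```python
def containers_per_node(total_ram_mb, reserved_mb, container_mb):
    """Return how many whole containers fit in the RAM left after the reserve."""
    if container_mb <= 0:
        raise ValueError("container_mb must be positive")
    usable_mb = max(total_ram_mb - reserved_mb, 0)
    return usable_mb // container_mb


if __name__ == "__main__":
    # Assumed figures: 1024 MB Pi, 384 MB held back for the OS and daemons,
    # 256 MB per container. 640 MB usable // 256 MB = 2 containers.
    print(containers_per_node(1024, 384, 256))
```

With numbers like these, only a couple of tasks can run on a node at once, which is why the Pi cluster suits training rather than real workloads.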
Unproved (so far … )
“HDFS on the Pi Eats SD-cards” :
I am not yet clear if the active nature of HDFS file management degrades SD cards on the Pi any quicker, if at all. Articles on the ‘net seem to suggest this is the case and hence the reason why moving the HDFS storage to a USB drive might be sensible for the Pi cluster. Using a USB drive on the Pi for HDFS is easy to achieve but I do not know if it is necessary or even a solution to any SD card degradation. For a general discussion on SD card life, see this article on Apotelyt.com, or check out this excellent article on Richard’s Ramblings (originally 2013 but since updated). [UPDATE: I have now shifted the HDFS volumes to USB sticks on the Pi cluster – see the Techno Babble article here.]
Hadoop Pi – So now what ?
As this shows, it is possible to build a training cluster for Hadoop on Pi … but now what ? See the sources below for additional work that can be done on the cluster, including the ubiquitous ‘word count’ test and an installation of Spark onto the same cluster. Related projects will feature in future Techno Babble articles.
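For anyone who has not met the ‘word count’ test before, the sketch below mimics its map, shuffle and reduce stages in plain Python on a single machine. It is not the Hadoop example job and uses no Hadoop API; the function names are made up for illustration and the point is only to show the shape of the computation that the cluster distributes across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for a single word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data on the pi", "the pi is small"]
    print(word_count(sample))
```

On a real cluster the map and reduce stages run on different nodes against blocks of a file held in HDFS, but the logic of each stage is no more complicated than this.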
I am hugely grateful to the authors of the articles below. Whilst I found no one single piece that covered every detail completely, they all offer excellent guides to building Hadoop on the Pi and are well worth your time:
- Alan Verdugo on IBM Developer: building-a-hadoop-cluster-with-raspberry-pi
- Don’t Quit Your Day Job : raspberry-pi-hadoop-cluster-apache-spark-yarn
- Jonas Widriksson : raspberry-pi-2-hadoop-2-cluster
- Because We Can Geek : building-a-raspberry-pi-hadoop-cluster-part-1