Unless you’ve been hiding away from the world of computing for the last few years, you’ll have come across Hadoop.
Apache Hadoop, to give it its full name, is an open source framework designed to handle the storage and processing of large amounts of data using low-cost commodity hardware. Since its 1.0 release at the end of 2011, it has become one of the most popular platforms for handling big data.
How does it work?
Hadoop grew out of papers Google published on its Google File System and MapReduce, and it's a cross-platform framework developed in Java. It has four core components: Hadoop Common, which holds the libraries and utilities used by the other modules; the Hadoop Distributed File System (HDFS), which handles storage; Hadoop YARN, which manages computing resources; and Hadoop MapReduce, which handles processing.
Hadoop works by splitting files into blocks and distributing them across a number of nodes in a cluster. It then uses packaged code shipped to the nodes to process the data in parallel, which means the data can be dealt with more quickly than it could be using a conventional architecture.
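The map, shuffle and reduce steps behind this model can be sketched in miniature. The following is an illustrative simulation in plain Python, not actual Hadoop code: each "block" stands in for a chunk of a file held on a different node, a map step emits (word, 1) pairs, a shuffle groups the pairs by key, and a reduce step sums the counts.

```python
from collections import defaultdict

# Illustrative sketch of the MapReduce model (not actual Hadoop code).
# Each "block" stands in for a chunk of a file stored on a different node.

def map_phase(block):
    """Emit a (word, 1) pair for every word in a block."""
    return [(word, 1) for word in block.split()]

def shuffle(mapped_pairs):
    """Group emitted pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

blocks = ["big data big clusters", "big data processing"]

# In a real cluster, each block would be mapped on a different node in parallel.
mapped = [pair for block in blocks for pair in map_phase(block)]
result = reduce_phase(shuffle(mapped))
print(result)  # {'big': 3, 'data': 2, 'clusters': 1, 'processing': 1}
```

In Hadoop itself the map and reduce functions are written against the Java MapReduce API and the framework handles the shuffle, but the division of labour is the same.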
A typical Hadoop cluster will have both master nodes and slave or worker nodes. A master node is made up of a JobTracker, TaskTracker, NameNode, and DataNode. A slave node usually works as both a DataNode and TaskTracker, although in specialised applications it is possible to have data-only and compute-only slave nodes.
In large Hadoop clusters the HDFS nodes are managed through a dedicated NameNode server, which hosts the file system index, with a secondary NameNode able to take snapshots of the NameNode's memory structures. This helps prevent loss of data and corruption of the file system.
The Hadoop file system
The Hadoop Distributed File System is at the core of Hadoop's operation and is designed to be scalable and portable. Its advantage when dealing with big data is that it can store files of gigabyte or terabyte size across multiple machines. Because data is replicated across multiple hosts, the storage is reliable and doesn't need additional protection such as RAID, although RAID is still sometimes used with Hadoop to improve performance. Further protection is provided by allowing the main NameNode server to switch automatically to a backup in the event of failure.
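The replication idea can be illustrated with a toy sketch. This is a simulation under assumed parameters, not HDFS code: the block size is shrunk to a few bytes (HDFS defaults to 128 MB blocks), the replication factor of three matches the HDFS default, and the round-robin placement is a simplification of HDFS's real placement policy. The point it shows is that with each block copied to several DataNodes, losing a node still leaves every block readable.

```python
# Toy sketch of HDFS-style block replication (not actual HDFS code).
BLOCK_SIZE = 8    # bytes per block for illustration; HDFS defaults to 128 MB
REPLICATION = 3   # copies of each block, matching the HDFS default

def split_into_blocks(data, size=BLOCK_SIZE):
    """Split a byte string into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.
    (Real HDFS placement is rack-aware; this is a simplification.)"""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        for r in range(replication):
            node = nodes[(i + r) % len(nodes)]
            placement[node].append(block)
    return placement

def readable_blocks(placement, failed):
    """Blocks still readable after some nodes fail."""
    return {b for node, held in placement.items()
            if node not in failed for b in held}

nodes = ["node1", "node2", "node3", "node4"]
blocks = split_into_blocks(b"a large file stored across the cluster")
placement = place_blocks(blocks, nodes)

# Even with one DataNode down, every block survives on other nodes.
assert readable_blocks(placement, failed={"node1"}) == set(blocks)
```

With a replication factor of three, any single node failure (and in this layout any two) leaves at least one live copy of every block, which is why HDFS can treat commodity hardware failure as routine.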
HDFS can be mounted directly via a Filesystem in Userspace (FUSE) virtual file system on Linux systems, and file access can also be handled through a Java API. HDFS is designed for portability across hardware platforms and operating systems.
Hadoop can work with other storage systems too, including FTP file systems, Amazon S3 and Microsoft Azure Storage. However, each requires a specific file system bridge, and performance can suffer because the data is no longer local to the compute nodes.
Hadoop and the cloud
As well as in traditional data centres, Hadoop is frequently deployed in the cloud. This has the advantage that companies can deploy Hadoop more quickly and with lower setup costs. Most of the major cloud vendors have some kind of Hadoop offering.
Microsoft offers Azure HDInsight, allowing users to provision the number of nodes they require and be charged only for the computing power and storage they use. HDInsight is based on Hortonworks software and makes it easy to move data between in-house systems and the cloud for backups, or for development and testing.
Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3) also support Hadoop, and Amazon offers an Elastic MapReduce product that automates the provisioning of Hadoop clusters, the running and terminating of jobs, and the transfer of data between EC2 and S3.
Google offers a managed Spark and Hadoop service called Cloud Dataproc, along with a range of shell scripts to create and manage Spark and Hadoop clusters. It supports third-party Hadoop distributions from Cloudera, Hortonworks and MapR. There are connectors to allow Google Cloud Storage to be used with Hadoop too.
Take-up of Hadoop has been somewhat tentative, however. A Gartner study in 2015 showed that only 18 percent of respondents had plans to invest in Hadoop over the following two years. Reasons for reluctance to adopt the technology included costs being too high relative to expected benefits, and a lack of the necessary skills.
There are some high profile users though. Yahoo’s search engine is driven by Hadoop and the company has made the source code of the version it uses available to the public via the open-source community. Facebook too uses Hadoop and in 2012 the company announced its cluster had 100 petabytes of data and was growing at around half a petabyte each day.
Despite slow initial take-up, Hadoop is growing. A survey by Allied Market Research at the start of 2016 estimated that revenue from the Hadoop market will be worth over $84 billion by 2021.
Because of the way Hadoop works, it has seen something of a return to the old days of batch processing information. While it's useful for drawing insights from large volumes of historical data, it's less effective for real-time applications or where there's a continuous incoming data flow.
Hadoop has always been closely associated with big data. As the number of Internet of Things devices expands and the amount of data collected grows, so the demand for Hadoop’s processing capabilities will grow too. Its ability to process large volumes of data quickly will mean Hadoop systems becoming increasingly important to making day-to-day business decisions.
Organisations of all sizes are keen to take advantage of big data. The open source nature of Hadoop and its ability to run on commodity hardware means that its processing power is available not just to large corporations, so it should help to democratise the use of big data.
For all of this to work successfully, companies need to be able to take advantage of the benefits Hadoop can deliver. This means the skills gap will need to be addressed, and staff with Java, Linux, file system and database backgrounds who are able to pick up Hadoop skills quickly are likely to remain in demand. It also means increasing use of the cloud to deliver Hadoop's advantages in a less complex way.