For organisations looking to leverage large data sets to gain insight and competitive advantage, Hadoop stands ready as the de facto standard for processing big data. Still, adopting a Hadoop distribution is a major decision, requiring a high level of attention and scrutiny to make sure that the chosen platform supports mission-critical applications, generates maximum ROI and meets current and future needs.
To help facilitate the selection process, Robert D. Schneider (the author of Hadoop for Dummies) has just released an eBook titled The Hadoop Buyer's Guide. In the guide, the author discusses four critical considerations when selecting a Hadoop platform, each of which is highlighted below.
Performance and scalability
To maximise performance, Schneider lists a number of specific features or "critical architecture preconditions" that should be present in the chosen Hadoop environment. Among these prerequisites are:
- Minimal software layers – The more 'moving parts' or software layers a system has, says Schneider, the more performance is compromised. Having to "navigate a series of separate layers such as HBase Master and RegionServer, the Java Virtual Machine, and the local Linux file system" can jeopardise performance and reliability.
- A single environment platform for all big data applications – Many Hadoop implementations require administrators to create separate instances to handle additional workloads. Schneider recommends selecting a single platform capable of handling the full spectrum of workloads the organisation is likely to encounter.
- Ability to leverage the elasticity and scalability of popular public cloud platforms – To maximise performance, the Hadoop distribution needs to run on widely adopted cloud environments such as Amazon Web Services and Google Compute Engine. Running only inside the enterprise firewall is not enough.
In addition to these architectural preconditions, Schneider stresses the advantages of Streaming Writes, a critical capability for organisations looking to leverage real-time, Hadoop-based decision-making. He also explains the importance of a scalable platform, which allows organisations to more fully capitalise on their big data without going over budget.
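To make the streaming-writes idea concrete, here is a minimal sketch of append-style writes, where each record is made visible to readers as it arrives rather than being collected into a batch first. This is a toy illustration using a local file as a stand-in; in a real deployment the writes would go through the distribution's own file or table API, and all names here are assumptions.

```python
import os
import tempfile

def stream_events(path, events):
    """Append each event to the store as it arrives (one line per event)."""
    with open(path, "a") as sink:
        for event in events:
            sink.write(event + "\n")
            sink.flush()  # make each record visible to readers immediately

def read_events(path):
    """Read back everything written so far."""
    with open(path) as source:
        return [line.rstrip("\n") for line in source]

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "events.log")
    stream_events(path, ["sensor=1 temp=20", "sensor=2 temp=21"])
    stream_events(path, ["sensor=1 temp=22"])  # later arrivals are appended
    print(read_events(path))
```

The point of the pattern is that downstream consumers can read partial results at any time, which is what enables the real-time decision-making Schneider describes.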
Dependability
In discussing dependability, the author explains that to reduce the burden on users and administrators, the best Hadoop infrastructure should be capable of handling "the inevitable problems encountered by all production systems." He then discusses several foundational principles that can increase a Hadoop distribution's dependability. Amongst these is the elimination of 'moving parts' such as RegionServers and HBase Masters. Reducing manual tasks, such as compactions and manual presplitting, is also said to increase overall dependability. Schneider also discusses data integrity, protection and disaster recovery functionality that can help heighten a Hadoop platform's dependability.
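For readers unfamiliar with the term, manual presplitting means choosing a table's region boundaries up front so that load is spread evenly from the start, instead of letting one region absorb all writes and split later. The sketch below only illustrates the arithmetic an administrator might use to pick evenly spaced boundaries over a hex-encoded keyspace; the key format and function names are assumptions for illustration, not any distribution's API.

```python
# Compute evenly spaced split points over a hex keyspace, as might be
# supplied when pre-creating regions for a large table. Illustrative only.

def split_points(num_regions, key_width=8):
    """Return num_regions - 1 boundary keys dividing a hex keyspace evenly."""
    max_key = 16 ** key_width
    step = max_key // num_regions
    return [format(i * step, "0{}x".format(key_width))
            for i in range(1, num_regions)]

if __name__ == "__main__":
    # Four regions over an 8-hex-digit keyspace -> three boundaries.
    print(split_points(4))  # ['40000000', '80000000', 'c0000000']
```

Schneider's argument is that a platform which makes this kind of manual tuning unnecessary removes a whole class of operator error.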
Manageability
According to the author, while Hadoop in the early days required sophisticated developers to manage multiple Hadoop environments, this is no longer feasible or necessary. A number of today's Hadoop platforms are intelligently designed to ease administrative burdens, and these platforms are the ones to seek out. Schneider emphasises that the quality and depth of management tools for administration and monitoring can vary from one Hadoop distribution to another. Thus, due diligence should be exercised in determining which platform will be the easiest to manage.
Data access
In order to more fully exploit the potential value of massive amounts of data, the guide emphasises the importance of selecting a Hadoop platform that simplifies and speeds up the ingestion and extraction of information while allowing existing applications to easily connect to Hadoop's data. A number of ways to enhance data access are discussed in the eBook, including desired architectural foundations. As the author points out, the overall goal for organisations is to make sure that the selected Hadoop platform interacts smoothly with the rest of the IT environment.
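One concrete way existing applications can connect to Hadoop's data without Hadoop client libraries is the standard WebHDFS REST interface. The sketch below only builds the request URLs for that interface; the hostname, port and file path are placeholders, and no cluster is contacted.

```python
# Build WebHDFS request URLs, the REST interface HDFS exposes over HTTP.
# Host, port and path below are placeholders, not a real cluster.

def webhdfs_url(host, port, path, op, **params):
    """Return a WebHDFS URL for the given file-system operation."""
    query = "&".join(["op={}".format(op)] +
                     ["{}={}".format(k, v) for k, v in sorted(params.items())])
    return "http://{}:{}/webhdfs/v1{}?{}".format(host, port, path, query)

if __name__ == "__main__":
    # Reading a file is a plain HTTP GET against .../webhdfs/v1/<path>?op=OPEN
    print(webhdfs_url("namenode.example.com", 9870, "/data/events.log", "OPEN"))
```

Because the interface is plain HTTP, any application that can issue web requests can read or write Hadoop data, which is the kind of smooth interaction with the wider IT environment the guide calls for.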
Robert D. Schneider's guide is by no means a definitive resource. However, for organisations considering Hadoop implementation, the eBook offers practical information, making it a great place to start. Along with identifying and discussing the critical considerations when selecting a Hadoop platform, the book includes comparisons of major Hadoop distributions, all of which could prove useful in clarifying and simplifying the selection process. For more information, you can check out his webinar, titled Hadoop or Bust: Key Considerations for High Performance Analytics Platform.
Michele Nemschoff is vice president of corporate marketing at big data platform solutions firm MapR Technologies.