How to avoid seven common Hadoop mistakes

(Image credit: Hadoop)

Hadoop, for all its strengths, is not without its difficulties. Business needs, specialised skills, data integration, and budget all need to factor into planning and implementation. Even when they do, a large percentage of Hadoop implementations fail.

To help others avoid common mistakes with Hadoop, I asked our consulting services and enterprise support teams to share their experiences working with organisations to develop, design and implement complex big data, business analytics or embedded analytics initiatives. These are their top 7 mistakes, and some advice on how to avoid them.

Mistake 1: Migrate everything before devising a plan

As tempting as it can be to dive headfirst into Hadoop, don’t start without a plan. Migrating everything without a clear strategy will only create long-term issues and expensive ongoing maintenance. With first-time Hadoop implementations, you can expect a lot of error messages and a steep learning curve.

Successful implementation starts by identifying a business use case. Consider every phase of the process – from data ingestion to data transformation to analytics consumption, and even beyond to other applications and systems where analytics must be embedded. It also means clearly determining how Hadoop and big data will create value for your business. 

My advice: Maximise your learning in the least amount of time by taking a holistic approach and starting with smaller test cases. Like artisan gin, good things come in small batches!

Mistake 2: Assume relational database skillsets are transferable to Hadoop

Hadoop is a distributed storage and processing framework built around a distributed file system (HDFS), not a traditional relational database (RDBMS). You can’t migrate all your relational data and manage it in Hadoop the same way, nor can you expect skillsets to be easily transferable between the two.

If your current team lacks Hadoop skills, it doesn’t necessarily mean you have to hire all new people. Every situation is different, and there are several options to consider. It might work best to train up existing people and add a few new hires. You might be able to plug skills gaps with point solutions in some instances, but growing organisations tend to do better in the long run with an end-to-end data platform that serves a broad spectrum of users.

My advice: While Hadoop does present IT organisations with skills and integration challenges, success depends on finding software with the right functionality and pairing it with the right combination of people and agility. More tools are now available that automate some of the more routine and repetitive aspects of data ingestion and preparation, for example.

Mistake 3: Treat a Hadoop data lake like a regular database

You can’t treat a data lake on Hadoop just like a regular database in Oracle, HP Vertica, or a Teradata database, for example. Hadoop’s structure is totally different. It also wasn’t designed to store anything you’d normally put on Dropbox or Google Drive. A good rule of thumb is: if it can fit on your desktop or laptop, it probably doesn’t belong on Hadoop!

Data in a lake exists in a very raw form. Think of a box of Legos: you have what you need in it to build a Star Wars figurine, but it’s not a figurine out of the box. People imagine a data lake to be pristine and clear, with everything in it easy to find. But as your organisation scales up to onboard hundreds or more data sources, lakes often end up three miles wide, two inches deep and full of mud! IT time and resources can easily get monopolised by creating hundreds of hard-coded, error-prone data movement procedures.

My advice: Take the proper steps up front to understand how best to ingest data, so that you end up with a working data lake. Otherwise, you’ll end up with a data swamp. Everything will be there, but you won’t be able to derive any value from it.

Mistake 4: I can figure out security later

High profile data breaches have motivated most enterprise IT teams to prioritise protecting sensitive data. If you’re considering using big data, it’s important to bear in mind that you’ll be processing sensitive data about your customers and partners. Never, ever, expose credit card and bank details, national insurance numbers, proprietary corporate information and personally identifiable information about clients, customers or employees. Protection starts with planning ahead, not after deployment.

My advice: Address each of the following security solutions before you deploy a big data project:

  • Authentication: Verify the identity of every user and service that accesses a cluster
  • Authorisation: Control what actions users can take once they’re in a cluster
  • Audit and tracking: Track and log all actions by each user as a matter of record
  • Compliant data protection: Utilise industry standard data encryption methods in compliance with applicable regulations
  • Automation: Prepare, blend, report and send alerts based on a variety of data in Hadoop
  • Predictive analytics: Integrate predictive analytics for near real-time behavioural analytics
  • Best practices: Blend data from applications, networks and servers as well as mobile, cloud, and IoT data

Mistake 5: The HiPPO knows best

HiPPO is an acronym for the "highest paid person's opinion." Trusting one person’s educated opinion over data may work occasionally, but Hadoop is complex and requires strategic inquiry to fully understand the nuances of when, where, and why to use it. To start, it’s important to understand what business goals you’re trying to reach with Hadoop, who will benefit, and how the spend will be justified. Most big data projects fail because the business value is not being achieved. 

Once a data problem has been established, next determine whether or not your current architecture will help you achieve your big data goals. If you’re concerned about exposure to open source or unsupported code, it may be time to explore commercial options with support and security. 

My advice: Once a business need for big data has been established, decide who will benefit from the investment, how it will impact your infrastructure, and how the spend will be justified. Also, try to avoid “science projects” - technical exercises with limited business value. 

Mistake 6: Bridge the skills gap with traditional ETL

Plugging the skills gap can be tricky for organisations considering how to solve big data’s ETL challenges. There just aren’t enough IT pros with Hadoop skills to go around. On the other hand, some programmers proficient in Java, Python, and HiveQL, for example, may lack the experience to optimise performance on relational databases. When Hadoop and MapReduce are used for large scale traditional data management workloads such as ETL, this problem intensifies.

Some point solutions can help plug the skills gap, but these tend to work best for experienced developers. If you’re dealing with smaller data sets, it might work to hire people who’ve had the proper training on big data and traditional implementations, or work with experts to train and guide staff through projects. But if you’re dealing with hundreds of terabytes of data, for instance, then you’ll need an enterprise-class ETL tool as part of a comprehensive business analytics platform. 

My advice: Technology only gets you so far. People, experience, and best practices are essential for successful Hadoop projects. When considering an expert or a team of experts as permanent hires or consultants, you’ll want to consider their experience with “traditional” as well as big data integration, the size and complexity of the projects they’ve worked on, the companies they’ve worked with, and the number of successful implementations they’ve done. When dealing with very large volumes of data, it may be time to evaluate a comprehensive business analytics platform that’s designed to operationalise and simplify Hadoop implementations.

Mistake 7: I can get enterprise-level value on a small budget

The low-cost scalability of Hadoop is one reason why organisations decide to use it. But many organisations fail to factor in the storage space consumed by data replication (and recovered by compression), the skilled resources required, and the overall management of integrating big data with their existing ecosystem.

Remember, Hadoop was built to process a variety of enormous data files that continue to grow. And once data is ingested, it gets replicated! For example, if you have 3TB you want to bring in, that will immediately require 9TB of storage space, because HDFS keeps three copies of every block by default. This built-in replication provides the fault tolerance and data locality that make Hadoop’s parallel processing so powerful.

So, it’s absolutely essential to do proper sizing up front. This includes having the skills on hand to leverage SQL and BI against data in Hadoop and to compress data at the most granular levels. While you can compress data, compression affects performance, so it needs to be balanced against performance expectations for reading and writing data. Also, storing the data may cost three times what you initially planned.
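The sizing arithmetic above can be sketched in a few lines of Python. This is a rough illustration rather than a capacity-planning tool: the function name raw_storage_tb and its parameters are hypothetical, the replication factor of 3 mirrors the 3TB-to-9TB example, and the compression ratio of 2.0 is an assumed figure that will vary with your data and codec.

```python
def raw_storage_tb(logical_tb, replication_factor=3, compression_ratio=1.0):
    """Estimate raw cluster storage (in TB) needed for ingested data.

    compression_ratio is uncompressed size divided by compressed size,
    so 1.0 means no compression. All names here are illustrative.
    """
    if logical_tb < 0 or replication_factor < 1 or compression_ratio <= 0:
        raise ValueError("invalid sizing inputs")
    # Each stored byte is written once per replica.
    return logical_tb / compression_ratio * replication_factor


# 3 TB ingested, three replicas, no compression
print(raw_storage_tb(3))
# Same data with an assumed 2:1 compression ratio
print(raw_storage_tb(3, compression_ratio=2.0))
```

The estimate ignores temporary job output, operating system overhead and future growth, so treat the result as a floor and add headroom when budgeting.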

My advice: Understand how storage, resources, growth rates, and management of big data will factor into your existing ecosystem before you implement.

Wael Elrifai, Sr. Director of Sales Engineering, Pentaho

Wael Elrifai, Sr. Director of Sales Engineering, Pentaho, holds graduate degrees in engineering and economics. Memberships include Association for Computing Machinery, SIG for AI, Royal Economic Society, and Chatham House.