Big data is reshaping the landscape of business IT. Thanks to cheap storage, the massive processing power of the latest technology, and tools like Hadoop, organisations are now able to mine terabytes of information and derive useful business intelligence from it.
However, there's a frightening shortfall in staff with the necessary skills and knowledge to implement big data.
A recent report by think-tank e-skills UK has highlighted the massive skills gap soon to open up in the world of big data, predicting that the big data workforce will have to grow by a whopping 243 per cent to fill the void. As such, those with skills in Hadoop are raking in high salaries and rising in importance in organisations across all sectors.
Here's how you can get a slice of the pie.
1. Brush up on your Java
Hadoop is written in Java and is optimised for running map and reduce tasks that are themselves written in Java. If your Java is rusty, you may want to spend a few hours with a Java for Dummies book before you even begin looking at Hadoop.
Although most people familiar with Java will have fairly good coding knowledge, you may want to additionally brush up on your object-oriented skills and have a clear understanding of concepts like interfaces, abstract classes, static methods, variables, and so on.
Although it may be frustrating to go back to nursery and re-read a Linux or Java "Dummies" book, you'll be glad you did when you inevitably encounter some bizarre behaviour – maybe in a Pig or Hive query – and need to look under the hood to debug the code and solve the issue.
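As a warm-up, the object-oriented concepts above can be revisited with a toy word count in plain Java. To be clear, this is not Hadoop code – the class names here are illustrative, not Hadoop's API – but it exercises an interface, an abstract class, and a static method, and it mirrors the shape of a map ("emit a 1 for each word") and reduce ("sum the 1s per word") pass:

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {

    // An interface: a contract with no implementation
    interface Counter {
        Map<String, Integer> count(String text);
    }

    // An abstract class: shared tokenising logic for any Counter
    static abstract class AbstractCounter implements Counter {
        protected String[] tokenize(String text) {
            return text.toLowerCase().split("\\s+");
        }
    }

    // A concrete implementation: the "map" step emits a 1 per word,
    // and merge() plays the "reduce" step, summing per key
    static class WordCounter extends AbstractCounter {
        @Override
        public Map<String, Integer> count(String text) {
            Map<String, Integer> counts = new HashMap<>();
            for (String word : tokenize(text)) {
                counts.merge(word, 1, Integer::sum);
            }
            return counts;
        }
    }

    // A static method: callable without creating an instance first
    public static Map<String, Integer> run(String text) {
        return new WordCounter().count(text);
    }

    public static void main(String[] args) {
        System.out.println(run("the quick brown fox jumps over the lazy dog"));
    }
}
```

If any line of that sketch looks unfamiliar, that's exactly the material worth revising before opening a real Hadoop job.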
2. Take a free course
You can take a free Hadoop Fundamentals course at Big Data University. As part of the course you get Hadoop thrown in, in the form of IBM's BigInsights distribution.
3. See it in action
Hortonworks, one of the leading commercial Hadoop vendors, has put a whole glut of resources online for the budding big data scientist.
Anyone interested in using Hadoop can read detailed instructions on how to collect and process data and build applications; how to explore, query and deliver insights out of big data; as well as how to provision, manage and monitor Hadoop.
There are also a whole load of videos that show Hadoop in action, analysing sensor data, geolocation data and server logs, as well as a whole load of other inputs and data streams.
4. Mess around with Hortonworks Sandbox
Sandbox is a personal, portable Hadoop environment that comes with a dozen interactive hands-on Hadoop tutorials.
The teaching program also includes many of the more exciting developments from the latest HDP distribution, and it's all packaged up in a virtual environment that you can get up and running in 15 minutes.
The Sandbox allows you to build a proof of concept for your project. You can also add your own datasets, and connect it to your existing tools and applications, as well as testing new functionality.
5. Work your way up gradually
Hadoop is able to run in three modes: local (standalone) mode, pseudo-distributed mode and fully-distributed mode.
For the purposes of learning you can start with local mode, which is non-distributed and runs on a single machine.
You need to get into the habit of using small tester datasets on your local machine, and then running your code iteratively in Local Jobrunner Mode. This lets you locally test and debug your Map and Reduce code.
Next, move up to Pseudo-Distributed Mode, which more closely mimics the production environment. Finally, when you're Hadooping like a boss, you can graduate to Fully-Distributed Mode, which is your real production cluster.
By developing piecemeal in this way, you'll be able to get bugs worked out on smaller subsets of the data – then when you run on your full dataset with real production resources, you'll have all the kinks already worked out, and your job won't crash 75 per cent of the way in.
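In practice, the difference between these modes comes down to a handful of configuration properties. As a rough sketch, assuming a Hadoop 2.x setup (the property names come from the standard core-site.xml, mapred-site.xml and hdfs-site.xml files; the HDFS port varies between distributions, so treat 8020 as an example):

```xml
<!-- core-site.xml, local (standalone) mode:
     read straight from the local filesystem -->
<property>
  <name>fs.defaultFS</name>
  <value>file:///</value>
</property>

<!-- mapred-site.xml, local mode:
     run jobs in-process with the Local JobRunner -->
<property>
  <name>mapreduce.framework.name</name>
  <value>local</value>
</property>

<!-- core-site.xml, pseudo-distributed mode:
     talk to a single-node HDFS instead -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:8020</value>
</property>

<!-- hdfs-site.xml, pseudo-distributed mode:
     keep one copy of each block, since there's only one node -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```

Because only the configuration changes, the same job jar can be promoted from local testing through pseudo-distributed rehearsal to the real cluster untouched.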
6. Run Hadoop in the cloud
If you want to experience running Hadoop as a fully-distributed cluster, we recommend doing it in the cloud. BigDataUniversity.com has free courses on how to create your own Hadoop cluster on Amazon's or IBM's cloud. You can also get $25 of credit from Amazon, to boot.