Choosing the right tools for the big data science job

Data is everywhere and it is being collected every second of every day. Each click of a mouse and every search term entered creates data.

With the advent of the Internet of Things, devices are collecting data too, creating entirely new datasets. Data lakes are pooling, and with cost-effective tools now available to analyse that data, this is the time to start doing something with it.

With new kinds of data come new applications, and with them new businesses. These data-driven businesses are doing fantastic things with data and are one of the main reasons that data science is so in demand. When LinkedIn analysed global recruitment activity on its site over the course of 2015, it ranked “statistical analysis and data mining” as the second-hottest set of skills, after expertise in “cloud and distributed computing”. These data scientists are exploring, modelling, evaluating and automating data. They are mining for the insights that enable businesses to grow and thrive.

The hottest skills in town need the best tools. For data scientists, the tools of choice are R and Python, along with a healthy smattering of SQL. R is a language developed for statistical computing, with an unparalleled ability to sort through datasets and run predictive modelling. Python is a general-purpose language that is ideal for parsing and iterating through data. SQL (Structured Query Language) is the de facto standard way of accessing a relational database. All of these tools are fantastic in their respective use cases, but only up to a point. R and Python let you interact closely with data, to get your hands dirty and play with it interactively; the issue is preserving that way of working when you move to much larger datasets.
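
To illustrate what that hands-on style looks like in practice, here is a minimal sketch in Python using pandas; the file and column names are assumptions made purely for the example, not anything from the article.

```python
# A quick sketch of the interactive, exploratory style described above, using
# pandas. The file name and column names ("clicks.csv", "timestamp",
# "search_term") are illustrative assumptions.
import pandas as pd

clicks = pd.read_csv("clicks.csv", parse_dates=["timestamp"])

# Get your hands dirty: look at the shape, a few rows and summary statistics.
print(clicks.shape)
print(clicks.head())
print(clicks.describe(include="all"))

# Drill down: the ten most common search terms by click count.
top_terms = (
    clicks.groupby("search_term")
          .size()
          .sort_values(ascending=False)
          .head(10)
)
print(top_terms)
```

This style works beautifully while the data fits comfortably on one machine; the question the rest of this piece addresses is how to keep it once the data no longer does.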

Today, data scientists can be deprived of their strengths when they move to larger datasets, those in the realm of ‘big data’, because large-scale tools are too inflexible to support the data science style of working.

Michael Stonebraker, winner of the 2014 Turing Award, said: “…the change will come when business analysts who work with SQL on large amounts of data give way to data scientists, which will involve more sophisticated analysis, predictive modelling, regressions and Bayesian classification. That stuff at scale doesn’t work well on anyone’s engine right now. If you want to do complex analytics on big data, you have a big problem right now.”

How do data scientists overcome this? In the past they have fallen back on two compensation strategies: either they have worked in a batch-oriented fashion, losing major components of their powerful interactive style of working, or they have used small subsets of the data, missing insights and losing the ability to ‘drill down’ to the most granular levels. However, a new way of working can help. Combining the two top trends of recent years, namely “big data” and “data science”, the emerging field of “big data science” is blazing a trail for data scientists. It uses a new type of database that is designed to handle large data volumes while still delivering the agile, flexible and interactive feel that matches the exploratory style of a data scientist. It allows users to perform advanced analytics on large volumes of data interactively, right in the database, using any programming language.

There are several considerations when selecting a database for this type of work:

  • Make sure the database is fast enough: in-memory technology, which was once cost-prohibitive, is now a necessity for achieving the required performance. An intelligent in-memory system with compression allows operations on datasets far larger than physical RAM without a major drop in performance.
  • Find something scalable: Massively Parallel Processing (MPP) provides the scalability needed for big data applications, spreading operations across the machines in a cluster to maximise performance.
  • Insist on in-database analytical programming: user-defined functions (UDFs) give data scientists the flexibility to bring their algorithms to the data (a minimal sketch follows this list).
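
To make the UDF idea concrete, here is a minimal sketch in Python. It uses the standard library’s sqlite3 module purely for illustration: SQLite is an embedded, single-node database rather than an MPP analytic engine, and a big data platform would have its own mechanism for registering Python or R scripts, but the pattern is the same: the function is registered with the database and executed from SQL, next to the data. The table and column names are made up for the example.

```python
# A minimal illustration of in-database analytical programming via a UDF.
# sqlite3 stands in for an analytic database here; the registration syntax
# differs by engine, but the idea of shipping the algorithm to the data
# rather than exporting the data is the same.
import math
import sqlite3

def log_amount(amount):
    """Toy 'algorithm' pushed into the database: log-transform a sales value."""
    return math.log(amount) if amount and amount > 0 else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 340.5), ("west", 0.0)],
)

# Register the Python function as a one-argument SQL UDF called log_amount.
conn.create_function("log_amount", 1, log_amount)

# The transformation now runs inside the SQL engine, where the data lives.
for region, value in conn.execute("SELECT region, log_amount(amount) FROM sales"):
    print(region, value)
```

In an MPP analytic database, the same kind of query would be distributed across the cluster, with each node applying the function to its own slice of the data.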

Big data science architectures result from the convergence of advanced in-memory technology, massively parallel processing and in-database programming. With the ability to write UDFs, you can run analytic computations where the data resides and get answers fast, instead of exporting the data to a separate location for processing and analysis. And as data volumes continue to grow, keeping the data in the database is hugely advantageous from a management, processing and analysis perspective.

Big data science is revolutionising the way businesses generate value from data. It provides the ability to create, deploy and interact with production-quality data science models right where the data is stored. But the tools need to be selected carefully if big data scientists are to explore, model and evaluate without constraint.

Aaron Auld, CEO of analytic database company EXASOL
