Optimising data science and big data

This year we can expect to see the data storage and data science landscapes mature, break our trust, and surprise us.

Last year was a big year for big data. Enterprises are showcasing more production uses of Hadoop, demonstrating more successes with a broader array of big data technologies, and solidifying innovative ways of using data science to improve business outcomes.

At the same time, innovators are continuing to push the boundaries of machine learning, introducing techniques such as deep learning that require less human intervention.

With that in mind, below are our data science and data-related predictions for the C-Suite to be aware of this year and in years to come.

Taking action on data, not just storing it: Hadoop adoption and growth continue

We will see increasing numbers of companies doing more with the data they have in Hadoop. They will be performing more analytics and running more applications - adding greater utility to the data from which they already derive value.

This will be partially enabled by further and faster penetration of SQL on Hadoop, as these environments continue to allow more and more data to shift from traditional stores, co-locate into one environment, and become accessible to downstream apps and learning algorithms.

The shift from “One Algorithm Wonders” (Point Solutions) to data science platforms

Within data science, there is a complex market evolving, and one analyst recently delivered a snapshot of this “machine intelligence” landscape. There will still be an emphasis on hiring data scientists, and a number of specialised new apps will power unique projects in the market, while introducing some challenges.

Data from the Internet of Things (IoT) becomes more ubiquitous

Harvard Business School Professor Michael E. Porter co-authored an article, “How Smart, Connected Products Are Transforming Competition.” One of its key tenets is that the expansive use of built-in sensing and computing capabilities is transforming industry structure.

This is changing the way value is captured and managed, redefining channels, and totally transforming the businesses that various companies operate in. For example, automotive manufacturers are becoming more software-led, as we saw at CES.

With this in mind, we will see many additional interesting IoT applications gain adoption this year. The concept of automatons (modern, self-operating machines) and robotics will blend, and new, more basic robotic applications will be brought to life.

[Big] Data science will shift to the Cloud(s)

Today, public clouds provide easy access to compute environments, but they do this with limited support for truly big data and limited access to sophisticated modelling tools that support distributed architectures.

As enterprises scale their Hadoop infrastructure into private PaaS contexts, and data science tools are certified to these environments, ad hoc cloud support for their internal data scientists will become increasingly commonplace.

Realising Hadoop alone is not enough

A common question our data science teams receive is, “We have Hadoop. Can this power our Data Science?”

The answer is usually “no”, but a more nuanced response will acknowledge the type of data and data science work intended to be performed. This will be the epiphany for enterprise IT teams this year. More and more companies will realise they need more than Hadoop to power the complex, changing requirements of data science discovery.

To support data science apps that will continue to grow in ubiquity, to co-locate new sets of data that are injected into existing models, and to run various preprocessing calculations, we need a data lake architecture that houses as much data as needed, all in one place.

Making machine learning more accessible: New tools and dangers

According to research, many are struggling to source the skills they need to meet demand, and there is a shortage of data analysts who can transform big data into commercial value.

With this in mind, both university and for-profit educators are making moves to address this gap. We are also seeing vendors take a growing interest in making machine learning available beyond data scientists.

Today, the overall data science process includes extraction, transformation, model building, and scoring, and separate tools typically address a single point in this chain. However, the solution needs to be looked at holistically, and building tools for people without the academic background or real-world experience can get companies into big trouble.
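To make that chain concrete, here is a minimal sketch of the four stages wired together in one place. Everything in it is an illustrative assumption rather than a description of any particular product: the sample data, the column names ad_spend and revenue, the function names, and the choice of a simple least-squares line as the "model" are all hypothetical.

```python
import csv
import io
from statistics import mean

# Hypothetical raw data standing in for an extract from a data store.
RAW_CSV = """ad_spend,revenue
1,3
2,5
3,7
4,9
"""


def extract(raw_text):
    """Extraction: read raw records from a CSV source."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(records):
    """Transformation: convert text fields into numeric (x, y) pairs."""
    return [(float(r["ad_spend"]), float(r["revenue"])) for r in records]


def build_model(pairs):
    """Model building: fit y = slope * x + intercept by least squares."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept


def score(model, x):
    """Scoring: apply the fitted model to a new observation."""
    slope, intercept = model
    return slope * x + intercept


def run_pipeline(raw_text, new_x):
    """Run all four stages end to end."""
    return score(build_model(transform(extract(raw_text))), new_x)


if __name__ == "__main__":
    print(run_pipeline(RAW_CSV, 5))
```

Each function here stands in for what is often a separate tool. The point of the sketch is that a mistake in any one stage, such as a mis-parsed field during transformation, flows silently into the score at the end, which is why the chain needs to be considered as a whole.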

Moving forward, as data science practitioners face fewer doubters and are technically empowered to drive new usage of existing and new sources of data, they will deliver their share of big, unexpected, and cost-effective wins.

There will also be a group of companies that struggle to get Hadoop right, use data science without the proper guardrails, and either spectacularly and publicly reveal this, or more quietly start competing less effectively with their early-moving, technically thoughtful and strategic competitors.

There will doubtless be more debate, but ultimately these are our predictions for what will be on the agenda for CIOs in terms of delivering and deciding on data strategy this year.

Michael Natusch is Head of Data Science at Pivotal EMEA.