Making decisions with data – still looking for a needle in the big data haystack?

Many big data projects aren’t currently getting beyond the initial data collection phase.

Big data is getting big. Really big. IDC estimates that worldwide spending on big data solutions was about $122 billion in 2015, and that this figure will grow to $187 billion by 2019. According to research by Forrester, about 29 per cent of companies have already invested in NoSQL databases or are planning deployments, while a further 12 per cent plan to expand their existing implementations.

This spending points to how much value company leaders and their IT teams believe is locked up in their data. However, these implementations don’t create value on their own: they need analytics to make sense of the data. And finding valuable insights hidden in the data becomes more challenging as it grows in size and scope.

The biggest challenge is that many big data projects aren’t currently getting beyond the initial data collection phase. Amassing data is fine if you know what you intend to do with that information. In practice, however, many IT leaders end up relying on hiring data scientists to find insights that justify keeping all this data.

These insights may not exist in the first place, so the investment can start to look like it won’t generate the necessary returns. One problem with chasing a “Eureka moment” is that it is a manual process reliant on a significant degree of human intervention: typically, an analyst or data scientist must already know the right questions to ask. Without that experience, finding valuable insights can feel like looking for a needle in a haystack, and months can go by without any progress. It is one of the paradoxes of big data: projects are justified by the insights they promise to uncover, yet those insights cannot themselves be anticipated or predicted in advance.

Gartner estimates that about two-thirds of all big data projects are doomed to fail over the next two years, with initially promising implementations never getting beyond the pilot or experimentation phase. This failure rate should be hugely concerning for anyone investing in big data.

Finding data that drives business goals

To avoid these problems, it’s important to prioritise organisational goals first and work back to find data that can support these ambitions. Focusing on specific goals can ensure that the company is capturing the right data from its operations in the first place, rather than hanging on to data that is not going to get used. 

For CIOs, this means looking at big data in the round. Rather than using Hadoop as a dumping ground for large volumes of data, it’s worth considering all the sources of data being created across the business. Data might be held in traditional data warehouses, within specific applications or sourced from external suppliers. All of these sources can be brought together and put to use by business departments.

The challenge is to move away from each department maintaining its own sets of data, or simply stuffing them into Hadoop with the idea of coming back to them later. Instead, starting from objectives should help departments scale up their use of big data. This means looking at how each department measures its success, and then at how data sources can be queried to help improve that success over time.

This process puts a lot more emphasis on collaboration and teamwork. Rather than relying on data scientists to sift through big data for a hugely significant insight, the aim here is to create repeatable processes that both business decision makers and IT analysts can use. By focusing on how the business side will use data over time, rather than how specific business analysts or IT specialists will look at data, it’s possible to expand the number of people who will benefit.

The automation of analytics

To get there, more back-end data preparation work will be required. Typically, data preparation is carried out by those who already have experience using analytics, and it tends to be manual and time-intensive work. Blue Hill Research estimates that about 28 per cent of the work carried out by data analysts is exclusively data preparation.

This process is still necessary to get data ready for people to use, but it inevitably requires familiarity with fundamental data concepts such as tables, columns, keys, joins and relationships to manipulate and explore data. This puts self-service analytics out of the reach of most non-technical business people, as they still don’t have access to tools that enable them to work with data independently.
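To make this concrete, here is a minimal sketch of the kind of key-and-join knowledge that data preparation currently demands. The tables and column names are entirely hypothetical, and pandas is used purely for illustration:

```python
import pandas as pd

# Hypothetical tables: "orders" and "customers" are illustrative
# names invented for this example.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [101, 102, 101],
    "amount": [250.0, 90.0, 40.0],
})
customers = pd.DataFrame({
    "customer_id": [101, 102],
    "region": ["North", "South"],
})

# Preparing this data for analysis means knowing that customer_id is
# the shared key and that one customer relates to many orders.
prepared = orders.merge(customers, on="customer_id", how="left")
print(prepared)
```

Even a two-table join like this assumes a mental model of keys and relationships that most business users simply don’t have.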

Much more of this preparation stage can and should be automated. Part of that automation will be the deployment of machine learning algorithms to improve the data preparation process. Rather than relying on human intervention to produce clean and useful data, machine learning can take over some of this grunt work, anticipating which data sources, and which relationships between them, might be important.
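As an illustration of what “anticipating relationships” can mean in practice, here is a toy heuristic, not any particular vendor’s implementation, that suggests likely join keys between two tables by measuring how much their column values overlap:

```python
import pandas as pd

def candidate_join_keys(left: pd.DataFrame, right: pd.DataFrame,
                        threshold: float = 0.5) -> list:
    """Suggest likely join keys between two tables by value overlap.

    A toy stand-in for the relationship inference that automated data
    preparation tools perform; real systems also weigh column names,
    data types and value distributions.
    """
    suggestions = []
    for lcol in left.columns:
        for rcol in right.columns:
            lvals = set(left[lcol].dropna())
            rvals = set(right[rcol].dropna())
            if not lvals or not rvals:
                continue
            # Jaccard similarity of the two columns' value sets.
            overlap = len(lvals & rvals) / len(lvals | rvals)
            if overlap >= threshold:
                suggestions.append((lcol, rcol, round(overlap, 2)))
    return sorted(suggestions, key=lambda s: -s[2])
```

Run over the hypothetical orders and customers tables above, this would surface customer_id as the obvious candidate, without anyone having to know the schema in advance.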

There is also an opportunity for machine learning to automate the analysis process itself. We are now seeing the emergence of analytics technologies capable of automatically profiling data and identifying statistically significant relationships within it. The insights generated by these tools can answer questions that a human might not have thought of asking in the first place. 
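A toy version of that profiling step, assuming pandas and SciPy are available, might scan every pair of numeric columns and flag statistically significant correlations. Production tools are far more sophisticated, but the principle is similar:

```python
from itertools import combinations

import pandas as pd
from scipy.stats import pearsonr

def significant_relationships(df: pd.DataFrame, alpha: float = 0.05) -> list:
    """Flag statistically significant correlations between numeric columns."""
    numeric = df.select_dtypes("number")
    findings = []
    for a, b in combinations(numeric.columns, 2):
        pair = numeric[[a, b]].dropna()
        if len(pair) < 3:
            continue  # too few observations for a meaningful test
        r, p = pearsonr(pair[a], pair[b])
        if p < alpha:
            findings.append((a, b, round(r, 2), p))
    return findings
```

The point is that the machine proposes the relationships; the human only has to judge which ones matter to the business.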

Building a needle-finding machine

Based on machine learning and better data preparation tools, it’s now easier to get insight into what is hidden within data lakes. By making data easier to use, big data analytics can start to move out from specialist roles within IT to business analysts, and on to more mainstream business users.

As part of this process, it’s worth looking at how users can get their hands on data and use it within their decision-making processes. Partly, this will involve understanding what data is getting saved within these data lakes. 

For example, IoT devices can put out a lot of information from their sensors over time. This data can be useful for specific reporting – a logistics company may use sensor data from vehicles and goods to determine how the company’s drivers are performing against targets in real time. However, other departments may be able to use this data too for their own planning purposes.

The first element to consider here is that this information is not immediately useful to business users in its raw format. Time-series data requires some refinement to prepare it for analysis. By understanding the connections between the data points being gathered, it is possible to make that data easier to consume. Using semantic terms to describe sets of data can help here, while machine learning techniques can help put data into the right context as well.
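A minimal sketch of this kind of refinement, assuming a hypothetical vehicle sensor feed with invented readings, might aggregate raw data into hourly windows and give the result a semantic, business-friendly name:

```python
import pandas as pd

# Hypothetical raw sensor feed from one delivery vehicle: the column
# names and readings are invented for illustration.
raw = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-05-02 08:05", "2017-05-02 08:25",
        "2017-05-02 08:45", "2017-05-02 09:10",
    ]),
    "speed_kmh": [62.0, 0.0, 48.0, 55.0],
})

# Refine the time series into hourly windows and rename the derived
# column in semantic, business-friendly terms.
refined = (
    raw.set_index("timestamp")
       .resample("1H")
       .agg({"speed_kmh": "mean"})
       .rename(columns={"speed_kmh": "average_speed_kmh"})
       .reset_index()
)
print(refined)
```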

In this example, data on real-world delivery efficiency can be compared against predictions to see whether planning processes are accurate; if they aren’t, then reviewing staffing allocations might be in order. At the same time, investment in particular types of goods or locations within the country can be assessed to predict where the better opportunities for profitability lie.
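A simple plan-versus-actual comparison, again with hypothetical figures, shows how little code this kind of decision-support check requires once the data has been prepared:

```python
import pandas as pd

# Hypothetical planned vs actual delivery times per route.
plan = pd.DataFrame({
    "route": ["North loop", "City centre", "Airport run"],
    "planned_minutes": [45, 60, 90],
})
actual = pd.DataFrame({
    "route": ["North loop", "City centre", "Airport run"],
    "actual_minutes": [52, 58, 121],
})

report = plan.merge(actual, on="route")
report["variance_pct"] = (
    (report["actual_minutes"] - report["planned_minutes"])
    / report["planned_minutes"] * 100
).round(1)

# Routes running well over plan may signal inaccurate planning or a
# need to review staffing allocations.
print(report[report["variance_pct"] > 10])
```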

From the original time-series data, it is possible to create more business context to help decision-making. Rather than simply relying on data scientists to find those valuable needles in the haystack, this approach uses data to help everyday business users to find insights that will help them over time. By networking different sets of data together, applying smarter data management techniques and automating some of the refinement processes, everyone can and should be able to benefit from big data.

Pedro Arellano, vice president, product strategy, Birst

ABOUT THE AUTHOR

Pedro Arellano is vice president, product strategy at Birst, leading development around networked data and analytics. Prior to Birst, he led marketing at MicroStrategy and hosted the Stereo Gol radio show.