
Data virtualisation: The key to getting the most from your data lake

(Image credit: Shutterstock/alexskopje)

Data is the lifeblood of every organisation, regardless of size or sector. Over the years, it has become a crucial part of doing business and, by harnessing it effectively, companies can use it to boost productivity and improve decision making.

However, maximising the potential of an organisation’s data is not necessarily a simple process. This is because with every action, reaction and interaction data is produced, resulting in an avalanche of information. And this is not set to change anytime soon, with IDC predicting that worldwide data will grow 61 per cent to 175 zettabytes by 2025. With all these exabytes, petabytes and zettabytes of data floating around, it’s no wonder that the concept of the data lake has edged its way into wider business strategy for many organisations.

What to do with all that data

In order to make some sense of the never-ending stream of information being produced, organisations need to store and manage it in one consolidated central repository, rather than having it spread out across lots of different sources, databases and structures.

As a result, data lakes – with their vast, low-cost storage – have become a principal data management architecture. They hold all the data that could be of interest, whether structured or unstructured, in one place. This eases data discovery processes and reduces the amount of time data scientists spend on selection and integration.

But that’s not all. When implemented correctly, data lakes can save businesses time and money whilst providing massive computing power, so data can be efficiently transformed and combined to meet the needs of any process. They also enable organisations to use machine learning in order to analyse historical data and forecast likely outcomes, further increasing overall productivity. The success of this model was illustrated in a recent analyst report, which found that those organisations employing a data lake typically outperform their peers by 9 per cent in organic revenue growth.

All that said, in order to reap the rewards of a data lake, organisations need to ensure that the data stored within it is easy to discover and define.

Finding the (data) needle in the (data lake) haystack

Enter data virtualisation...

Despite the clear benefits associated with a data lake, most businesses still find themselves struggling with certain aspects of data discovery and integration. This makes it difficult to glean any insight or intelligence from the data stored within a data lake.

It’s simple: having all your data in the same physical place doesn’t necessarily make discovery easy. What’s more, slow and costly replication of data from its systems of origin can mean that only a small subset of the relevant data ever reaches the lake. In fact, many companies have hundreds of repositories distributed across several on-premises platforms, data centres and cloud providers, so any discovery solution needs to be able to span all potential data storage areas.

Another huge challenge with the data lake is that storing data in its original form does not remove the need to adapt it later for machine learning. This complex task is often left to data scientists, becoming a serious drain on their time. In fact, a recent study revealed that data scientists can spend up to 80 per cent of their time finding, cleaning and reorganising data, leaving only 20 per cent for actually analysing it.

In recent years, several data preparation tools have been developed to enable data scientists to carry out simple integration tasks themselves. However, these tools are limited: they cannot help with the more complex tasks that require an advanced data engineering skillset. So, what’s the answer?

Unlocking the benefits of a data lake

Providing a single access point for all data in a data lake, data virtualisation allows for effective machine learning analysis and opens the door to a whole new world of possibilities for driving business value from data. It stitches together data abstracted from various underlying sources and delivers it to consuming applications in real time so that even data that has not been copied to the lake is available for data scientists to use and analyse.
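The core idea here can be sketched in a few lines of code. In this illustrative (and deliberately simplified) Python sketch, each adapter wraps one underlying system and the virtual layer exposes a single query interface, pulling data only when it is asked for rather than copying it into the lake first. All class and field names are assumptions for illustration, not a real product API.

```python
class SourceAdapter:
    """Wraps one underlying system of origin (database, SaaS API, files, ...)."""

    def __init__(self, name, fetch):
        self.name = name
        self._fetch = fetch  # callable returning rows as dicts

    def rows(self):
        # Data is read at query time, on demand - never replicated in advance.
        return self._fetch()


class VirtualLayer:
    """Single access point: federates queries across registered sources."""

    def __init__(self):
        self._sources = {}

    def register(self, adapter):
        self._sources[adapter.name] = adapter

    def query(self, source_names, predicate=lambda row: True):
        # Pull matching rows from each requested source in real time.
        for name in source_names:
            for row in self._sources[name].rows():
                if predicate(row):
                    yield {"source": name, **row}


# Two heterogeneous "systems of origin", queried through one layer.
crm = SourceAdapter("crm", lambda: [{"customer": "Acme", "region": "EU"}])
logs = SourceAdapter("logs", lambda: [{"customer": "Acme", "events": 42}])

layer = VirtualLayer()
layer.register(crm)
layer.register(logs)

results = list(layer.query(["crm", "logs"]))
```

Because the layer holds only adapters, not copies of the data, adding a new source is a registration step rather than a replication project – which is precisely what makes data outside the lake reachable.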

Using data virtualisation to create an abstraction layer can transform a data lake so that it is no longer a ‘dumping ground’ full of data that is difficult to find and make sense of. Instead it can become a valuable asset, holding useful information and accessible business benefits.

Above all, it can address the two main problems faced by data scientists, enabling them to better discover and integrate data.

• Discover data – The availability of data sets within a data virtualisation layer does not depend on them being replicated from an origin system, so new content can be added more quickly and at a lower cost. Using data virtualisation also means that it is possible to access all data in real time. Simply put, data virtualisation allows data scientists to access more data. But that’s not all... the best-of-breed tools will also offer searchable catalogues of all available data sets, so data scientists can easily search for the information that will best improve their organisation’s processes.

• Integrate data – With data virtualisation, all data is organised according to a consistent data representation and query model, regardless of whether it is stored in a relational database, a Hadoop cluster, a SaaS application or a NoSQL system. This means that data scientists see it as if it were stored in one place. It also makes it possible to create reusable logical data sets, tailored to the needs of different machine learning processes, which take care of complex issues such as transformation and performance optimisation.
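The “consistent representation” idea in the second point can be illustrated with a short Python sketch: records arriving in different native shapes (a SQL result tuple, a nested NoSQL document) are normalised into one schema, so downstream machine learning code never needs to know where a record came from. The field names and source shapes are hypothetical, chosen purely for illustration.

```python
def from_relational(row_tuple):
    # e.g. a (customer_id, amount) tuple from a SQL result set
    customer_id, amount = row_tuple
    return {"customer_id": customer_id, "amount": float(amount)}


def from_document(doc):
    # e.g. a nested JSON document from a NoSQL store or SaaS API
    return {"customer_id": doc["customer"]["id"], "amount": float(doc["total"])}


def logical_sales_view(relational_rows, documents):
    """A reusable logical data set: one consistent schema over both sources."""
    yield from (from_relational(r) for r in relational_rows)
    yield from (from_document(d) for d in documents)


sales = list(logical_sales_view(
    [(1, "19.99"), (2, "5.00")],           # relational source
    [{"customer": {"id": 3}, "total": 12.5}],  # document source
))
# Every record now has the same shape, whatever its system of origin.
```

The transformation logic lives in the view, once, rather than being rebuilt by every data scientist for every project – which is where the time savings described above come from.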

With data playing an ever-increasing role in our lives and the machine learning market expected to grow by 44 per cent over the next four years, businesses will continue to look to modern analytics to drive meaningful insight and better their operations and processes.

By enabling data scientists to discover and integrate more data at a faster pace, data virtualisation is set to become the key for organisations looking to turn their data lakes, and the data stored within them, into a valuable business asset.

Alberto Pan, CTO, Denodo

Alberto Pan is Chief Technical Officer at Denodo and Associate Professor at University of A Coruña. He has led product development for all versions of the Denodo Platform, and has authored more than 25 scientific papers in areas such as data virtualisation, data integration and web automation.