Simplifying machine learning through data virtualisation

(Image credit: Geralt / Pixabay)

It is often said that data is the new oil: a valuable commodity that drives the operations of businesses everywhere. So vast is the volume and variety of data that flows through today’s organisations that data lakes have now become one of the principal data management architectures.

Storing all data of interest – both structured and unstructured – in one central repository, a data lake, makes discovery easier and reduces the time data scientists spend on selection and integration. What’s more, a data lake provides massive computing power, allowing the data it holds to be transformed and combined to meet the needs of any processes that require it. The success of this model was illustrated in the findings of a recent analyst report, which found that organisations employing a data lake outperformed their peers by nine per cent in organic revenue growth.

However, when it comes to applying machine learning (ML) analytics as a means of gleaning insight and intelligence from the wealth of data held in these lakes, most businesses find themselves struggling with certain complexities of data discovery and integration. Indeed, with one study revealing that data scientists can spend up to 80 per cent of their time on these tasks, it is clearly time for a new approach.

Time for a change

Quite simply, having all your data in one physical place doesn’t necessarily make discovery easy; it can often be like looking for a needle in a haystack. What’s more, slow and costly replication of data from its systems of origin can mean that only a small subset of the relevant data will be stored in the lake. The waters are further muddied by the fact that companies may have hundreds of repositories distributed across a number of different cloud providers and on-premise databases.

Perhaps most significantly, storing data in its original form still requires it to be adapted for machine learning processes, but the burden now falls to data scientists who, while able to access the necessary processing capacity, tend not to have the skills required for integration.

The past few years have seen an emergence of data preparation tools designed to enable data scientists to carry out simple integration tasks, but there remain a number of complex tasks which require a more advanced skillset. In many cases, an organisation’s IT team may be called upon to create new data sets in the data lake specifically for ML purposes, and this can significantly slow progress.

If these issues are to be addressed and organisations are to unlock the full benefits of a data lake, new processes such as data virtualisation (DV) are needed. 

More data, more choice

Fundamentally, DV allows data scientists to access more data, and to do so in the formats that best suit their needs.

It provides a single access point to any data, regardless of its location and native format, without the need to first replicate it in a single repository. By applying complex data transformation and combination functions on top of the physical data, DV provides different logical views of the same physical data without the need to create any additional replicas. In doing so, it offers a fast and inexpensive way of using data to meet the particular needs of different users and applications and can help to address some of the main challenges faced by data scientists.
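The idea of a logical view over physical data can be illustrated with a minimal sketch. All names here (the sources, the view function) are hypothetical, invented for illustration, and not any real DV product's API; the point is simply that the join happens at query time, against the sources, with no replica created:

```python
# Two "physical" sources, each with its own native shape.
crm_source = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
]
billing_source = [
    {"cust": 1, "total": 120.0},
    {"cust": 2, "total": 300.0},
]

def customer_revenue_view():
    """A logical view: combines the two sources on demand,
    presenting one relational-style result without replication."""
    totals = {row["cust"]: row["total"] for row in billing_source}
    return [
        {"customer_id": c["customer_id"],
         "name": c["name"],
         "revenue": totals.get(c["customer_id"], 0.0)}
        for c in crm_source
    ]

print(customer_revenue_view())
```

A second view over the same two sources – say, revenue grouped by name prefix – would be just another function: a different logical shape over the same physical data, with no additional copies.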

The availability of data sets in a DV system does not depend on their being replicated from their systems of origin, meaning that new content can be added more quickly than with traditional methods, and at lower cost. This ultimately allows complete flexibility over what data is replicated: for one task, it is possible to access all data from all sources in real time; for another, to materialise all required data in the lake; and for a third, to opt for a mixed strategy in which only a subset of the data is materialised.
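The live-versus-materialised choice can be sketched as a per-view flag. This is an assumption-laden toy (the `LogicalView` class and `mode` values are invented for illustration), but it shows the trade-off: a live view always reads the source, while a materialised view replicates once and serves from the copy:

```python
class LogicalView:
    """Toy view with a per-view replication strategy."""
    def __init__(self, fetch, mode="live"):
        self.fetch = fetch    # callable that reads the origin system
        self.mode = mode      # "live" or "materialised"
        self._cache = None

    def rows(self):
        if self.mode == "materialised":
            if self._cache is None:   # replicate once, reuse after
                self._cache = self.fetch()
            return self._cache
        return self.fetch()           # always hit the source

calls = {"n": 0}
def fetch_orders():
    calls["n"] += 1                   # count reads of the origin system
    return [{"order": 1}, {"order": 2}]

live = LogicalView(fetch_orders, mode="live")
live.rows(); live.rows()              # two source reads
mat = LogicalView(fetch_orders, mode="materialised")
mat.rows(); mat.rows()                # one source read, then the cache
print(calls["n"])  # 3
```

A mixed strategy would simply mark some views live and others materialised within the same catalogue.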

And not only is DV flexible, it’s selective too. Best-of-breed DV tools will offer a searchable catalogue of all available data sets, including extensive metadata on each, such as tags, column descriptions and information on who uses each data set, when and how.

Clear and simple

DV also offers clarity and simplicity to data integration processes. Regardless of whether data is originally stored in a relational database, a Hadoop cluster, a SaaS application or a NoSQL system, for example, DV tools will expose it according to a consistent data representation and query model, allowing data scientists to see it as if it were stored in a single relational database.
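One way to picture that consistent representation is as a set of adapters that flatten each native format into the same relational-style rows. This is a sketch under stated assumptions – the adapter functions are hypothetical, not a real DV tool's API – using Python's built-in sqlite3 module to stand in for a relational source and a nested document for a NoSQL-style one:

```python
import sqlite3

def relational_adapter():
    # Rows that already live in a relational database.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE users (id INTEGER, region TEXT)")
    con.execute("INSERT INTO users VALUES (1, 'EU'), (2, 'US')")
    cols = ["id", "region"]
    return [dict(zip(cols, r))
            for r in con.execute("SELECT id, region FROM users")]

def document_adapter():
    # Documents from a NoSQL-style store, flattened to the same shape.
    docs = [{"_id": 3, "meta": {"region": "APAC"}}]
    return [{"id": d["_id"], "region": d["meta"]["region"]} for d in docs]

# The data scientist sees one uniform "table", wherever the rows live.
users = relational_adapter() + document_adapter()
print([u["region"] for u in users])  # ['EU', 'US', 'APAC']
```

The data scientist queries `users` as if it were one table; where each row physically lives is the adapters' concern, not theirs.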

In addition, it allows for a clear and cost-effective separation of responsibilities between IT data architects and data scientists. By employing DV, IT data architects can create ‘reusable logical data sets’ that expose information in ways useful for different specific purposes. What’s more, as there is no need to physically replicate the data, these logical data sets take considerably less effort to create and maintain than with traditional methods. Because the architects take care of complex issues such as transformation and performance optimisation, data scientists are left to adapt these reusable data sets to the individual needs of different ML processes, performing only the final, more straightforward customisations that might be required.

Unlocking the benefit

ML may still be in its relative infancy, but the market is expected to grow by 44 per cent over the next four years as businesses look to modern analytics as a means of driving operational efficiencies through ever more meaningful insight. As its adoption continues to grow, however, and data lakes become more prevalent, DV will become increasingly essential to improving the productivity of data scientists.

By enabling data scientists to access more data, by offering catalogue-based data discovery, and by simplifying data integration, DV will allow them to focus on their core skills rather than being burdened with data management. In turn, the organisation as a whole will enjoy the full benefits of the wealth of data it holds.

Alberto Pan, Chief Technical Officer, Denodo

Alberto Pan is Chief Technical Officer at Denodo and Associate Professor at the University of A Coruña. He has led Product Development for all versions of the Denodo Platform, and has authored more than 25 scientific papers in areas such as data virtualisation, data integration and web automation.