Building data lakes for GDPR compliance

(Image credit: Shutterstock/alexskopje)

If there’s one key phenomenon that business leaders across all industries have latched onto in recent years, it’s the value of data. The business intelligence and analytics market continues to grow at a massive rate, with Gartner forecasting that it will reach $18.3 billion in 2017, as organisations invest in the solutions that they hope will enable them to harvest the potential of that data and disrupt their industries.

But while companies continue to hoard data and invest in analytics tools that they hope will help them determine and drive additional value, the General Data Protection Regulation (GDPR) is forcing best practices in the capture, management and use of personal data.

The European Union’s GDPR stipulates stringent rules around how data must be handled. Because the regulation affects the entire data lifecycle, organisations must have an end-to-end understanding of their personal data, right through from its collection and processing to storage and, finally, its destruction.

As companies scramble to make the May 25th deadline, data governance is a key focus. But organisations cannot just think of the new regulations as a box to check. Continuous compliance is required, and most organisations are having to create new policies that will help them achieve a privacy-by-design model.

Diverse data assets

One of the great challenges in securely managing data is the rapid adoption of data analytics across businesses, as it moves from being an IT function to become a core asset for business units. As a result, data often flows in many directions across the business, so it becomes difficult to understand the data about the data, such as its lineage (where it was created and how it got there).

Organisations may have personal data in many different formats and types (both structured and unstructured), across many different locations. Under the GDPR, it will be crucial for them to know where personal data sits across their business and to manage it. While no one is certain exactly how the GDPR will be enforced, organisations will need to be able to demonstrate at a moment’s notice that their data management processes are continually in compliance with it.
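To make the idea of knowing where personal data sits a little more concrete, here is a minimal Python sketch of a personal data inventory. It assumes, purely for illustration, that email addresses are the only kind of personal data of interest and that each store can be read as text; the locations and contents in `STORES` and the function name `inventory_personal_data` are hypothetical. Real discovery would need to cover far more data types than this.

```python
import re

# A simple email-like pattern; an illustrative stand-in for real personal data detection.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical locations and contents, standing in for files spread across a business.
STORES = {
    "crm/contacts.csv": "name,email\nAna,ana@example.com\n",
    "logs/app.log": "2018-05-01 login ok user=ben@example.com\n2018-05-01 cache cleared\n",
    "docs/pricing.txt": "Standard plan: 20 EUR per month\n",
}


def inventory_personal_data(stores):
    """Return {location: number of email-like matches} for every
    location that contains at least one match."""
    found = {}
    for location, text in stores.items():
        hits = EMAIL.findall(text)
        if hits:
            found[location] = len(hits)
    return found


inventory = inventory_personal_data(STORES)
```

The resulting mapping lists only the locations that hold email-like values, which is the kind of record an organisation could keep up to date and produce on request.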

With the diverse sources and banks of data that many organisations have, consolidating this data will be key to effectively managing their compliance with the GDPR. Given the many different types of data held across an organisation, data lakes are a clear solution to the challenge of storing and managing disparate data.

Pool your data

An end-to-end view of personal data is crucial under the GDPR, enabling businesses to identify the quality and point of origin of all their information. As well as enabling organisations to store, manage and identify the source of all their data, data lakes provide a cost-effective means of storing it in one place. By contrast, managing such large volumes of data in a data warehouse has a far higher total cost of ownership (TCO).

A data lake is a storage method that holds raw data, including structured, semi-structured and unstructured data. The structure and requirements of the data are only defined once the data is needed. Increasingly, we’re seeing data lakes used to centralise enterprise information, including personal data that originates from a variety of sources, such as sales, CX, social media, digital systems and more.
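Defining structure only when the data is needed is often called schema-on-read. The following is a minimal Python sketch of that idea, not a description of any particular product: the records in `RAW_LAKE`, the field names and the helper `read_with_schema` are all hypothetical.

```python
import json

# Raw records land in the lake exactly as they arrive; no schema is enforced on write.
# The sources and field names below are hypothetical.
RAW_LAKE = [
    '{"source": "sales", "email": "ana@example.com", "amount": "120.50"}',
    '{"source": "social", "handle": "@ana", "text": "Great service"}',
    '{"source": "sales", "email": "ben@example.com", "amount": "75"}',
]


def read_with_schema(raw_records, schema):
    """Apply a schema at read time: keep only records that have every
    field named in the schema, and cast each field to the requested type."""
    rows = []
    for line in raw_records:
        record = json.loads(line)
        if all(field in record for field in schema):
            rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows


# The structure is chosen by the consumer at the moment the data is needed.
sales_schema = {"email": str, "amount": float}
sales_rows = read_with_schema(RAW_LAKE, sales_schema)
```

Here the social media record is simply skipped by the sales query, but it remains in the lake untouched and could be read later with a different schema.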

Data lakes, which use tools like Hadoop to track data within the environment, help organisations bring all their data together in one place where it can be maintained and governed collectively. The ability to store structured, semi-structured and unstructured data is crucial to the value of this approach for consolidating data assets, compared with data warehouses, which mainly hold structured, processed data. Enabling organisations to discover, integrate, cleanse and protect data so that it can then be shared safely is essential for effective data governance.

Beyond the view across the full expanse of the data lake, organisations can look upstream to identify where data came from before it flowed into the lake. That way, they can trace specific data back to its source, such as the CX or marketing applications, providing end-to-end visibility across their entire data supply chain so that data can be identified and scrutinised as necessary.
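Looking upstream in this way amounts to walking a lineage graph from a dataset back to the systems that fed it. The sketch below shows one simple way to do that in Python, assuming lineage has been recorded as a mapping from each dataset to the datasets it was derived from; the dataset names in `LINEAGE` and the function `trace_sources` are hypothetical.

```python
# Hypothetical lineage map: each dataset points to the datasets it was derived from.
LINEAGE = {
    "lake.customer_360": ["lake.crm_contacts", "lake.web_events"],
    "lake.crm_contacts": ["app.cx_platform"],
    "lake.web_events": ["app.marketing_site"],
    "app.cx_platform": [],
    "app.marketing_site": [],
}


def trace_sources(dataset, lineage):
    """Walk upstream from a dataset and return, in sorted order, the
    original systems (nodes with no recorded parents) its data came from."""
    sources = set()
    seen = set()
    stack = [dataset]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # guard against cycles and repeated visits
        seen.add(node)
        parents = lineage.get(node, [])
        if not parents:
            sources.add(node)
        stack.extend(parents)
    return sorted(sources)


origins = trace_sources("lake.customer_360", LINEAGE)
```

For the consolidated customer dataset, the walk ends at the CX and marketing applications, which is the information needed to answer where a given piece of personal data originated.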

Setting the foundations

While data lakes currently present the best approach for data management and governance for GDPR compliance, this will not be the last stop in organisations’ journey towards innovative, efficient and compliant data management. The data storage approaches of the future will be built with consideration for the new regulatory climate, and will be designed to meet the challenges it presents.

However, organisations are already expected to create data policies and practices that will keep their future data storage and analytics endeavours compliant, so it is clear that businesses need to start refining the processes and policies that will lay the foundations for compliant data innovation. Being able to quickly and easily identify and access all data, with a clear understanding of its source and stewardship, is now the minimum standard for the management of personal data.

The clock is ticking

Time is running out for many organisations to achieve GDPR compliance, with just weeks until its enforcement. However, companies must take a long-term view and build a data storage model that will enable them to consolidate, harmonise and identify the source of their data in compliance with the GDPR.

GDPR is bringing a new dimension to customer demand: customers now value trust and transparency and will vote with their feet. They will follow companies that are able to deliver personalised interactions while letting customers take full control of their personal data. Ultimately, companies that establish a system of trust at the core of their customer and/or employee relationships will win in the digital economy.

Jean-Michel Franco, Senior Product Marketing Director for Governance Products, Talend
Pinakin Patel, Senior Director EMEA Solutions Engineering, MapR

Jean-Michel Franco is the Senior Director of Data Governance Solutions at Talend.