The first step in any large-scale data integration project confronts businesses and government departments alike with a range of architectural choices when it comes to building new applications. Whether those applications are operational or analytical, each of those choices comes with a long list of pros and cons.
While the received wisdom might be to build a solution from a variety of different components to increase flexibility and potentially lower costs, there is still a need to integrate all of these elements and enforce the necessary levels of governance and security.
The challenge is deciding on the best approach. And this challenge, whether it’s faced by an enterprise or a public sector department, is made even harder by the volume of legacy data likely to exist across the organisation, which all too often is stored in silos.
Much of the data owned and stored by businesses and government departments alike is constrained by the silos it’s stuck in, many of which have been built over the years as organisations grow. When you consider the consolidation of both legacy and new IT systems, the number of these data silos only increases. What’s more, the impact of this is significant. It has been widely reported that up to 80 per cent of a data scientist’s time is spent on collecting, labelling, cleaning and organising data in order to get it into a usable form for analysis.
Successfully integrating data so that it can be interrogated is therefore not an easy process, but there are a number of options available to tackle the problem head on. However, as mentioned, it’s not immediately obvious which of these options is best. Data Hub? Data Lake? Virtual Database? What is clear, though, is that security is a paramount requirement. Unless security is addressed as a core principle rather than an afterthought, businesses and public sector departments leave themselves wide open for things to fall through the cracks, particularly when it comes to open source.
Data lake is an option
In fact, we saw evidence of what can happen when things go wrong just last summer, when a significant breach was identified by security researchers in a biometrics system widely used by banks as well as defence contractors and the Metropolitan Police. According to reports, the database used to store the facial recognition information and fingerprints of over one million people was discovered to be unprotected and largely unencrypted. This meant that the researchers had access to some 23GB of data, which reportedly also included security levels and clearance, as well as personal details of staff.
This goes to show that when it comes to highly sensitive information such as biometric data, which is increasingly being used by public sector authorities as well as private sector organisations, you cannot afford to take any shortcuts. In so many of the publicly reported cases of data breaches, the platforms involved rely on network security alone. This is essentially the equivalent of putting up a fence around a secure facility but leaving the doors to the building unlocked.
So security is a prerequisite, but what options are available when it comes to storing and managing data across multiple applications? One way to do this is through a Virtual Database (sometimes referred to as a Federated Database), which is a system that accepts queries and behaves as though it were one large database containing many disparate siloed data sets. In reality, it queries back-end systems in real time and converts the data to a common format as it does so.
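To make the Virtual Database idea concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration rather than any vendor's API: the two in-memory "silos" (`CRM_ROWS`, `BILLING_ROWS`), their field names and the `federated_query` function are all assumptions. It shows only the pattern described above: each query is passed to every back-end at request time, and each result is converted to a common format on the way out.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical silos, each holding records in its own layout.
CRM_ROWS = [{"FullName": "Ada Lovelace", "Town": "London"}]
BILLING_ROWS = [("Alan Turing", "Manchester")]

Record = Dict[str, str]


def query_crm(town: str) -> Iterator[Record]:
    """Query the CRM silo and convert each match to the common format."""
    for row in CRM_ROWS:
        if row["Town"] == town:
            yield {"name": row["FullName"], "city": row["Town"], "source": "crm"}


def query_billing(town: str) -> Iterator[Record]:
    """Query the billing silo, which stores rows as (name, city) tuples."""
    for name, city in BILLING_ROWS:
        if city == town:
            yield {"name": name, "city": city, "source": "billing"}


# The virtual database holds no data itself, only a list of adapters.
BACKENDS: List[Callable[[str], Iterator[Record]]] = [query_crm, query_billing]


def federated_query(town: str) -> List[Record]:
    """Fan the query out to every back-end at request time and merge results."""
    results: List[Record] = []
    for backend in BACKENDS:
        results.extend(backend(town))
    return results
```

Because nothing is copied, every call to `federated_query` depends on the live back-end systems being available, which is the trade-off the other two options try to avoid.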
Another option is a Data Lake, a term well promoted in the Hadoop community. Data Lake has many definitions applied to it, but broadly speaking, were you to move all of your data from multiple silos into one single system (e.g. the Hadoop Distributed File System) it becomes a Data Lake. The data won’t necessarily be indexed, easily searchable or even usable, but it does eliminate the need to connect to a live system every time you need to access a record of something.
Hitting digital transformation targets
Then there is also a Data Hub option, which offers more of a ‘hub and spoke’ approach to data integration whereby the data is physically transferred into a new system and re-indexed to support data discovery and analytics.
When weighing up which option is best for them, organisations need to consider the capabilities each one provides for movement (copying data from one location to another so it is co-located), harmonisation (transforming data to common formats in order to derive new insights), indexing (enabling robust search query capabilities) and governance (the ability to control and audit the use of data, which is essential for compliance with regulations such as GDPR).
However, appropriate governance capabilities are not just vital for adhering to GDPR and other legislation. If they are not present in the application, organisations will have to manage multiple versions of the same data set across numerous sets of infrastructure. This is the most frequently overlooked aspect of planning for data integration programmes and can result in unforeseen costs and management overhead as departments and businesses scale.
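As a rough illustration of how those four capabilities fit together in a hub-style design, consider the following Python sketch. The `DataHub` class, its method names and the field mappings are hypothetical and exist only for this example; a real platform would add access control, persistence and much more. The point is simply that movement, harmonisation, indexing and governance can each be identified as a distinct step.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, List


class DataHub:
    """Toy hub: co-locates, harmonises, indexes and audits records."""

    def __init__(self) -> None:
        self.records: Dict[str, Dict[str, str]] = {}  # movement: local copies
        self.index = defaultdict(set)                 # indexing: term -> ids
        self.audit_log: List[Dict[str, str]] = []     # governance: who did what

    def _audit(self, action: str, target: str, user: str) -> None:
        self.audit_log.append({
            "action": action,
            "target": target,
            "user": user,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def ingest(self, record_id: str, raw: Dict[str, str], source: str,
               mapping: Dict[str, str]) -> None:
        # Harmonisation: rename source-specific fields to common names.
        doc = {common: raw[src] for src, common in mapping.items()}
        doc["source"] = source
        # Movement: keep a co-located copy inside the hub.
        self.records[record_id] = doc
        # Indexing: make every word of every value searchable.
        for value in doc.values():
            for term in str(value).lower().split():
                self.index[term].add(record_id)
        self._audit("ingest", record_id, user="system")

    def search(self, term: str, user: str) -> List[Dict[str, str]]:
        # Governance: every read is recorded alongside the user who made it.
        self._audit("search", term, user)
        ids = sorted(self.index.get(term.lower(), set()))
        return [self.records[i] for i in ids]
```

For example, ingesting one record from a CRM silo with the mapping `{"FullName": "name", "Town": "city"}` and another from a billing silo with `{"cust": "name", "loc": "city"}` leaves both searchable under the same `city` field, with each ingest and each search recorded in `audit_log`.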
From a public sector perspective, better use of data has been widely noted as an essential element in achieving the government’s digital transformation targets. The Department for Digital, Culture, Media and Sport (DCMS) is expected to publish its National Data Strategy in 2020, outlining the government’s vision for how data will be used across government departments, and the role it will play in helping to shape the economy until 2030.
New and emerging technologies such as AI, machine learning, IoT and advanced analytics are becoming more commonplace across central government departments, local authorities and private sector businesses. However, at the centre of all of this is data, which means the need for a data-first approach that prioritises security and maximises access has never been more important. What’s just as vital, however, are the architectural choices made to ensure the best levels of integration.
Chris Cherry, Public Sector Lead, MarkLogic