How the right data lineage tools can cinch data integrity in a big data environment

(Image credit: Pitney Bowes Software)

Value, velocity, variety and veracity are four of the five Vs that make big data a big business opportunity.

However, the V that outweighs all the others is volume. In a world where roughly 2.5 quintillion bytes of data are created every single day, applying the usual methodologies and technologies to release the value of that data at such scale is a challenge.

Compounding the problem, with big data often comes big noise. After all, the more information you have, the greater the chance that some of it is incorrect, duplicated, outdated or otherwise flawed. Today’s businesses are surrounded by a wide array of data—from operational to sensor data, web interaction to mobile data, static to streaming data. This is a challenge that most data analysts are prepared for, but one that IT teams need to factor into their downstream processing and decision making to ensure that bad data does not skew the resulting insights.

With the right analysis, this unparalleled breadth and depth of information can help companies better understand and engage with customers, take advantage of new market opportunities, optimise operations and much more.

However, the more business decisions are driven by data, the greater the pressure on IT to deliver high-value data infrastructure and big data integration projects quickly. This is why overarching big data analytics solutions alone are not enough to ensure data integrity in the era of big data. And while new technologies like AI and machine learning can help make sense of data en masse, they often rely on a certain amount of cleaning and condensing behind the scenes to be effective and to run at scale.

With many important business decisions hinging on data, it is crucial that the data is as accurate as it can possibly be. Even partial inaccuracy can have disastrous consequences for a company’s long-term goals. Tellingly, in a KPMG survey of senior-level executives, only 35 per cent said they had a high level of trust in the way their organisation uses data and analytics.

While a degree of error in the data can be tolerated, being able to find and eliminate mistakes where possible is a valuable capability – particularly when a configuration error or a problem with a single data source creates a stream of bad data that derails effective analysis and delays the time to value.

Without the right tools, these kinds of errors can create unexpected results and leave data professionals with an unwieldy mass of data to sort through to try and find the culprit.

This problem is compounded when data is ingested from multiple different sources and systems, each of which may have treated the data in a different way. The sheer complexity of big data architecture can turn the challenge from finding a single needle in a haystack to one more akin to finding a single needle in a whole barn.

Meanwhile, this problem no longer affects just the IT function and business decision making; overcoming it has become a legal requirement. A year ago, GDPR legislation came into force, dictating that businesses must find a way to manage and track all of their personal data, no matter how complicated the infrastructure or unstructured the information. In addition, upon receiving a valid request, organisations need to be able to delete information pertaining to an individual, or collect and share it as part of an individual’s right to data portability.

So, what’s the solution? One of the best ways to manage the beast of big data overall is also one that builds in data integrity: establishing full data lineage by automating data ingestion. This creates a clear record of where data originated and how it has been used over time.

In addition, because this process runs automatically, it is easier and more reliable than manual documentation. However, it is important to ensure that lineage is captured down to a fine level of detail. WhereScape automation software, for example, can retrospectively catalogue data sources and enable complex data extraction while ensuring compliance with GDPR requirements.
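To ground the idea, here is a minimal, hypothetical sketch in Python of how lineage metadata might be recorded alongside each ingested batch, so that every row can later be traced back to its source and load time. The table names, columns and in-memory SQLite store are illustrative assumptions, not a description of WhereScape’s implementation.

```python
import sqlite3
from datetime import datetime, timezone
from uuid import uuid4

# Hypothetical lineage-aware staging area: one table for the ingested rows,
# one for the lineage records describing where each batch came from and when.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_staging (
        row_id   TEXT PRIMARY KEY,
        batch_id TEXT,
        email    TEXT,
        payload  TEXT
    );
    CREATE TABLE lineage (
        batch_id    TEXT PRIMARY KEY,
        source_name TEXT,
        loaded_at   TEXT,
        row_count   INTEGER
    );
""")

def ingest(source_name, rows):
    """Load a batch of rows and record its lineage in the same transaction."""
    batch_id = str(uuid4())
    with conn:
        conn.executemany(
            "INSERT INTO customer_staging VALUES (?, ?, ?, ?)",
            [(str(uuid4()), batch_id, r["email"], r["payload"]) for r in rows],
        )
        conn.execute(
            "INSERT INTO lineage VALUES (?, ?, ?, ?)",
            (batch_id, source_name,
             datetime.now(timezone.utc).isoformat(), len(rows)),
        )
    return batch_id

# Example: two sources feeding the same staging table, each batch traceable
# back to its origin via the lineage table.
ingest("crm_export", [{"email": "a@example.com", "payload": "..."}])
ingest("web_clickstream", [{"email": "a@example.com", "payload": "..."}])
```

Because the data and its lineage record are written in the same transaction, a row can never arrive without a record of where it came from.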

With the right data lineage tools, ensuring data integrity in a big data environment becomes far easier. Data scientists can trace data back through the pipeline to explain what data was used, from where, and why.

Meanwhile, businesses can track down the data of a single individual, sorting through all the noise to fulfil subject access requests without disrupting the big data pipeline as a whole, or diverting significant business resource. As a result, analysis of big data can deliver more insight, and thus more value, faster – despite its multidimensional complexity.
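Continuing the hypothetical schema from the sketch above, tracing an individual’s data for a subject access request – or removing it under the right to erasure – becomes a straightforward query against the lineage-aware store rather than a trawl through the whole pipeline:

```python
import sqlite3

def trace_rows_for_subject(conn: sqlite3.Connection, email: str):
    """List what is held on one individual, where it came from and when it
    was loaded, by joining staged rows to their lineage records."""
    return conn.execute(
        """
        SELECT s.row_id, s.email, l.source_name, l.loaded_at
        FROM customer_staging AS s
        JOIN lineage AS l ON l.batch_id = s.batch_id
        WHERE s.email = ?
        """,
        (email,),
    ).fetchall()

def erase_subject(conn: sqlite3.Connection, email: str) -> int:
    """Delete an individual's rows in place (right to erasure) without
    touching the rest of the pipeline; returns the number of rows removed."""
    with conn:
        return conn.execute(
            "DELETE FROM customer_staging WHERE email = ?", (email,)
        ).rowcount
```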

Neil Barton, Chief Technology Officer, WhereScape