More than a decade ago, we entered an era of data deluge. Data continues to explode - it has been estimated that for each day of 2012, more than 2.5 exabytes (or 2.5 million terabytes) of data were created. Today, the same amount of data is produced every few minutes!
One reason for this big data deluge is the steady decrease in the cost per gigabyte, which has made it possible to store more and more data for the same price. In 2004, the price of 1 GB of hard disk storage dropped below the symbolic threshold of $1. It's now down to around three cents. Another reason is the expansion of the Web, which has allowed everyone to create content, and companies like Google, Yahoo, Facebook and others to collect ever-increasing amounts of data.
Big data systems require fundamentally different approaches to data governance than traditional databases. In this post, I'd like to explore some of the paradigm shifts caused by the data deluge and its impact on data quality.
The Birth of a Distributed Operating System
With the advent of the Hadoop Distributed File System (HDFS) and the YARN resource manager, a distributed data platform was born. With HDFS, very large amounts of data can now be placed in a single virtual location, much as you would store a regular file on your computer. And with YARN, that data can be processed by several engines: interactive SQL engines, batch engines or real-time streaming engines.
The ability to store and process data in one location makes an ideal framework for managing big data. The consulting firm Booz Allen Hamilton explored how this might work for organizations with its concept of a “data lake”: a place where all raw or unmodified data can be stored and easily accessed.
While a tremendous step forward in helping companies leverage big data, data lakes have the potential to introduce several quality issues, as outlined in an article by Barry Devlin. In summary, as the old adage goes: "garbage in, garbage out".
[caption id="attachment_112186" align="aligncenter" width="800"]Don't let the inside of your data centre look like this[/caption]
Being able to store petabytes of data does not guarantee that all of it will be useful or usable. Indeed, as a recent New York Times article noted: “Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”
A related concept the industry is discussing is the data reservoir. The premise is to perform quality checks and data cleansing before inserting the data into the distributed system. Rather than being raw, the data is therefore ready to use.
Accessibility is one data quality dimension that benefits from the data lake and data reservoir concepts. Hadoop makes all data accessible, even legacy data: everything can be stored in the data lake, so tapes and other dedicated storage systems, for which accessibility was a known issue, are no longer required.
But distributed systems also come with an intrinsic constraint, captured by the CAP theorem. The theorem states that, in the presence of network partitions, a system cannot provide both data consistency and data availability.
Therefore, with the Hadoop Distributed File System - a partition-tolerant system that guarantees consistency - the availability dimension of data quality can't be guaranteed. Data may be unreadable until all of its copies on different nodes are synchronized (consistent).
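This consistency-over-availability choice can be illustrated with a deliberately simplified toy model - not how HDFS is actually implemented - in which a replicated value refuses reads while its replicas are out of sync:

```python
# Toy sketch of a CP (consistent, partition-tolerant) replicated value:
# reads are unavailable until every replica has applied the latest write.
# This is an illustration of the trade-off, not HDFS internals.
class ConsistentReplicatedValue:
    def __init__(self, n_replicas=3):
        self.replicas = [None] * n_replicas
        self.staged = None
        self.pending = 0                 # replicas not yet synchronized

    def write(self, value):
        """Start a write; every replica must acknowledge before reads resume."""
        self.staged = value
        self.pending = len(self.replicas)

    def ack_replica(self, i):
        """Replica i finishes applying the staged write."""
        self.replicas[i] = self.staged
        self.pending -= 1

    def read(self):
        """Serve a read only when all replicas are consistent."""
        if self.pending > 0:
            raise RuntimeError("unavailable: replicas not yet consistent")
        return self.replicas[0]

v = ConsistentReplicatedValue(n_replicas=2)
v.write("hello")
# v.read() here would raise: the system sacrifices availability
v.ack_replica(0)
v.ack_replica(1)
print(v.read())  # 'hello'
```

An AP system would make the opposite choice: always answer the read, at the risk of returning a stale value.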
Clearly, this is a major stumbling block for organizations that need to scale and want to immediately use insights derived from their data.
As Marissa Mayer, then at Google, put it: “speed matters”. A few hundred milliseconds of extra delay in answering a query and the organization will lose customers. Finding the right compromise between data latency and consistency is therefore a major challenge in big data, although the trade-off tends to bite only in the most extreme situations, and innovative technologies keep appearing to tackle it.
Co-location of Data and Processing
Before Hadoop, when organizations wanted to analyze data stored in a database, they would extract it from the database and move it into another tool or another database to conduct analysis or other tasks. Reporting and analysis are usually done on a data mart containing aggregated data from the operational databases; as the system scales, they can't be run directly on the operational databases that hold the raw data.
With Hadoop, the data stays in Hadoop. The processing algorithm is sent to the Hadoop MapReduce framework, where it can still access the raw data. This is a major change in how the industry manages data: the data is no longer moved out of the system to be processed by some algorithm or software. Instead, the algorithm is shipped into the system, close to the data it processes. The prerequisite to reap this benefit is that applications can run natively in Hadoop.
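The "ship the algorithm to the data" idea can be sketched as a MapReduce-style job, in the spirit of Hadoop Streaming (where the mapper and reducer are plain scripts shipped to the cluster). The record format and field positions below are illustrative assumptions; the shuffle phase is simulated locally:

```python
# Minimal MapReduce-style sketch: count customer records per country.
# The mapper and reducer are the "algorithm" that would be shipped to
# the data; run_job simulates map -> shuffle/sort -> reduce locally.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit (country, 1) for each record shaped like 'id,name,country'."""
    fields = line.split(",")
    yield fields[2].strip(), 1

def reducer(key, values):
    """Sum the counts for one key (runs after the shuffle/sort phase)."""
    yield key, sum(values)

def run_job(records):
    """Local simulation of the MapReduce execution model."""
    mapped = [kv for line in records for kv in mapper(line)]
    mapped.sort(key=itemgetter(0))                     # shuffle/sort
    result = {}
    for key, group in groupby(mapped, key=itemgetter(0)):
        for k, total in reducer(key, (v for _, v in group)):
            result[k] = total
    return result

records = [
    "1,Alice,FR",
    "2,Bob,US",
    "3,Carol,FR",
]
print(run_job(records))  # {'FR': 2, 'US': 1}
```

On a real cluster, only the two small functions travel over the network; the records never leave the nodes that store them.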
For data quality, this is a significant improvement, as you no longer need to extract data in order to profile it. You can work with the whole dataset rather than with samples or selections. In-place profiling combined with big data systems opens new doors for data quality; it even becomes possible to run some data cleansing processes inside the big data framework rather than outside it.
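As a concrete illustration of in-place profiling, here is a minimal sketch of one classic profiling metric - completeness, the share of non-empty values per column - computed over the full dataset rather than a sample. The column names are illustrative:

```python
# Minimal profiling sketch: per-column completeness (fraction of rows
# holding a non-empty value), computed over the whole dataset.
def profile_completeness(rows, columns):
    """Return {column: fraction of rows with a non-empty value}."""
    counts = {c: 0 for c in columns}
    total = 0
    for row in rows:
        total += 1
        for c in columns:
            if row.get(c) not in (None, ""):
                counts[c] += 1
    return {c: counts[c] / total for c in columns}

rows = [
    {"name": "Alice", "email": "alice@example.com"},
{"name": "Bob",   "email": ""},
    {"name": "",      "email": None},
]
print(profile_completeness(rows, ["name", "email"]))
# name scores 2/3, email scores 1/3
```

In a Hadoop setting, the same counting logic would run as a map-reduce job next to the data instead of on an extract.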
With traditional databases, the schema of the tables is predefined and fixed. This means that data that does not fit into the schema constraints will be rejected and will not enter the system. For example, a long text string may be rejected if the column size is smaller than the input text size.
Ensuring constraints with this kind of "schema-on-write" approach surely helps to improve the data quality, as the system is safeguarded against data that doesn’t conform to the constraints.
Of course, very often, constraints are relaxed for one reason or another and bad data can still enter the system. Most often, integrity constraints such as the no null value constraint are relaxed so that some records can still enter the system even though some of their fields are empty.
However, at least some constraints dictated by a data schema may mandate a level of preparation before data goes into the database. For instance, a program may automatically truncate text that is too long, or add a default value when the data cannot be null, so that the record can still enter the system.
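Such a preparation step can be sketched in a few lines. The length limit and default value below are illustrative assumptions, not any particular database's rules:

```python
# Minimal sketch of schema-on-write preparation: coerce a value so it
# satisfies a NOT NULL constraint and a column length constraint before
# insertion. The limit and default are illustrative.
def prepare_for_insert(value, max_len, default):
    """Return a value that passes NOT NULL and length constraints."""
    if value is None:
        return default          # NOT NULL constraint: fall back to default
    return value[:max_len]      # column size constraint: truncate

print(prepare_for_insert("a very long description", 10, "n/a"))  # 'a very lon'
print(prepare_for_insert(None, 10, "n/a"))                       # 'n/a'
```

Note that this "fix" is itself a quality trade-off: the record gets in, but information is silently lost or invented.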
Big data systems such as HDFS take a different strategy: a "schema-on-read" approach. There is no constraint on the data going into the system; the schema is defined as the data is read, like a “view” in a database. We may define several views on the same raw data, which makes the schema-on-read approach very flexible.
However, in terms of data quality, letting any kind of data enter the system is probably not viable. Accepting a variety of data formats requires a processing algorithm that defines an appropriate schema-on-read to serve the data.
For instance, such an algorithm would unify two different date formats like 01-01-2015 and 01/01/15 in order to display a single date format in the view. And it could become much more complex with more realistic data. Moreover, as new, evolving input data is absorbed into the system, the change must be handled by the algorithm that produces the view.
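The date-unification step above can be sketched as follows. Assuming - as an illustration - that both formats are day-first, the view tries each known format in turn and serves a single canonical ISO date:

```python
# Minimal schema-on-read sketch: unify the two date formats from the
# text (01-01-2015 and 01/01/15) into ISO 8601 at read time.
# Treating both formats as day-first is an assumption for illustration.
from datetime import datetime

CANDIDATE_FORMATS = ["%d-%m-%Y", "%d/%m/%y"]

def read_date(raw):
    """Try each known format; return one canonical ISO date string."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

print(read_date("01-01-2015"))  # '2015-01-01'
print(read_date("01/01/15"))    # '2015-01-01'
```

Every new source format means another entry in the candidate list - which is exactly how such view-producing algorithms grow in complexity over time.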
As time passes, the algorithm will become more and more complex. The more complex the input data becomes, the more complex the algorithm that parses, extracts and fixes it becomes - to the point where it becomes impossible to maintain.
Pushing this reasoning to its limits, some of the transformations executed by the algorithm can be seen as data quality transformations (unifying the date format, capitalizing names, …). Data quality then becomes a cornerstone of any big data management process, and the data governance team may have to manage “data quality services” rather than focus only on the data itself.
On the other hand, the data read through these "views" still needs to obey most of the standard data quality dimensions, and a data governance team would also define data quality rules on the data retrieved from the views.
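Such rules can be as simple as named predicates evaluated against each record a view serves. The two rules below are illustrative assumptions, the kind a governance team might define:

```python
# Minimal sketch of data quality rules applied to records served by a
# schema-on-read view. The rules themselves are illustrative.
def check(record, rules):
    """Return the names of the rules that the record violates."""
    return [name for name, rule in rules.items() if not rule(record)]

rules = {
    "email_has_at":  lambda r: "@" in (r.get("email") or ""),
    "name_not_null": lambda r: bool(r.get("name")),
}

print(check({"name": "Alice", "email": "alice@example.com"}, rules))  # []
print(check({"name": "", "email": "bob"}, rules))
# ['email_has_at', 'name_not_null']
```

Keeping the rules as named, declarative entries makes them something the governance team can catalogue and version alongside the views themselves.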
This raises the question of the data lake versus the data reservoir. Schema-on-read brings huge flexibility to data management, but controlling the quality and accuracy of the data can then become extremely complex and difficult. There is a clear need to find the right compromise.
We see here that data quality is pervasive at all stages in Hadoop systems and not only involves the raw data, but also the transformations done in Hadoop on this data. This shows the importance of well-defined data governance programs when working with big data frameworks.
By Sebastiao Correia, director of product development for Data Quality at Talend