Diving into the data lake


Business users of technology blissfully love how it makes their jobs easier, more efficient and fun, but often have little interest in how the different tools perform their magic. This is basic human nature. Most of us don’t know how our televisions and mobile phones work; our only concern is that they continue to provide what we expect of them. 

Technologists, on the other hand, thrive on the inner workings of technology solutions. Like all engineers, they are driven by nature to poke around and tinker with the design, fabrication and testing of new ways of doing things better. This is the case with today’s data lake, a marked upgrade from yesterday’s data warehouse-based solutions.

While technologists could banter for hours with a business user about the nuanced differences between a data lake and a data warehouse, chances are the listener’s brain will short-circuit and explode. That’s because technologists often use abstruse terms in their speech, a rarefied language best understood by their peers. We’re not deliberately talking down to business leaders; it’s just how we speak (I, too, can be tech-jargon intensive).

Why is this a problem and what does this have to do with a data lake? The answer is that people’s eyes glaze over when technologists use terms like “relational database,” “parallel processing” and “online analytical processing” in trying to make the case for a data lake. Even worse is when we resort to acronyms like SQL, OLAP, TDWI and RDBMS. My concern is that a cursory understanding of a data lake could result in a company passing on its value.

So without further ado, here is my primer on data lakes, a sort of Data Lake for Dummies (forgive me). I’ve tried to strip out the tech-jargon and acronyms because I believe it’s extremely important that readers grasp the tremendous value a data lake provides. It’s not brain surgery, but it is complicated. 

Straight Talk 

At its most basic, a data lake is a repository of raw data coming in from a multiplicity of internal and external sources. By “raw,” I mean the data has not been cleansed or transformed in any way. Once the data flows into the data lake, it becomes available for analytical purposes. To turn the lake into a reservoir of useful information, machine learning and artificial intelligence tools can help do the filtering.
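For the technically curious, here is a minimal sketch in Python of what “raw” ingestion can look like. The paths and names are hypothetical (real lakes typically sit on object storage such as Amazon S3 or Azure Data Lake Storage); the point is that the file lands byte for byte, with no structure imposed at write time.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical landing zone for illustration only.
LAKE_ROOT = Path("/data-lake/raw")

def land_file(source_path: str, source_system: str) -> Path:
    """Land a file in the lake exactly as received: no cleansing, no
    transformation, no upfront schema. Structure is applied later, at
    read time, by whoever analyses the data."""
    day = datetime.now(timezone.utc).strftime("%Y/%m/%d")
    target_dir = LAKE_ROOT / source_system / day
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(source_path).name
    shutil.copy2(source_path, target)  # byte-for-byte copy of the raw file
    return target
```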

A traditional data warehouse is also a repository of data. The chief difference (there are several) between the two is that data in a data warehouse has been cleansed for easy consumption. In developing the warehouse, much time is spent evaluating data sources to ensure they conform to specific business processes and user needs. In other words, before the data can be loaded into the warehouse, its use case has already been modelled. The highly structured nature of a data warehouse can be altered, but doing so takes a lot of time.
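By way of contrast, here is a rough sketch of the warehouse approach, with an invented orders schema standing in for the modelled business process. Every record must conform before it is loaded, which is exactly why accommodating new kinds of data is slow.

```python
from datetime import date

# Hypothetical, pre-agreed warehouse schema. Every load must conform to
# it, and changing it means re-modelling with the business first.
ORDER_SCHEMA = {"order_id": int, "customer": str, "total": float, "order_date": date}

def conform(record: dict) -> dict:
    """Schema-on-write: validate and cleanse *before* loading, so the
    warehouse only ever holds data that fits the agreed model."""
    clean = {}
    for field, expected_type in ORDER_SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], expected_type):
            raise ValueError(f"{field} must be {expected_type.__name__}")
        clean[field] = record[field]
    return clean  # only now is the record fit for the warehouse
```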

Here’s another way to look at the differences. In a warehouse, products (data) are packaged up and put on a shelf. In a lake, diverse fish and other organisms (data) swim freely. The warehouse is great for storing historical and current data used to create reports. The lake is great for business users wanting to pull up whatever they need for modelling purposes—a few fish here, a few fish there, and a crab or two.

ALL data flows into a data lake, creating a really BIG data repository. Structured and unstructured data alike flow into the lake, even data that may not seem to have present value but may prove useful in the future. The source of the data and its structure are irrelevant. By contrast, a data warehouse is unlikely to be populated with non-traditional data such as social media posts, texts, images, video, and sensor-produced data from the fast-expanding Internet of Things (IoT). Theoretically, this is possible, but it would take too long to change the structure of the warehouse to accommodate the additional sources of data.

With more types of data coming in from infinitely more sources at rapidly expanding volumes (just think about all those “smart” connected devices in the IoT), and more sophisticated artificial intelligence technologies coming on stream soon to make sense of this overflow, a data warehouse would buckle under the strain. An expansive data lake can easily absorb this torrent of evolving information sources. Armed with a good fishing pole (“advanced algorithms” in tech-jargon), users can obtain insights from a data lake at much faster speeds than a forklift lumbering through the warehouse looking for the right package on the right shelf. 
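To make the fishing pole concrete, here is a hedged sketch of what technologists call “schema-on-read.” The file layout and field names below are invented; the idea is that the analyst applies structure only at the moment of the query, not before.

```python
import json
from pathlib import Path

def fish(lake_dir: str, predicate) -> list[dict]:
    """Schema-on-read: scan raw JSON-lines files in the lake and keep
    only the records the analyst cares about right now. No upfront
    modelling was needed to make this query possible."""
    catch = []
    for path in Path(lake_dir).rglob("*.jsonl"):
        with path.open() as f:
            for line in f:
                record = json.loads(line)
                if predicate(record):
                    catch.append(record)
    return catch

# A few fish here, a few fish there: IoT sensor readings above a threshold.
hot_sensors = fish("/data-lake/raw/iot", lambda r: r.get("temp_c", 0) > 40)
```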

Tackle and Gear 

This is not to imply that a data lake is a “be all and end all” solution. While any and all forms of data can flow into a data lake, plumbing is needed to move the data out of the traditional business silos where it tends to reside.   

By connecting the hundreds of endpoints an organisation has and creating a constant flow of millions of data elements into the data lake, teams can extract and combine data from diverse sources and silos and transform it into useful information almost instantly. This step of bringing the data together and piping it all into the lake in real time is critical before it can be filtered through analytics for users’ modelling needs.
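As a rough illustration of that plumbing (the polling endpoint and record format here are assumptions for the sketch, not any particular product’s interface), a pipeline simply keeps draining each silo and appending whatever arrives into the lake, untransformed.

```python
import json
import time
from datetime import datetime, timezone
from pathlib import Path

LAKE_ROOT = Path("/data-lake/raw")

def pipe(source_name: str, fetch_new_records, poll_seconds: int = 5) -> None:
    """Continuously drain one silo into the lake. `fetch_new_records`
    is a hypothetical callable returning whatever the source endpoint
    has produced since the last poll."""
    while True:
        records = fetch_new_records()
        if records:
            stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
            target = LAKE_ROOT / source_name / f"{stamp}.jsonl"
            target.parent.mkdir(parents=True, exist_ok=True)
            with target.open("a") as f:
                for record in records:
                    f.write(json.dumps(record) + "\n")  # raw, untransformed
        time.sleep(poll_seconds)
```

In practice, integration platforms run this kind of flow across hundreds of endpoints at once, so no one hand-writes a loop per silo.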

The value of data to a business cannot be overstated. Astute decision-making requires access to insightful information. A lake is a source of water supply. A data lake is a reservoir of intelligence.

Diletta D’Onofrio, Head of Digital Transformation at SnapLogic 

Image source: Shutterstock / Bruce Rolff