In 2020, the entire digital universe is expected to reach 44 zettabytes, and by 2025 an estimated 463 exabytes of data will be created each day globally, according to the World Economic Forum. With every day that passes, organisations find themselves facing a bigger data management challenge than the day before.
Since its inception in the late 1980s, data warehouse technology has continued to evolve, and MPP (Massively Parallel Processing) architectures brought about systems able to handle larger data sizes. And while data warehouses are great for structured data, enterprises today also have to handle unstructured and semi-structured data, as well as data with high variety, velocity and volume. For many of these use cases, data warehouses are not suitable, and they are certainly not the most cost-efficient option.
As companies started collecting large amounts of data from numerous different sources, architects began envisioning a single system to house data for many different analytic products and workloads.
A big step forward for data
About a decade ago, companies began building data lakes: “depots” for raw data in a variety of formats. But while suitable for storing data, data lakes neither support transactions nor enforce data quality. Additionally, their lack of consistency makes it almost impossible to mix appends and reads, or batch and streaming jobs. As a result, many of the data lakes’ promises have not materialised, while many of the benefits of data warehouses have been lost along the way.
In fact, companies still require systems for diverse data applications such as real-time monitoring, machine learning, data science, and SQL analytics. Furthermore, most of the recent advances in AI have been around better models to process unstructured data, which is exactly the type of data that a data warehouse is not optimised for.
To solve these issues, enterprises tend to use a combination of multiple systems that include data lakes, data warehouses and other specialised systems such as image databases and streaming engines. However, using a multitude of systems leads to complexity and delays, as data professionals have to move or copy data between them.
Moving to a lakehouse paradigm
Today, new systems have emerged to address the limitations of data lakes. A lakehouse is a new paradigm combining the best elements of data lakes and data warehouses. Lakehouses are enabled by a new system design that implements data management features and structures similar to those in a data warehouse, while running directly on the low-cost storage used for data lakes.
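One way to picture this design is a small transactional metadata layer sitting on top of cheap file or object storage, with readers and writers coordinating through an append-only commit log. The sketch below is purely conceptual (the `TransactionLog` class and file layout are invented for illustration, not how any particular lakehouse engine is implemented):

```python
import json
import os
import tempfile

class TransactionLog:
    """Conceptual sketch: an append-only commit log layered over cheap
    file storage, giving atomic, ordered commits across raw data files.
    (Illustrative only; real lakehouse engines are far richer.)"""

    def __init__(self, log_dir):
        self.log_dir = log_dir
        os.makedirs(log_dir, exist_ok=True)

    def _next_version(self):
        # Each committed version is one JSON file in the log directory.
        return len(os.listdir(self.log_dir))

    def commit(self, added_files):
        # Stage the commit to a temp file, then rename it into place.
        # The rename is atomic, so readers see the whole commit or nothing.
        version = self._next_version()
        entry = {"version": version, "add": list(added_files)}
        fd, tmp = tempfile.mkstemp(dir=self.log_dir)
        with os.fdopen(fd, "w") as f:
            json.dump(entry, f)
        os.rename(tmp, os.path.join(self.log_dir, f"{version:020d}.json"))
        return version

    def snapshot(self):
        # Readers reconstruct the current table state by replaying the
        # log in order, so a half-finished write is never visible.
        files = []
        for name in sorted(os.listdir(self.log_dir)):
            with open(os.path.join(self.log_dir, name)) as f:
                files.extend(json.load(f)["add"])
        return files
```

Because the log, not the storage system, defines what the table contains, the same cheap storage that holds a data lake can serve transactional, warehouse-style reads and writes.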
A lakehouse is what you would get if you redesigned the data warehouse for the modern world, now that cheap and highly reliable storage is available. Data lakehouses give you data diversity, transparency and high performance.
- Lakehouses offer support for both structured and unstructured data and diverse workloads, such as data science, machine learning, SQL and analytics.
- Lakehouses offer schema enforcement and governance, meaning the system can reason about data integrity and has robust governance and auditing mechanisms. Lakehouses also support open APIs and end-to-end streaming, delivering real-time reports and eliminating the need for separate systems dedicated to serving real-time data applications.
- A lakehouse enables business intelligence (BI) tools to work directly on the source data. In addition, a lakehouse’s storage is decoupled from compute, so the system can scale to many more concurrent users and larger data sizes. This decoupling makes it inherently easier for businesses to scale their BI and ML projects, because compute capacity is not defined or limited by the size of the underlying storage.
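To make the schema enforcement point above concrete, the toy sketch below rejects writes that do not match a table’s declared schema, so bad records fail at write time instead of being discovered later by readers. The `Table` class and column names are invented for illustration; real lakehouse engines enforce schemas far more thoroughly:

```python
class SchemaError(ValueError):
    """Raised when an incoming row violates the table's declared schema."""

class Table:
    def __init__(self, schema):
        self.schema = schema  # e.g. {"user_id": int, "amount": float}
        self.rows = []

    def append(self, row):
        # Reject rows with missing or extra columns.
        if set(row) != set(self.schema):
            raise SchemaError(f"columns {sorted(row)} != {sorted(self.schema)}")
        # Reject rows whose values have the wrong type.
        for col, expected in self.schema.items():
            if not isinstance(row[col], expected):
                raise SchemaError(f"{col}: expected {expected.__name__}")
        self.rows.append(row)

# Usage: a well-formed row is accepted, a malformed one is rejected at write time.
events = Table({"user_id": int, "amount": float})
events.append({"user_id": 1, "amount": 9.99})        # accepted
try:
    events.append({"user_id": "abc", "amount": 9.99})  # wrong type: rejected
except SchemaError:
    pass
```

The same write-time check is what lets the system "reason about data integrity": readers can trust that every stored row already conforms to the schema.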
In the modern enterprise, tools for security and access control are basic requirements, and in light of recent privacy regulations, data governance capabilities including auditing, retention and lineage have also become essential. Lakehouses allow these fundamental features to be implemented, tested and administered in a single system.
From business intelligence to artificial intelligence
In an age when machine learning is poised to disrupt every industry, lakehouses radically accelerate innovation while simplifying enterprise data infrastructure. Previously, the data a company used for its products or decision making was structured data from operational systems; today, many products incorporate artificial intelligence (AI) in the form of text mining, computer vision and speech models, among others.
Ultimately, lakehouses provide the data versioning, governance, security and ACID properties that are needed even for unstructured data, which makes them a better choice than a plain data lake for AI.
It must also be said that while current lakehouses reduce cost, their performance can still lag behind specialised systems, such as data warehouses, that have years of investment and real-world deployment behind them. In addition, users may favour certain tools (e.g. BI tools, IDEs, notebooks) over others, so lakehouses will also need to improve their user experience and their connectors to popular tools in order to appeal to a variety of business personas.
These and other issues will be addressed as the technology continues to mature. Ultimately, lakehouses will close these gaps while retaining the core properties of being more cost-efficient, simpler and more capable of serving diverse data applications.
What does this mean for the everyday person? Well, for one, better AI-led data services will mean improved customer experiences through the increased ability to offer personalised products and services. But beyond that, being able to process, understand and action the astonishing amount of data we have at hand has the potential to facilitate incredible breakthroughs in a number of fields including science and healthcare that could change the way we understand diseases and even treat patients in the future.
Bharath Gowda, VP of Product Marketing, Databricks