How data warehouses, data lakes and data hubs differ in focus and work better together

Data continues to grow more diverse and more distributed, as do the sources of data and the points of data consumption. At the same time, analytical needs and operational uses of data are proliferating across the enterprise and beyond. Stakeholder needs can no longer be met by traditional architectures that are based on centrally collecting data and enabling predefined uses. Data and analytics leaders and their teams need to deliver a modern data management infrastructure that supports flexibility, diversity of data needs and connectedness.

This requires a combination of different data organization and processing approaches. However, some data and analytics teams are still focused on meeting all needs using a single architectural pattern: a traditional enterprise data warehouse, a modern data lake or a data hub.

There is significant confusion between these concepts. Many organizations will use these terms interchangeably or will use the same term to mean different things in different scenarios. For example, while Gartner client inquiries referring to data hubs increased by 20 percent from 2018 through 2019, more than 25 percent of these inquiries were actually about data lake concepts. This suggests that there is confusion about or misuse of the terminology.

There is also a lack of clarity about the roles of data warehouses and data lakes. An estimated 30 percent of clients posing data lake inquiries are either considering a data lake as a replacement for a data warehouse or are otherwise unclear about the relationship between data lakes and data warehouses.

All three of these architectural patterns (data warehouses, data lakes and data hubs) are key areas of investment. However, there is a need for greater clarity and focus. Data and analytics leaders must understand the purpose of these three types of structures and the role they can play in a modern data management infrastructure.

Tale of three: Data warehouses versus data lakes versus data hubs

Data and analytics teams should think of data warehouses and data lakes as similar types of structures. Their primary purpose is to support analytics (albeit of different styles). In contrast, data and analytics leaders should think of data hubs as more operational structures, focused on enabling data sharing and governance.

Data warehouses store well-known and structured data. They support predefined and repeatable analytics needs that can be scaled across many users in the organization. Data warehouses are suited to complex queries, high levels of concurrent access and stringent performance requirements.

Data lakes collect unrefined data (that is, data in its native form, with limited transformation and quality assurance) and events captured from a diverse array of source systems. Data lakes usually support data preparation, exploratory analysis and data science activities.

Data warehouses and data lakes are similar. Both provide an endpoint for collection of transactional, detailed data (and possibly other types of data) specifically to support the execution of analytical workloads. This means that various kinds of analytics can be run atop them, accessing the data they hold to support analytic processing. As a result, both data warehouses and data lakes have a common focus — supporting the analytics needs of the enterprise. While data warehouses and data lakes may also include governance controls (for example, they can provide monitoring and resolution of quality issues in inbound data), they support governance in a more reactive and “downstream” manner.

Data hubs are quite different because they generally do not store detailed data for extended periods. Also, data hubs are not repositories on which analytic workloads are generally executed. Rather, they are points of mediation and data sharing. Data hubs enable data flow in the enterprise by connecting producing systems and processes with consuming systems and processes. For example, a data hub can be used to connect business applications to a data warehouse or a data lake. They also proactively apply governance controls to the data flowing across the infrastructure.
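The mediation role described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DataHub` class, its `add_rule`, `subscribe` and `publish` methods, the `orders` topic and the sample records are all hypothetical names invented for this sketch, not part of any product or of the original analysis. It shows a hub that checks governance rules before delivery and retains no detailed data itself.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Consumer = Callable[[Record], None]
Rule = Callable[[Record], bool]


class DataHub:
    """Toy hub (hypothetical): routes records to consumers, stores none."""

    def __init__(self) -> None:
        self._consumers: Dict[str, List[Consumer]] = {}
        self._rules: List[Rule] = []
        self.rejected = 0  # a counter only; rejected records are not kept

    def add_rule(self, rule: Rule) -> None:
        """Register a governance check applied to every published record."""
        self._rules.append(rule)

    def subscribe(self, topic: str, consumer: Consumer) -> None:
        """Connect a consuming system to a topic."""
        self._consumers.setdefault(topic, []).append(consumer)

    def publish(self, topic: str, record: Record) -> bool:
        """Apply rules first (proactive governance), then fan out."""
        if not all(rule(record) for rule in self._rules):
            self.rejected += 1
            return False
        for consumer in self._consumers.get(topic, []):
            consumer(record)
        return True


# Stand-ins for the endpoints; real systems would sit behind these lists.
warehouse_rows: List[Record] = []
lake_objects: List[Record] = []

hub = DataHub()
hub.add_rule(lambda r: r.get("customer_id") is not None)
hub.subscribe("orders", warehouse_rows.append)
hub.subscribe("orders", lake_objects.append)

hub.publish("orders", {"customer_id": 42, "amount": 1999})
hub.publish("orders", {"customer_id": None, "amount": 500})  # fails the rule
```

Only the first record reaches the two endpoints. The second is stopped at the hub, which is the "proactive" contrast with the downstream quality checks a warehouse or lake would apply after the data has already landed.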

The three structures are best used in combination

It is important to recognize that these architectural patterns can bring more value to the enterprise when used in combination. Data and analytics leaders should not simply choose among a data warehouse, a data lake and a data hub.

Instead, they should consider combinations of these structures to support the full range of current and anticipated requirements. The data warehouse, data lake and data hub can be combined to work together in an effective architecture.

Common patterns involving combinations of these structures continue to emerge. For example:

  • A "hub-centric" architecture, where data lakes, data warehouses, operational systems and other data producers and consumers are all endpoints connected to a data hub. The data hub is the conduit through which all data is shared among, and provisioned to, any type of consumption point.
  • A collect-centric pipeline architecture for analytics, where operational data is delivered into a data hub, from which it is provisioned into a data lake. From the data lake, refined data and the results of data discovery activities might then be loaded to the data warehouse for structured and repeatable access by a more diverse set of constituents.
  • An edge-centric architecture, driven by IoT use cases, where endpoint devices send data to a gateway, which acts like a data hub, for operational use. From the hub, data makes its way into the lake for refinement, and finally into the warehouse.
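The collect-centric flow in the list above (hub to lake to warehouse) can be pictured with a small Python sketch. Everything here is assumed for illustration: the function names `land_in_lake`, `refine` and `load_warehouse`, the field names and the sample events are hypothetical, and plain lists and dictionaries stand in for real storage.

```python
from collections import defaultdict
from typing import Dict, List

# Events as they might arrive from a hub: strings, inconsistent casing,
# and one value that cannot be parsed. Amounts are in cents.
raw_events: List[Dict[str, str]] = [
    {"order_id": "1", "region": "EU", "amount_cents": "1999"},
    {"order_id": "2", "region": "eu", "amount_cents": "501"},
    {"order_id": "3", "region": "US", "amount_cents": "n/a"},
]


def land_in_lake(events: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Lake step: keep every event in its native form, untransformed."""
    return list(events)


def refine(lake: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Refinement: normalise casing and types, skip unparseable rows."""
    refined: List[Dict[str, object]] = []
    for event in lake:
        try:
            cents = int(event["amount_cents"])
        except ValueError:
            continue  # the raw row still exists in the lake for exploration
        refined.append({"region": event["region"].upper(), "cents": cents})
    return refined


def load_warehouse(rows: List[Dict[str, object]]) -> Dict[str, int]:
    """Warehouse step: a fixed, predefined shape (revenue per region)."""
    totals: Dict[str, int] = defaultdict(int)
    for row in rows:
        totals[str(row["region"])] += int(row["cents"])
    return dict(totals)


lake = land_in_lake(raw_events)
warehouse = load_warehouse(refine(lake))
```

The lake keeps all three raw events, including the malformed one, while the warehouse holds only the refined, repeatable aggregate. That split mirrors the exploratory versus predefined roles described earlier.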

A key element of modern data management infrastructure is the ability to be dynamic — to evolve architectural patterns over time, enable new connections and support new use cases.

Data and analytics teams should regularly review requirements to decide how to evolve. For example, this might mean adding new endpoints to existing hub environments, creating new data hubs as new collections of endpoints with data sharing requirements emerge, or shifting the relationship between data warehouses and data lakes to optimize the logical data warehouse.

In addition, given the dynamic and distributed nature of these patterns, metadata capabilities to express and guide the connections and data flow between the structures become critical to success.
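One simple way to picture metadata that expresses connections and data flow is a small directed graph of which structure feeds which. The sketch below is a hedged illustration only: the `FLOWS` mapping, the node names and the `downstream` helper are hypothetical and not drawn from any particular metadata tool.

```python
from collections import deque
from typing import Dict, List

# Hypothetical flow metadata: each key sends data to the listed targets.
FLOWS: Dict[str, List[str]] = {
    "crm_app": ["data_hub"],
    "iot_gateway": ["data_hub"],
    "data_hub": ["data_lake", "billing_app"],
    "data_lake": ["data_warehouse"],
    "data_warehouse": [],
    "billing_app": [],
}


def downstream(node: str, flows: Dict[str, List[str]]) -> List[str]:
    """Return every structure reachable from `node`, in breadth-first order."""
    seen: List[str] = []
    queue = deque(flows.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        queue.extend(flows.get(current, []))
    return seen


affected = downstream("crm_app", FLOWS)
```

With flows recorded this way, adding a new endpoint or hub becomes an edit to the metadata, and a question such as "what is affected if the CRM feed changes?" becomes a graph traversal rather than a manual audit.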

Ted Friedman, Research Vice President, Gartner