Building a common data platform for the enterprise on Apache Hadoop

To become a data-driven enterprise, organisations must process all types of data, whether it be structured transactions or unstructured file server data such as social, IoT or machine data. Competitive advantage is at stake, and companies failing to evolve into data-driven organisations risk serious business disruption from competitors and startups.

Fortunately, we live in a time of unprecedented innovation in enterprise software, and enterprise data has finally become manageable on a large scale. With the open-source Apache Hadoop framework supporting enterprise archives, data lakes and advanced analytics applications, enterprise data management solutions can now turn the tide on data growth challenges.

Enter the Common Data Platform (CDP): a uniform data collection system for structured and unstructured data featuring low-cost data storage and advanced analytics. In this article, I define the components of a CDP and explain where it stands alongside the traditional enterprise data warehouse.

1. Apache Hadoop

Apache Hadoop is the backbone of the CDP. Hadoop is an open-source data management system that distributes and processes large amounts of data in parallel across multiple servers and distributed nodes. It’s engineered with scalability and efficiency in mind, and designed to run on low-cost commodity hardware. Using the Hadoop Distributed File System (HDFS), Hive, and the MapReduce or Spark programming models, Apache Hadoop can service almost any enterprise workload.
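To make the MapReduce programming model concrete, here is a minimal single-process sketch in plain Python. It does not use Hadoop itself; the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions, and a real cluster would run the map and reduce steps in parallel across nodes.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle step: group all emitted values by key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each line is mapped independently and each key is reduced independently, both steps can be spread across many machines, which is what lets the model scale on commodity hardware.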

Hadoop supports any data, structured or unstructured, in many different formats, making it ideal as a uniform data collection system across the enterprise. By denormalising data into an Enterprise Business Record (EBR), all enterprise data may be text searched and processed through queries and reports. Unstructured data from file servers, email systems, machine logs and social sources is easily ingested and retrieved as well.
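As a rough illustration of denormalisation, the sketch below folds a customer row and its related order rows into one self-contained record and then runs a naive text search over the result. The field names and the build_ebr and text_search helpers are hypothetical; the actual layout of an EBR is not specified here.

```python
import json


def build_ebr(customer: dict, orders: list[dict]) -> dict:
    """Fold one customer row and its order rows into a single nested record."""
    return {
        "customer_id": customer["id"],
        "customer_name": customer["name"],
        # Embed the related orders, dropping the now-redundant foreign key.
        "orders": [
            {k: v for k, v in order.items() if k != "customer_id"}
            for order in orders
            if order["customer_id"] == customer["id"]
        ],
    }


def text_search(records: list[dict], term: str) -> list[dict]:
    """Return records whose serialised form contains the term (case-insensitive).

    Note: this naive search also matches field names, not only values.
    """
    needle = term.lower()
    return [r for r in records if needle in json.dumps(r).lower()]
```

Once each record carries its related data with it, a search or report no longer needs joins across the original source tables.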

2. Data lake

A Hadoop data lake functions as a central repository for data. Data is either transformed as required prior to ingestion or stored “as is,” eliminating the need for heavy extract, transform and load (ETL) processes. Data needed to drive the enterprise may be queried, text searched or staged for further processing by downstream NoSQL analytics, applications and systems.
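The “as is” path can be pictured as follows: records are stored untouched at ingestion, and structure is applied only when a query reads them. This is a simplified sketch that uses a Python list as a stand-in for lake storage; the ingest and query helpers are assumptions for illustration, not part of any Hadoop API.

```python
import json
from typing import Callable, Iterator


def ingest(lake: list[str], raw_line: str) -> None:
    """Store the record exactly as received, with no upfront transformation."""
    lake.append(raw_line)


def query(lake: list[str], field: str,
          predicate: Callable[[object], bool]) -> Iterator[dict]:
    """Parse records at read time and yield those whose field passes the test."""
    for line in lake:
        record = json.loads(line)
        # Records lacking the field are skipped rather than rejected at load.
        if field in record and predicate(record[field]):
            yield record
```

Records of different shapes can sit side by side in the same store, and each consumer decides at query time which fields it cares about.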

Data lakes also significantly reduce the high cost of interface management and data conversion between production systems. Both may be centralised by deploying the data lake as a data hub, which decouples customisations and point-to-point interfaces from production systems.

3. Information governance

Information governance defines how data is managed and accessed throughout its lifecycle. It is an essential component of any enterprise data management strategy, whether or not you are using a CDP.

Information Lifecycle Management (ILM) provides the necessary data governance control framework to meet risk and compliance objectives, and ensures that best practices for data retention and classification are deployed. ILM policies and business rules may be pre-configured to meet industry standard compliance objectives or custom designed to meet more specific requirements.

By establishing a data management policy for all enterprise data from creation through final deletion, ILM gives the organisation a single framework for managing risk and compliance across the entire data lifecycle.
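A retention rule of this kind might be expressed as in the sketch below. The classifications, retention periods and the legal_hold flag are hypothetical values chosen for illustration; real policies depend on the regulations that apply to each organisation.

```python
from datetime import date, timedelta

# Hypothetical retention periods per data classification, in days.
RETENTION_DAYS = {
    "financial": 7 * 365,
    "hr": 5 * 365,
    "log": 90,
}


def is_expired(classification: str, created: date, today: date,
               legal_hold: bool = False) -> bool:
    """Return True when a record has outlived its retention period.

    A record under legal hold (an assumed flag) is never eligible for deletion.
    """
    if legal_hold:
        return False
    retention = timedelta(days=RETENTION_DAYS[classification])
    return today - created > retention
```

For example, a log record created on 1 January 2024 would be eligible for deletion by 1 June 2024 under the 90-day rule above, unless it is under legal hold.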

4. Enterprise archive

Enterprise archiving improves the performance of production systems by relieving them of the burden of handling too much data. It distributes your organisation’s data into tiers based on age so that infrastructure performance and costs can be managed more efficiently.

The four tiers for enterprise archiving are the production tier (for active data), the partition tier and the database archive tier (for semi-active data), and the Hadoop tier (for less active or inactive data). By moving less frequently accessed data from production infrastructure to lower-cost commodity hardware platforms, dramatic savings are possible.

All archive tiers are designed to retain native access to data, so that it can be easily queried or text searched.
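Age-based tiering across the four tiers described above could be sketched like this. The age thresholds are purely illustrative assumptions; in practice they would be set by policy and access patterns.

```python
# Hypothetical upper age limits (in days) for each tier, checked in order.
TIER_LIMITS = [
    (365, "production"),         # active data
    (730, "partition"),          # semi-active data
    (1825, "database_archive"),  # semi-active data
]


def assign_tier(age_days: int) -> str:
    """Return the archive tier for a record of the given age in days."""
    for limit, tier in TIER_LIMITS:
        if age_days < limit:
            return tier
    # Anything older than the last limit goes to the lowest-cost tier.
    return "hadoop"
```

With these thresholds, a 30-day-old record stays in production, while a record more than five years old lands on the Hadoop tier.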

CDP versus Data Warehouse

What’s the difference between a Common Data Platform and the traditional data warehouse? Is CDP meant to be a replacement?

First, the differences. A CDP stores all types of data (structured, semi-structured, unstructured and raw), whereas a data warehouse typically stores only structured and processed data. Moreover, data warehouse architectures suffer from the limitations of a canonical, top-down schema design that describes the data. Data-driven organisations require more specifically defined data to service any number of processing requirements established by next-generation analytics or other production applications.

Storage-wise, a CDP is designed for low-cost, cloud-architected storage running on commodity hardware, whereas traditional enterprise data warehouse platforms typically run either in memory or on high-performance (and high-cost) storage arrays. In simpler terms, think of the data warehouse as a secure production environment where data may be easily and effectively interpreted by traditional users but, all too often, not by others who need the more specific views that a data-driven enterprise demands.

As a low-cost, bulk data storage solution, a CDP not only stores data less expensively but also offers an ideal development or staging environment to feed downstream systems for processing, structuring and analysis. Designed for different purposes, CDPs and data warehouses aren’t meant to replace each other; rather, they are meant to complement one another based on the types of data involved.

Why Common Data Platform?

A Common Data Platform built on Apache Hadoop is a highly scalable, efficient and low-cost foundation for managing data at petabyte scale, either on premises or in the cloud.

For comparison, a high-performance database machine with Extreme Flash (EF) for in-memory processing may cost $50,000 per terabyte. Meanwhile, the same amount of cloud storage costs a little over $20 per month, and Apache Hadoop is free and open source. Overall, a CDP can store and manage data at a fraction of the cost of a production tier.
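The comparison can be worked through with a few lines of arithmetic. The sketch below takes the two figures quoted above and makes simplifying assumptions: the $50,000 is treated as a one-time cost per terabyte, the cloud price is rounded to $20 per terabyte per month, and compute, network and operating costs are ignored.

```python
DB_MACHINE_COST_PER_TB = 50_000  # USD, one-time (assumed), figure quoted above
CLOUD_COST_PER_TB_MONTH = 20     # USD, rounded from "a little over $20"


def cloud_cost(terabytes: float, months: int) -> float:
    """Total cloud storage cost for a given capacity and duration."""
    return terabytes * months * CLOUD_COST_PER_TB_MONTH


def breakeven_months() -> float:
    """Months of cloud storage that one terabyte's database-machine cost buys."""
    return DB_MACHINE_COST_PER_TB / CLOUD_COST_PER_TB_MONTH
```

Under these assumptions, breakeven_months() gives 2,500 months, or roughly 208 years of cloud storage for the price of one terabyte on the database machine, and cloud_cost(1, 60) shows five years of storage for one terabyte costing $1,200.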

Enterprises that already have a data warehouse should consider complementing their production environments with a CDP for enterprise archiving, data lake and advanced analytics, all of which are vital to supporting a data-driven enterprise.

John Ottman, Executive Chairman of Solix Technologies, Inc.