Enterprises must be agile. As the pandemic demonstrated, they need to be able to change their objectives and goals, and the operating processes and decision-making that support them, as quickly as possible.
Businesses need to be able to find and use the analytical data and assets that support the strategic and tactical decisions they have to make daily. But how do they achieve the nirvana of complete and unrestricted access to analytical data? The answer is through a data fabric.
What is a data fabric?
The term ‘data fabric’ was coined by Forrester analyst Noel Yuhanna in a 2016 report and has since been widely adopted by vendors and other analyst firms. But while the name might be new, the objective behind it isn’t: an architecture that includes all forms of analytical data, supports any type of analysis, and can be accessed and shared seamlessly across the entire enterprise.
A data fabric provides a better way to handle enterprise data, giving controlled access to data and separating it from the applications that create it. This is designed to give data owners greater control and make it easier to share data with collaborators.
According to Gartner, there are four key pillars in a data fabric architecture:
- The data fabric must collect and analyze all forms of metadata.
- It must convert passive metadata to active metadata.
- It must create and curate knowledge graphs.
- It must have a robust data integration backbone that supports all types of data users.
What can a data fabric deliver?
The goal of data fabric is for users to be able to quickly and easily access the data they need and analyze it. To achieve this, they need a data catalog function. The data catalog provides a repository for all technical metadata, a business glossary, data dictionary and governance attributes.
It acts as an easy-to-use entry point that employs non-technical language to let users view quickly what data is available and what analytical assets exist (for example, reports, visualizations, advanced predictive and other models).
If the catalog tells them the data they require is not available, they can submit a request to technical personnel to bring that data into the environment. Once they have permission to access the information, they should be able to use it to make decisions, either by creating their own analytical asset with the data or by tweaking an existing asset to fit their needs.
Once the analysis is complete, users should be able to continue to examine data and assets in their area or find other information by returning to the catalog.
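To make the catalog workflow above concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any real catalog product: the `CatalogEntry` and `DataCatalog` classes, their fields and the sample entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """One dataset or analytical asset, described in business terms (hypothetical shape)."""
    name: str
    kind: str            # e.g. "dataset", "report", "model"
    description: str     # non-technical wording, as in a business glossary
    owner: str
    tags: tuple = ()


class DataCatalog:
    """Toy catalog: register entries, search them, and log requests for missing data."""

    def __init__(self):
        self._entries = {}
        self.access_requests = []   # (user, search term) pairs for technical staff

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, term):
        term = term.lower()
        return [
            e for e in self._entries.values()
            if term in e.name.lower()
            or term in e.description.lower()
            or term in (t.lower() for t in e.tags)
        ]

    def find_or_request(self, user, term):
        """Return matching entries; if nothing matches, record a request instead."""
        hits = self.search(term)
        if not hits:
            self.access_requests.append((user, term))
        return hits


catalog = DataCatalog()
catalog.register(CatalogEntry("weekly_sales", "dataset",
                              "Sales totals by store and week", "finance", ("sales",)))
catalog.register(CatalogEntry("churn_model", "model",
                              "Predicts which customers are likely to leave",
                              "marketing", ("customers", "churn")))

found = catalog.find_or_request("ana", "sales")      # matches weekly_sales
missing = catalog.find_or_request("ana", "weather")  # nothing matches, so a request is logged
```

In this sketch a failed search does not dead-end the user: it leaves a request behind, which mirrors the loop described above in which technical personnel are asked to bring missing data into the environment.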
How is this achieved?
Making data access and analysis easier for users frequently makes the infrastructure behind it more complicated. From a technical perspective, this means the people who build and maintain the data fabric need to focus on a number of issues. For example, they can avoid the duplication of data and analytical assets by ensuring they know what already exists in the environment.
They need to be able to use the data catalog to rapidly ascertain whether the requested data already exists. If it does, they may only need to update the catalog and notify the user that it is there. The data catalog needs to be updated with any additions, edits, or changes made to the data fabric, its data or its analytical assets. Data lineage and usage must be continuously monitored.
Data modeling supplies much of the information found in the data catalog, including changes to database design, the existence of data and its location, definitions and other glossary items. It is vital that data models are connected to the business glossary to ensure a well-managed data catalog.
The data fabric relies on three different analytical components: the enterprise data warehouse (EDW), the investigative computing platform (ICP) and the real-time (RT) analysis engine. Data integration, which means extracting data from sources and transforming it into a single version that is loaded into the EDW, is a key component in the creation of analytical data for the EDW. This ETL (extract/transform/load) or ELT (extract/load/transform) process creates the trusted data used in producing reports and analytics.
The advantage of ELT is that data is extracted and loaded into the warehouse directly, and the transformation logic is then applied inside the warehouse. Modern warehouses are typically far more powerful than ETL engines, so they can complete the transformation work far more quickly. In addition, ELT is designed to handle all types of data, including unstructured data in data lakes.
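The ELT pattern can be sketched in a few lines. The example below uses Python's built-in `sqlite3` module as a stand-in for a real warehouse, and the table names and sample rows are invented for illustration. The point is only the ordering: raw data lands first, and the transformation runs afterwards as SQL inside the database engine.

```python
import sqlite3

# Hypothetical source rows, as they might arrive from an operational system.
# Amounts are still text at this stage; nothing has been cleaned yet.
source_rows = [
    ("2021-03-01", "north", "120.50"),
    ("2021-03-01", "south", "80.00"),
    ("2021-03-02", "north", "99.50"),
]

warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse

# Extract + Load: land the data untouched in a raw table.
warehouse.execute(
    "CREATE TABLE raw_orders (order_date TEXT, region TEXT, amount TEXT)"
)
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", source_rows)

# Transform: the warehouse engine does the casting and aggregation in SQL.
warehouse.execute(
    """
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(CAST(amount AS REAL)) AS total_amount
    FROM raw_orders
    GROUP BY region
    """
)

totals = warehouse.execute(
    "SELECT region, total_amount FROM sales_by_region ORDER BY region"
).fetchall()
print(totals)  # [('north', 220.0), ('south', 80.0)]
```

In an ETL pipeline, the casting and aggregation would instead happen in a separate engine before anything reached the warehouse.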
For the ICP (or ‘data lake’), raw data is extracted from sources and reformatted, integrated and loaded into the repository for exploration or experimentation. This repository is used for data exploration, data mining, modeling, cause and effect analyses and general, unplanned investigations of data.
Data virtualization is another technology that underpins the data fabric because it removes the requirement to move data physically around the architecture by providing it virtually. The ability to provide access to all data, regardless of its location, is a major step toward what is sometimes referred to as “data democratization”.
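As a rough illustration of the idea, the sketch below uses SQLite's `ATTACH DATABASE` so that one connection can query two separate databases in place. The `crm` and `billing` databases, their tables and the `revenue_by_customer` helper are hypothetical stand-ins for separate physical systems. Commercial data virtualization tools do far more (query pushdown, caching, security), so treat this as a toy model of the access pattern only.

```python
import sqlite3

# One connection acts as the virtual access layer; each attached database
# stands in for a separate physical system (names are hypothetical).
hub = sqlite3.connect(":memory:")
hub.execute("ATTACH DATABASE ':memory:' AS crm")
hub.execute("ATTACH DATABASE ':memory:' AS billing")

hub.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
hub.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
hub.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                [(1, "Acme"), (2, "Globex")])
hub.executemany("INSERT INTO billing.invoices VALUES (?, ?)",
                [(1, 100.0), (1, 50.0), (2, 75.0)])


def revenue_by_customer(conn):
    """Join the two sources at query time; no rows are copied between them."""
    return conn.execute(
        """
        SELECT c.name, SUM(i.amount) AS revenue
        FROM crm.customers AS c
        JOIN billing.invoices AS i ON i.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
        """
    ).fetchall()


revenue = revenue_by_customer(hub)
print(revenue)  # [('Acme', 150.0), ('Globex', 75.0)]
```

The user of `revenue_by_customer` never needs to know that customers and invoices live in different places, which is the property that makes virtualization useful for data democratization.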
Real-time analysis is a relatively new area of analytics focused on analyzing the data streaming into the organization before it is stored. It is a significant addition to the range of analytical components in the data fabric.
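A simple way to picture analysis on data in motion is a rolling calculation applied to each event as it arrives, before anything is written to storage. The sketch below is a minimal example under that assumption; the `rolling_alerts` function, the window size and the threshold are illustrative choices, not part of any particular streaming product.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def rolling_alerts(
    readings: Iterable[float], window: int, threshold: float
) -> Iterator[Tuple[float, float, bool]]:
    """Yield (reading, rolling_mean, alert) for each event as it arrives."""
    recent = deque(maxlen=window)   # keeps only the last `window` readings
    for value in readings:
        recent.append(value)
        mean = sum(recent) / len(recent)
        yield value, mean, mean > threshold


# A finite iterator stands in for an unbounded event stream.
stream = iter([10.0, 12.0, 30.0, 40.0, 11.0])
results = list(rolling_alerts(stream, window=3, threshold=20.0))
alerts = [value for value, _mean, alert in results if alert]
print(alerts)  # [40.0, 11.0]
```

The reading of 11.0 is flagged because the rolling mean is still above the threshold, which shows how a windowed view differs from checking each value in isolation.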
Usage statistics found in the data catalog are often created by monitoring technologies in the EDW and ICP. Monitoring who is using data and what data is being used provides an insight into the overall performance of analytical repositories. For example, data that is rarely used can be stored in archive media. Spikes in utilization can be planned for and data frequently used together can be cached or brought together virtually for better performance.
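The kind of decisions described above can be derived from a plain access log. The following sketch assumes a hypothetical log format (a user plus the datasets touched by one query) and an arbitrary archive threshold; real monitoring tools would capture far richer detail.

```python
from collections import Counter
from itertools import combinations

# Hypothetical access log: (user, datasets touched in one query).
access_log = [
    ("ana", ["orders", "customers"]),
    ("ben", ["orders", "customers"]),
    ("ana", ["orders", "products"]),
    ("cy", ["orders", "customers"]),
]
all_datasets = ["orders", "customers", "products", "legacy_returns"]

# How often is each dataset used?
use_counts = Counter(name for _user, names in access_log for name in names)

# Which datasets are used together in the same query?
pair_counts = Counter(
    pair
    for _user, names in access_log
    for pair in combinations(sorted(set(names)), 2)
)

ARCHIVE_BELOW = 1  # assumed threshold: never accessed in the logged period
archive_candidates = [d for d in all_datasets if use_counts[d] < ARCHIVE_BELOW]
most_common_pair, pair_hits = pair_counts.most_common(1)[0]

print(archive_candidates)            # ['legacy_returns']
print(most_common_pair, pair_hits)   # ('customers', 'orders') 3
```

Here `legacy_returns` becomes a candidate for archive media, while `customers` and `orders` are candidates for caching or for being brought together virtually.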
Finally, databases cannot be overlooked as important components of the data fabric environment. Previously, the data warehouse and the investigative area were separated because they used incompatible technologies. It is now possible, with data storage being separated from computing, for the data warehouse and ICP to be deployed on the same storage technology.
Data fabric is a reality
If data fabric is to succeed, the organization needs to maintain the integrity of the architectural standards and components it is built on. If silos are created temporarily as workarounds, they need to be decommissioned when they are no longer needed.
The value of the data fabric depends on the strength of the information gathered in the data catalog. Out-of-date, stale, or inaccurate metadata mustn’t be allowed to leak into the catalog.
Legacy analytic components should be reviewed and, where necessary, redesigned. Deploying them in the data fabric as they are may be convenient, but it could cause problems when they are integrated with the rest of the fabric.
These are all issues that can be addressed and overcome. The technologies already exist to build the data fabric and there are a number of suppliers that can provide many of the components that stitch it together. It may only be five years since data fabric came into existence as a term but it is already very much a concrete reality.
Rob Mellor, VP and GM, WhereScape