The traditional data transformation procedure of extract, transform and load (ETL) is rapidly being turned on its head in a modern twist enabled by cloud technologies.
The Cloud’s lower costs, its flexibility and scalability, and the huge processing capability of cloud data warehouses, have driven a major change: the ability to load all data into the cloud, before transforming it. This trend means that ETL itself has been transformed—into extract, load and transform, or ELT.
ELT offers several advantages, including retention of data granularity, reduced need for expensive software engineers and significantly reduced project turnaround times.
Data is vital for organizations, who use it to understand their customers, identify new opportunities and support decision-makers with mission-critical and up-to-date information. However, to analyze data, it must first be structured. It needs to be understood so that it can be pulled into dashboards, reports and predictive models.
The problem is that raw data doesn’t present as beautifully formatted, usable information. That is where data transformation comes in. Messy raw data needs to be transformed into representations of reality that help users accomplish specific goals.
This transformation can take place either before the data is loaded to its destination, usually a data warehouse, or afterwards
In traditional ETL, data is transformed into analysis-ready data models before it is loaded. As Charles Wang of Fivetran notes, “combining transformation with loading into the same step can preserve storage and compute resources, but introduces a great deal of brittleness into the data engineering workflow. This also means that the software used for transformations is typically written using scripting languages such as Python and Java. In addition, transformations in ETL may require a great deal of complex orchestration using tools such as Airflow.”
ETL often also entails a great deal of customized code. One of the main challenges of traditional ETL is therefore accessibility. Scarce, costly resources such as engineers and data scientists need to be involved.
Another issue concerns turnaround times. Traditional ETL procedures associated with on-premise data warehouses are usually extremely time-consuming. Using ETL also involves constant maintenance and can introduce complexity.
Modern approaches to transformation:
Storage has traditionally been prohibitively expensive. The benefit of ETL for organizations was that they did not have to load all their data to the final destination. That has now been changed by cloud technologies. We are seeing a massive increase in cloud adoption in South Africa and the costs of technology are decreasing significantly. Lower costs make it possible for organizations to load all their data to the cloud, without having to be as mindful of storage costs.
This means that in the modern ELT workflow, raw data is transformed into analysis-ready data models after it has been loaded. Once in the warehouse, data can be transformed using SQL, which, thanks to its intuitive English-based syntax, can be used by a far wider range of users. Transformation can therefore be done by SQL-literate members of the organization and not only by those with coding knowledge.
Data transformation today thus leverages cloud-based tools and technologies. These collectively make up what is referred to as the modern data stack (MDS).
Central to this MDS is a powerful cloud data platform, usually a cloud warehouse which can also include data lakes. Data is loaded into it from a variety of source systems including databases, web applications and APIs. To do this, a reliable transformation layer is used to convert raw data into query-ready datasets. Lastly, a collaborative business intelligence and visualization solution enables the business to interact with the data and draw actionable insights to guide business decisions.
In his article called Data Transformation Explained, Wang points out that the MDS funnels data through the following stages:
- Sources – data from operational databases, SaaS apps, event tracking
- Data pipeline – extracts data from sources and loads it into the data warehouse, sometimes normalizing it
- Data warehouse – stores data in a relational database optimized for analytics
- Data transformation tool – an SQL-based tool that uses data from the source to create new data models within the data warehouse
- Analytics tool – tools for generating reports and visualizations, such as business intelligence platforms
Transformation within the data warehouse:
Transformations are tailored to produce the specific data models organizations need for analytics. Modern ELT separates extraction and loading from transformation. This makes it possible for companies to automate and outsource the extraction and loading stages of the data integration process. They can then use a dedicated SQL-based transformation tool once the data is already in the warehouse.
A major advantage of ELT is that data essentially remains in granular form because it has not undergone major transformation prior to being loaded. With traditional ETL, an organization may have aggregated specific data before loading, thus losing its original granularity completely.
The new ELT architecture also offers significant performance, flexibility, and cost advantages. Loading is fast, and organizations can save all their data in the data warehouse, even that which they may not currently need.
“Roughly speaking, transformed data models within the data warehouse can be views or materialized views,” notes Wang. He goes on to explain that every time someone accesses a view, the data warehouse runs a query to return the relevant data. These views are not stored. “In an ideal world with zero latency and unlimited computational resources, all transformations would simply be views,” he adds.
By contrast, materialized views are stored on disk because views generated on the fly from a large table or complex query can cause data warehouses to choke.
ELT should probably be referred to as EtLT in most cases, as some light-duty transformation, or normalization, is often carried out before the data is loaded. This removes redundancies, duplicates and derived values. It also organizes tables from the data into the clearest possible set of interrelations so that analysts can easily interpret the underlying data model of the source app, and construct new analysis-ready data models accordingly.
“The outputs of the extraction and loading pipelines must be standardized if outsourced, automated ELT is to work,” says Wang. “To correctly normalize the data from a source, you need a keen understanding of the source application’s underlying functionality and data model. The best way to circumvent this challenge is to outsource extraction and loading to a team that has extensive experience with data engineering for that particular source.”
- Check out our take on the best cloud hosting services at the moment
IT Experts, Keyrus