Recent developments within the cloud vendor ecosystem have fundamentally changed how companies buy, deploy, and run data systems and the applications they enable. As companies continue to adopt public cloud deployment, cloud vendors have absorbed more data ingest, back-end storage, and transformation technologies into their core offerings, and are now highlighting their analytics, data pipeline and modelling tools in order to offer customers a more comprehensive, end-to-end portfolio.
This is great news for those who are deploying, migrating, or upgrading data systems, as they can now focus on generating business value from their data rather than funnelling manpower into supporting hardware and infrastructure. Even so, the data engineering side of a big data programme remains complex and tough, and requires highly skilled data engineers to manage the problem solving and optimisation.
As technology becomes simpler and more straightforward to deploy, in tandem with advancements from cloud vendor data services, it is easier than ever to build data-centric applications and provide optimised data tools for the enterprise. But with the market awash with fierce competition and a rapid pace of innovation from vendors, which is the right combination of services as you build your data pipelines? It can be a bewildering array of choices.
Even before a company migrates its current on-premises data, it must determine which cloud services it will need, what resources are required, and what dependencies exist, so that applications meet SLAs and cost targets.
This can be a confusing and daunting challenge. To provide clarity amidst the chaos, it makes sense to break the problem down into the individual stages of the cloud lifecycle and then match each stage to the right services. There are five stages that sit between data sources and data consumers: capture, store, transform, publish, and consume.
Azure, AWS and Google Cloud Platform (GCP) each have a comprehensive portfolio of cloud offerings based on their core networking, storage, compute and application services. In addition, they provide vertical offerings for many markets, and each provides multiple offerings within the big data systems and ML/AI categories. For ease, here are the five key stages to consider in the big data lifecycle when choosing which cloud provider and service to work with:
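One way to make this stage-by-stage matching concrete is a simple checklist that records which services have been chosen for each stage and reports the stages still uncovered. The following is only a minimal sketch: the `Stage` enum, the `missing_stages` helper and the service labels are hypothetical placeholders, not vendor product names.

```python
from enum import Enum
from typing import Dict, List


class Stage(Enum):
    """The five stages between data sources and data consumers."""
    CAPTURE = 1
    STORE = 2
    TRANSFORM = 3
    PUBLISH = 4
    CONSUME = 5


def missing_stages(plan: Dict[Stage, List[str]]) -> List[Stage]:
    """Return the lifecycle stages that have no service assigned yet."""
    return [stage for stage in Stage if not plan.get(stage)]


# Hypothetical, partially completed plan; the labels are placeholders only.
plan = {
    Stage.CAPTURE: ["batch-transfer-service"],
    Stage.STORE: ["object-store"],
    Stage.PUBLISH: ["data-warehouse"],
}

gaps = missing_stages(plan)
print([stage.name for stage in gaps])  # prints ['TRANSFORM', 'CONSUME']
```

A checklist like this does not choose services for you, but it makes it obvious when a migration plan has skipped a stage.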
The first step in any big data system, capture, includes ingestion of both batch and streaming data. Cloud vendors provide many tools for bringing large batches of data into their platforms; this can include database migration or replication, processing of transactional changes, and physical transfer devices when data volumes are too big to send efficiently over the internet. Batch data transfer is the preferred method for moving on-premises data sources and shifting data from internal business applications, but streaming technologies are being rapidly deployed to support real-time data applications.
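The difference between the two capture modes can be illustrated with a small sketch. It assumes an in-memory list as a stand-in for a cloud landing zone; `capture_batch`, `capture_stream` and the sample records are hypothetical names invented for illustration, not part of any vendor's API.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, int]


def capture_batch(records: List[Record], sink: List[Record]) -> int:
    """Load a complete, bounded set of records in one operation."""
    sink.extend(records)
    return len(records)


def capture_stream(events: Iterable[Record], sink: List[Record]) -> Iterator[int]:
    """Append events one at a time, yielding the running count after each."""
    for count, event in enumerate(events, start=1):
        sink.append(event)
        yield count


# A plain list stands in for the landing zone in the cloud (an assumption).
landing_zone: List[Record] = []

# Batch: a nightly export arrives as one bounded data set.
nightly_export = [{"order_id": 1}, {"order_id": 2}, {"order_id": 3}]
loaded = capture_batch(nightly_export, landing_zone)

# Streaming: events arrive individually and are available as soon as they land.
live_events = ({"order_id": n} for n in range(4, 6))
running_counts = list(capture_stream(live_events, landing_zone))
```

In the batch case nothing is visible to downstream stages until the whole export has loaded, whereas in the streaming case each event is available immediately, which is what real-time data applications rely on.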
The store stage of the big data lifecycle focuses on the concept of a data lake, a single location where structured, semi-structured and unstructured data and objects are all stored together. To improve data access and analytics performance, it is important for the data in the data lake to be highly aggregated and placed into high-performance data warehouses or similar large-scale databases. Cloud vendors have recently shone a spotlight on the data lake by adding functionality to their object stores and creating much tighter integration with their transform and consume service offerings.
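As a rough illustration of aggregating data lake contents into a warehouse, the sketch below treats a few JSON lines as raw objects and loads per-region totals into a table. It rests on stated assumptions: an in-memory SQLite database stands in for a cloud data warehouse, and the records, table name and column names are invented for the example.

```python
import json
import sqlite3

# Hypothetical raw objects as they might sit in a data lake (JSON lines).
raw_objects = [
    '{"region": "EMEA", "amount": 120.0}',
    '{"region": "EMEA", "amount": 80.0}',
    '{"region": "APAC", "amount": 50.0}',
]

# Aggregate the raw records: total amount per region.
totals = {}
for line in raw_objects:
    record = json.loads(line)
    totals[record["region"]] = totals.get(record["region"], 0.0) + record["amount"]

# An in-memory SQLite database stands in for a cloud data warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_by_region (region TEXT PRIMARY KEY, total REAL)"
)
warehouse.executemany(
    "INSERT INTO sales_by_region VALUES (?, ?)", sorted(totals.items())
)
warehouse.commit()

# Consumers query the small aggregated table instead of scanning raw objects.
rows = warehouse.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
print(rows)  # prints [('APAC', 50.0), ('EMEA', 200.0)]
```

The raw objects stay in the lake for anyone who needs full detail, while the aggregated table serves the frequent, performance-sensitive queries.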
Transform is the stage where value is created, as insight is derived from the big data sets. Cloud vendors have responded to the importance of this stage with a growing range of transformation services. It can be difficult to decide which tool to use: all three of the big cloud vendors have versions of Spark/Hadoop that scale on their Infrastructure-as-a-Service (IaaS) compute nodes, and all three now also provide serverless offerings that make it much simpler to build and deploy data pipelines for batch, machine learning and streaming workflows.
In addition, they now all provide end-to-end tools to build, train, and deploy machine learning models quickly. This includes data preparation, algorithm development, model training, and deployment tuning and optimisation.
Once through the initial stages of the lifecycle, it is necessary to publish a quality output for users and applications to consume. This often comes in the form of data warehouses, data catalogues or real-time stores. Warehouse solutions are abundant in the market, and the choice depends on the data scale and complexity as well as performance requirements. Cloud vendor examples include Amazon Redshift, Google BigQuery, and Azure SQL Data Warehouse.
One thing to note is that cloud vendor data catalogue offerings are still relatively immature and many companies have started to build their own or use third-party catalogues.
The true value of any big data system comes together in the hands of the technical and non-technical consumers using data-centric applications and products. There are three principal models of this stage to consider when evaluating cloud services: advanced analytics, business intelligence (BI) and real-time application programming interfaces (APIs).
Advanced analytics users consume both raw and processed data, either directly from the data lake or from a data warehouse, and use tools similar to those in the transform stage, such as Spark and Hadoop-based distributed compute. BI tools have been optimised to work with larger data sets and directly in the cloud, and each of the three cloud vendors now provides BI tools optimised to work within its stack. Applications, products, and services also consume raw and transformed data through APIs built on the real-time store or predictive ML models.
Wrapping the cloud up neatly
Wherever your business is on its cloud adoption and workload migration journey, now is the time to start or accelerate your strategic thinking and execution planning for cloud-based data services. It’s a buyer’s market. However, as migration goes from planning to reality, it is vital to invest in the critical skills, technology and process changes needed to establish a data operations centre of excellence. The high frequency of technology innovation, the explosion in data volumes, and the complexity of managing multiple cloud services mandate a unified approach to managing your data pipelines end-to-end, with enough flexibility to accommodate constant change and increasing scale.
As digital transformation accelerates, cloud services play a critical role in giving businesses the agility and competitive edge to meet their requirements. A business must understand which stage of the cloud journey it has reached, and which services it will need and when, so that its architectural and operational decisions are future-proof. After all, it’s what the business needs, and it’s key to high business performance.
Kunal Agarwal, CEO and co-founder, Unravel Data