How has (open source) data processing evolved and how have the different technologies progressed over time as data processing frameworks have become more sophisticated and the amount, and speed, of data generated has increased by the hour?
Let us try to answer the following two questions: How can we process data, and what are the data processing systems available to us today? Why do we process data? That’s pretty obvious when you consider the huge amount of connected devices, sensors and website visits. Not to mention all of the data generated by humans and machines. It is clear that data processing has been around ever since we invented computers and had access to data.
In the beginning...
The invention of computers created a clear need for information and data processing. During these very early days, computer scientists had to write custom programs for processing data and these were most likely stored on a punch card. The next steps brought assembly language and more purposeful programming languages like Fortran, followed by C and Java. During the prehistoric big data space, software engineers would use these languages to write purpose-built programs for specific data processing tasks. However, this data processing paradigm was only accessible to a select few that had a programming background which prevented wider adoption by data analysts or the wider business community who wanted to process information and make specific decisions.
The next natural step saw the invention of the database, in and around the 1970s. Traditional relational database systems, such as IBM’s databases, enabled SQL and increased the adoption of data processing by wider audiences. SQL is a standardised and expressive query language that reads somewhat like English. It allowed more people access to data processing who thereby no longer had to rely on programmers to write special case-by-case programs and analyse data. SQL also expanded the number and type of applications relevant to data processing such as business applications, analytics on churn rates, average basket size, year-on-year growth figures, etc.
Dawn of big data
The era of Big Data began with the MapReduce paper, released by Google, that explains a simple model based on two primitives - Map and Reduce. These primitives allowed for parallel computations across a massive number of parallel machines. Obviously, parallel computations were possible even before the MapReduce era via multiple computers, supercomputers and MPI systems. However, MapReduce made it available to a wider audience.
Apache Hadoop came next as an open-source implementation of the framework (initially implemented at Yahoo!) that was widely available in the open source space and accessible to a wider audience. Hadoop was adopted by a variety of companies and many Big Data players had their origins within the Hadoop framework. Hadoop brought about a new paradigm in the data processing space: the ability to store data in a distributed file system or storage (such as HDFS for Hadoop) which could then be interrogated / queried at a later point. Hadoop ploughed a similar path to relational databases whereby the first step included custom programming by a specific “cast” of individuals who were able to write programs to then implement SQL queries on data in a distributed file system, such as Hive or other storage frameworks.
Batch processing gets ramped up
The next step in Big Data saw the introduction of Apache Spark. Spark allowed additional parallelisation and brought batch processing to the next level. As mentioned earlier, batch processing includes putting data into a storage system that you then schedule computations on. The main concept here is that your data sits somewhere while you periodically (daily, weekly, hourly) run computations to glean results based on past information. These computations don’t run continuously and have a start point and an endpoint. As a result, you have to re-run them on an ongoing basis for up-to-date results.
From Big Data to Fast Data - the advent of stream processing
This next step in the Big Data evolution saw the introduction of stream processing with Apache Storm being the first widely used framework (there were other research systems and frameworks at the same time but Storm was the one to see increased adoption). This framework enabled programs to be written that could run continuously (24/7). Contrary to the batch processing approach where programs and applications have a beginning and an end, with stream processing programs run continuously on data and produce outcomes in real-time, while the data is generated. Stream processing was further advanced with the introduction of Apache Kafka (originated with LinkedIn) as a storage mechanism for a stream of messages. Kafka acted as a buffer between data sources and the processing system (like Apache Storm).
Lambda Architecture created a slight detour in the story of Big Data. This architecture originated because the initial adopters of stream processing did not believe that stream processing systems like Apache Storm were reliable enough; therefore they kept both systems (batch and stream processing) running simultaneously. Lambda Architecture was a combination of both systems - a stream processing system like Apache Storm was utilised for real-time insights but then the architecture periodically used a batch processing system that maintained the ground truth of what had happened.
Apache Flink - stream processing becomes accessible
Around 2015, Apache Flink started becoming a prominent stream processing framework adopted by developers and data / analytics leaders. Right from the start, Flink exhibited very strong guarantees; exactly-once semantics and a fault-tolerant processing engine that made users believe that the Lambda architecture was not necessary anymore and that stream processing could be trusted for complex event processing and continuous-running, mission-critical applications. All the overhead that came with building and maintaining two systems (batch / stream processing) became redundant due to Flink’s reliable and accessible data processing framework.
Stream processing introduced a new paradigm and shift in mindset from a request-response stance, where data is stored before potential fraud case interrogation to one where you ask questions first and then get the information in real-time as the data is generated. For example, with stream processing you can develop a fraud detection program that is running 24/7. It gets events in real time and gives you insight when there is credit card fraud, preventing it from actually happening in the first place. This is potentially one of the bigger shifts in data processing because it allows real-time insights into what is happening in the world.
The evolution of open source data processing has experienced a common pattern; a new framework is introduced to the market (i.e. a relational database, batch processing, stream processing) that is initially available to a specific audience (programmers) who can write custom programs to process data. Then comes the introduction of SQL in the framework that makes it widely accessible to audiences that don’t need to write programs for sophisticated data processing. Stream processing follows a similar pattern; SQL for stream processing experiences a wide adoption in streaming applications which validates the pattern we experienced in the past. The stream processing market is expected to grow exponentially in the coming years at a CAGR of 21.6 per cent. With this growth and the number of stream processing applications and use cases exploding by the day, the developments in this space are numerous and the future of stream processing an ever-changing and evolving environment.
Aljoscha Krettek, Co-founder & Engineering Lead, Veverica
Image source: Shutterstock/alexskopje