The impact of Covid-19 on Big Data

Big Data has been touted as a potential panacea for the global Covid-19 pandemic. But the technology needs to evolve to meet the demands of this crisis.

Big data is unstructured, arriving in tremendous volume, variety and velocity from heterogeneous and inconsistent sources. And while extract, transform, load (ETL) processes are used to structure and warehouse the data in a way that enables meaningful modelling and analysis, tools such as Spark and Hadoop require specialist engineers to manually tune various aspects of the pipeline - a slow and costly process.
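As a sketch of the pattern, the toy pipeline below normalises records from two inconsistent sources into a single schema. It is plain Python standing in for a Spark or Hadoop job, and the sources, field names and schema are invented for illustration:

```python
from datetime import date

# Heterogeneous "sources": the same information, but with
# inconsistent field names and types (hypothetical data).
source_a = [{"patient": "p1", "tested": "2020-03-01", "result": "positive"}]
source_b = [{"id": "p2", "test_date": date(2020, 3, 2), "positive": False}]

def extract():
    """Pull raw records from each source, tagged with their origin."""
    yield from (("a", r) for r in source_a)
    yield from (("b", r) for r in source_b)

def transform(tagged):
    """Normalise one raw record into a shared schema."""
    source, r = tagged
    if source == "a":
        return {"id": r["patient"],
                "date": date.fromisoformat(r["tested"]),
                "positive": r["result"] == "positive"}
    return {"id": r["id"], "date": r["test_date"], "positive": r["positive"]}

def load(rows, warehouse):
    """Append structured rows to the 'warehouse' (here, just a list)."""
    warehouse.extend(rows)
    return warehouse

warehouse = load((transform(t) for t in extract()), [])
```

In a production pipeline each stage is distributed, and the manual tuning the article refers to - partitioning, memory allocation, join strategies - sits around exactly these three steps.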

Moreover, solving the problem of modelling and analysis through ETL pipelines requires the use of data science, machine learning and scientific computing, which are extremely performance intensive. The solutions typically revolve around supercomputing or high-performance computing (HPC) approaches.

Big data and HPC in the age of cloud computing

The first generation of cloud, where big data began, was about throwing cheap commodity hardware at a large data problem. The applications tended not to be computationally intensive (processor bound), but rather data intensive (disk/memory bound). Interest in making optimal use of processor and interconnect was at best a second-class concern.

Although the big data ecosystem has since made inroads into performance-based computing, limitations remain in the technological approach. The tools tend to be Java-based and lack bare-metal performance - as well as the predictable execution that is required to make performance guarantees in a large system.

Approaches such as MPI were built in an era when the resources of a given supercomputer were known ahead of time, and were time-shared. The supercomputer was in demand for a pipeline of highly tuned and specialised problems to be serviced over its lifetime. Algorithms were carefully tuned to make optimal use of the available hardware.

Big data technologies are designed to take a more genericised approach, not requiring careful optimisation on the hardware, but they still remain complex and require teams with specialist skills to build a specific set of algorithms at a specific scale. Scaling beyond a given implementation, or adding additional algorithmic capability, requires further reengineering and projects can take several years. The infrastructure costs become massive.

Rethinking the computing model

The inexorable future of computing is the cloud, and its evolutionary manifestations: edge computing, high-speed interconnect and low-latency/high-bandwidth communications. Powerful and capable hardware will be made available on demand, applications will run the gamut from big data/small compute to small data/big compute and, inevitably, big data/big compute.

Therefore, a more effective approach to building large-scale systems is through an accessible HPC-like technology that is designed from first principles and capable of harnessing the cloud. The cloud offers the benefit of on-demand availability and ever-improving processors and interconnect.

However, such a landscape requires a radical rethink in order to unlock and exploit the true power of computing. Truly harnessing the power of the cloud requires a scale-invariant model for computing, one that can build algorithms and run them at an arbitrary scale, whether on the process axis (compute) or the memory axis (data).

The opportunity lies in building a model that allows programs to be distribution and location agnostic - applications that dynamically scale based on runtime demand, whether to handle a vast influx of data in real time or to crunch enormous matrices and tensors to unlock some critical insight.
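A minimal, single-machine sketch of what "scale invariant" means in practice: the same code runs unchanged whether the input is tiny or large, with the worker count derived from the workload at runtime rather than from a hand-tuned cluster configuration. This uses Python's standard thread pool purely as an illustration; a real platform would distribute the same pattern across machines:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def analyse(chunk):
    # Stand-in for a compute kernel (e.g. a matrix block or tensor slice).
    return sum(x * x for x in chunk)

def run_at_scale(data, chunk_size=4):
    """Same algorithm at any scale: split the data, size the worker
    pool from the runtime workload, and combine partial results."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    workers = min(len(chunks), os.cpu_count() or 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(analyse, chunks))

run_at_scale(range(8))       # small data, few workers → 140
run_at_scale(range(10_000))  # larger data, more workers - same code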

Such a model ensures a developer can write algorithms without worrying about scaling, infrastructure or devops concerns. Just as a programmer, scientist or machine learning expert can today build a small data/small compute model on a laptop, they will be able to run that model at arbitrary scale in a data centre without the impediments of team size, manual effort and time. The net result is that users ship faster, in smaller teams, and at lower cost. Moreover, the need for a national supercomputer is further diminished, as engineers are able to take massive datasets and crunch them with the most compute-intensive algorithms, all on the democratised hardware of the cloud.

Applying big data applications to Covid-19

The impact of Covid-19 has drawn international attention to the role technology can play in understanding the virus's spread and impact, and the mitigating steps we can take.

There are currently a number of models and simulations being used to address the impact of virus transmission, whether it's the spread from person to person, how the virus transmits within an individual, or a combination of the two. However, real-time simulation, and even non-real-time but massive simulation, is an incredibly complicated compute problem. The big data ecosystem is not remotely equal to the task. The solution requires not just a supercomputing approach, but must also solve the dynamic scalability problem - which is not the province of supercomputers.
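To give a sense of the simplest end of this modelling spectrum, the sketch below iterates a discrete SIR (susceptible-infected-recovered) compartmental model, a standard textbook model of person-to-person spread. The transmission and recovery rates are illustrative values, not fitted parameters. Real epidemic simulations extend this kind of kernel to millions of interacting individuals, which is where the compute demands explode:

```python
def sir_step(s, i, r, beta=0.3, gamma=0.1):
    """One day of a discrete SIR model over population fractions.
    beta = transmission rate, gamma = recovery rate (illustrative only)."""
    new_infections = beta * s * i   # contacts between susceptible and infected
    new_recoveries = gamma * i      # infected individuals recovering
    return (s - new_infections,
            i + new_infections - new_recoveries,
            r + new_recoveries)

# Start with 1% of the population infected and run the model forward.
s, i, r = 0.99, 0.01, 0.0
for day in range(160):
    s, i, r = sir_step(s, i, r)

print(f"final susceptible={s:.3f} recovered={r:.3f}")
```

Even this three-line kernel hints at the cost structure: an agent-based or within-host version replaces each scalar update with computation over every individual, every day, for every scenario explored.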

It requires a platform that is both big data capable and big compute capable. It must leverage the cloud to scale dynamically, using only the resources it needs at any given instant, yet drawing on all the resources it requires when the need arises. The development of these technologies is now being expedited, as building the infrastructure for accurate models that combine vast data sets with the physiology and genomics of individuals has become a global priority.

In turn, the technology will usher in an era where drug therapies are specifically optimised to the individual. A personalised approach to healthcare will enable a rigorously scientific approach not just to the eradication of illness but to the optimisation of our wellbeing and happiness. And although we need to see the impact of these developments before racing to conclusions, as we track our lives and health with richer data than ever before, we will discover things about health, wellbeing and longevity that seem inconceivable today.

Rashid Mansoor, Chief Technology Officer & Co-Founder, Hadean

Rashid is the Chief Technology Officer and Co-Founder of Hadean, a deep-tech distributed computing company based in London, UK. A member of the inaugural Entrepreneur First cohort, he previously founded VC-based data intelligence startup, Adbrain. Today, Rash is the driving force behind the technical innovation at Hadean.