Big data projects and state-of-the-art data science models are using artificial intelligence (AI) and machine learning (ML) to drive innovation across financial services, healthcare, government and other sectors. Take the healthcare industry, for example, which is expected to spend roughly $23 billion globally on big data analytics by 2023, according to P&S Intelligence. Medical and life sciences organisations are embarking on AI and ML initiatives to unlock complex data sets with the goal of preventing diseases, speeding recovery and improving patient outcomes. Financial services institutions are using these systems to bolster fraud-detection efficacy, governments are applying them to public data sharing in support of R&D and improved public services, and the list goes on.
The sensitive nature of the data used in deep learning projects – including data ownership issues and regulatory requirements such as the General Data Protection Regulation (GDPR), HIPAA and financial data privacy rules – requires organisations to go to great lengths to keep information private and secure. As a result, data sets that could be tremendously valuable in concert with other initiatives (or organisations) are often locked away and guarded, creating data silos. But as a variety of industries begin to spread their wings with AI and ML technology, we’re seeing a groundswell of demand for innovative, trusted and inclusive solutions to the data collaboration problem. Organisations are asking for a way to execute deep learning algorithms on data sets from multiple parties, while ensuring that the source data are neither shared nor compromised, and that only the results are shared with approved parties.
A few years back, attempts were made to address this challenge by moving data to the compute mechanism. This approach involved moving data sets from various parties’ edge nodes to a centralised aggregation engine. The data were then run through the aggregation engine at a central location in a Trusted Execution Environment (TEE) – an isolated, private execution environment within a processor, such as Intel SGX – so only the output or results of the query could be shared, while the data themselves were kept private.
This “centralised data aggregation model” led to a new set of challenges. Moving data from one site to another can be a significant burden on an organisation, due to the sheer size of a data set or to data privacy and storage regulations that simply make it impossible. Additionally, this approach brought many data normalisation challenges. For example, data sets from various healthcare institutions often come in different file formats, with fields of information that don’t align with those of other parties. Without a common schema across all participating data sets, aggregation could be incredibly arduous, or even impossible. Lastly, “moving data to the compute” required a tremendous amount of upfront commitment and cooperation from IT personnel at each organisation involved.
The overall goal of this early approach was to address the privacy and security problems that were so prevalent in big data collaboration projects. While it provided some benefits, it turned out to be a less than optimal method. However, it led to a new approach called “Federated Machine Learning.”
Federated Machine Learning is a distributed machine learning approach that enables model training on large bodies of decentralised data, ensuring secure, multi-party collaboration on big data projects without compromising the data of any parties involved. Google first coined the term in a paper published back in 2016, and since then the model has been the subject of active research and investment by Google, Intel and other industry leaders, as well as academia.
In this approach, the data aren’t moved at all. Contrary to previous techniques, compute actually “moves to the data.” So, Federated Machine Learning brings processing mechanisms to the data source for ongoing training and inferencing at the source, instead of requiring that participating organisations migrate data to one centralised location. As a result, processing is done by organisations onsite (or in-network), and the results are sent to a centralised location where the model is simply updated through aggregation.
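The round-based flow described above – local training at each site, followed by aggregation of only the model updates – can be sketched in a few lines. This is a minimal, illustrative sketch of federated averaging on a toy linear model; the function names, the two-party setup and the hyperparameters are all assumptions for demonstration, not part of any specific product or paper.

```python
# Minimal sketch of federated averaging on a linear regression model.
# All names and parameters here are illustrative assumptions.
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """Edge node: train locally on private data; only weights leave the site."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # mean-squared-error gradient
        w -= lr * grad
    return w

def aggregate(updates, sizes):
    """Central aggregator: weighted average of the parties' model updates."""
    total = sum(sizes)
    return sum(w * (n / total) for w, n in zip(updates, sizes))

# Two parties hold disjoint data that never leave their sites.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
X1, X2 = rng.normal(size=(50, 2)), rng.normal(size=(80, 2))
y1, y2 = X1 @ true_w, X2 @ true_w

global_w = np.zeros(2)
for _ in range(20):                         # federated rounds
    u1 = local_update(global_w, X1, y1)
    u2 = local_update(global_w, X2, y2)
    global_w = aggregate([u1, u2], [len(y1), len(y2)])

print(np.round(global_w, 2))
```

The key property is visible in the loop: the raw arrays `X1, y1` and `X2, y2` are only ever read inside each party’s `local_update`; the aggregator sees nothing but weight vectors.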
Protected at the hardware level
Federated Machine Learning addresses some major data collaboration privacy issues, but we’re still left with some questions to answer. For instance, the data might remain private, but is the aggregation model secure from theft or tampering that could lead to data leakage? Or if the model itself is secure, are the communication links between federated nodes and the aggregator secure from interference?
To answer these questions, we have to look at the role hardware technologies play in the Federated Machine Learning process. As big data projects leveraging AI and ML continue to take off, participants must be protected through security layers down to the silicon. Federated Learning systems can deploy hardware-based TEEs at the participants’ edge nodes as well as at the central aggregation engine (where the aggregation model resides). This ensures that both the model training at the edge and the aggregated model itself are computed inside a trusted environment, protecting the confidentiality and integrity of code and data. Communications between the edge nodes and the aggregation engine are likewise protected from tampering. This removes many of the issues that arose from moving data to compute.
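To make the integrity guarantee above concrete, here is an illustrative sketch of the check a TEE-backed aggregator might perform before accepting an edge node’s update. A real deployment would rely on hardware remote attestation (e.g. Intel SGX) and attested channels; the shared-key HMAC below is only a stand-in for that mechanism, and every name in it is hypothetical.

```python
# Illustrative only: integrity check on a model update before aggregation.
# A real TEE would use remote attestation; HMAC is a stand-in here.
import hashlib
import hmac
import numpy as np

ATTESTATION_KEY = b"provisioned-during-attestation"  # hypothetical shared secret

def sign_update(weights):
    """Edge node: tag the serialised update inside its trusted environment."""
    payload = weights.astype(np.float64).tobytes()
    tag = hmac.new(ATTESTATION_KEY, payload, hashlib.sha256).digest()
    return payload, tag

def verify_and_load(payload, tag):
    """Aggregator: reject any update whose tag fails verification."""
    expected = hmac.new(ATTESTATION_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, tag):
        raise ValueError("update failed integrity check; discarding")
    return np.frombuffer(payload, dtype=np.float64)

w = np.array([0.5, -0.25])
payload, tag = sign_update(w)
assert np.array_equal(verify_and_load(payload, tag), w)

tampered = payload[:-8] + b"\x00" * 8    # simulate in-transit tampering
try:
    verify_and_load(tampered, tag)
except ValueError as err:
    print(err)
```

The design point is that a tampered update is rejected before it ever touches the aggregation model, so a single compromised link cannot poison the shared result.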
In a Federated Machine Learning model, both compute and data are protected at the hardware level across the entire system, within TEEs. This gives the parties involved confidence in the privacy and security of both the data sets and the machine learning model, guarding against confidentiality leaks and data-integrity attacks.
As we look ahead, we can expect hardware TEE-enabled Federated Machine Learning to produce major breakthroughs in big data collaboration. Imagine a future where this technology facilitates a trusted, global data-sharing playing field that enables organisations to unlock previously untapped data sets for collaborative analysis with other organisations. Access to large, reliable data sets is essential to the development and deployment of robust and trusted AI/ML solutions across every industry.
There’s no doubt that creating a trusted, decentralised data collaboration model will generate far-reaching benefits, but there’s still a significant amount of work to be done to reach widespread commercial adoption. The industry needs technology leaders in computing hardware and blockchain, world governments, regulatory bodies around the globe, standards organisations, public and private participants, and more to collaborate with one another. And as with any machine learning application, access to data will be key to success. Organisations across many data science disciplines must work together to develop a common schema spanning their various data sets, ensuring the availability of quality data while minimising bias and quality issues.
Nikhil M. Deshpande, Director of AI and Security Solutions Engineering in the Data Platforms Group, Intel