The rise of graph data science as part of a data scientist’s toolbox will be central to the next decade. In its June ‘Top 10 Data and Analytics Technology Trends for 2020’ report, Gartner states, “Finding relationships in combinations of diverse data, using graph techniques at scale, will form the foundation of modern data and analytics.”
A few months after it released that statement, Gartner surveyed companies about their use of AI and ML techniques. In that survey, a remarkably high 92 percent of respondents said they plan to employ graph techniques within five years. There is also a surge of academic research focused on this field, with over 28,000 peer-reviewed scientific papers about graph-powered data science published in recent years.
The pace is accelerating. Publication volume is tracing a clear exponential curve, and the machine learning research community has made graph data science a top area of focus.
Democratizing state-of-the-science techniques
Graph data science has a long history, of course. Its foundations date to the 1700s, when Leonhard Euler established graph theory as a discipline in mathematics. More recently, Google used the graph-based PageRank algorithm to revolutionize search engines, which was the first time graph techniques were used at scale in the software industry.
A quarter of a century later, graph data science is no longer reserved for companies such as Google that have the AI expertise and resources to use it. This powerful and innovative technique uses graph algorithms and embeddings to reason about the ‘shape’ of the connected context around each piece of data, enabling far richer machine learning predictions.
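To make the idea of a graph algorithm turning connected context into a model input more concrete, here is a minimal sketch of a simplified PageRank-style calculation in Python. It is an illustration under stated assumptions, not Google's production algorithm: the four-page link graph, the `pagerank` function name and the damping value of 0.85 are all chosen for the example, and the code assumes every page has at least one outgoing link.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank.

    Assumes every node has at least one outgoing link and that every
    link target is also a key of `links` (no dangling nodes).
    """
    nodes = sorted(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node starts each round with the same small baseline.
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node, targets in links.items():
            # A node splits its damped rank evenly among the pages it links to.
            share = damping * rank[node] / len(targets)
            for target in targets:
                nxt[target] += share
        rank = nxt
    return rank


# Hypothetical link graph: pages a, b and d all link to c; c links back to a.
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
SCORES = pagerank(LINKS)

for page in sorted(SCORES, key=SCORES.get, reverse=True):
    print(page, round(SCORES[page], 3))
```

Each page ends up with a single score that reflects how it sits in the link structure, which is the kind of value that could be added as a column to a conventional feature table.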
Graph data science democratizes these innovations to upend the way enterprises make predictions in many diverse scenarios, from fraud detection to tracking customer or patient journeys, to drug discovery and knowledge graph completion. In a drug discovery use case, this means not only identifying possible new associations between genes, diseases, drugs and proteins, but also providing immediate context to assess the relevance or validity of these discoveries. For customer recommendations, it means learning from user journeys to predict accurate recommendations for future purchases, while presenting options within their buying history to build confidence in suggestions.
The ability to rapidly ‘learn’ generalized, predictive features from data is significant, because organizations don’t always know how to leverage connected data in machine learning models. Knowledge graphs provide value across domains, whether that means identifying new associations between genes and diseases, discovering new drugs or predicting links between customers and products for better recommendations. Increasingly, data scientists acknowledge that much of their work isn’t really possible without graph technology, from the queries that help domain experts uncover patterns to the identification of high-value features for training ML models.
Graph data science use cases
To take one instance, graphs are being deployed at the top of the British government. In a recent GOV.UK blog post, ‘One Graph to rule them all’, Whitehall data scientists Felisia Loukou and Dr. Matthew Gregory discuss deploying their first machine learning model with the help of graph technology. The resulting model automatically recommends content to GOV.UK users based on the page they are visiting. They explain that their application, given any graph, learns continuous feature representations (a list of numbers) for the nodes, which can then be used for various machine learning tasks, such as recommending content. The government data scientists note, “Through this process, we learned that creating the necessary data infrastructure which underpins the training and deployment of a model is the most time-consuming part.”
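The idea of representing each node as a list of numbers and then recommending by similarity can be sketched in a few lines of Python. This is a deliberately simple stand-in and not the GOV.UK team's model: instead of a learned embedding, each page's vector here is built from exact random-walk visit probabilities over a few hops. The six-page link graph and the names `walk_features`, `cosine` and `recommend` are hypothetical.

```python
from math import sqrt

# Hypothetical page-link graph: two tightly linked groups joined by one link.
LINKS = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e", "f"],
    "e": ["d", "f"],
    "f": ["d", "e"],
}
NODES = sorted(LINKS)


def walk_features(start, steps=3):
    """Return one number per node: the summed probability that a random
    walk starting at `start` is on that node after 1..steps hops."""
    dist = {n: 0.0 for n in NODES}
    dist[start] = 1.0
    total = {n: 0.0 for n in NODES}
    for _ in range(steps):
        nxt = {n: 0.0 for n in NODES}
        for node, prob in dist.items():
            for neighbour in LINKS[node]:
                nxt[neighbour] += prob / len(LINKS[node])
        dist = nxt
        for n in NODES:
            total[n] += dist[n]
    return [total[n] for n in NODES]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))


FEATURES = {n: walk_features(n) for n in NODES}


def recommend(page, k=2):
    """Pages whose feature vectors are most similar to `page`."""
    scored = [(cosine(FEATURES[page], FEATURES[other]), other)
              for other in NODES if other != page]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [other for _, other in scored[:k]]


print(recommend("a"))
```

A production system would typically learn lower-dimensional vectors with a dedicated embedding algorithm; the sketch only shows why pages that share a neighbourhood end up with similar vectors.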
Ben Squire, senior data scientist at leading media and marketing services company Meredith, shared his experience with graph data science work, stating that graph algorithms have allowed the company to transform billions of page views into millions of pseudonymous identifiers with rich browsing profiles. “Providing relevant content to online users, even those who don’t authenticate, is essential to our business,” he points out. “Instead of ‘advertising in the dark,’ we now better understand our customers, which translates into significant revenue gains and better-served consumers.”
Likewise, the world's leading manufacturer of construction and mining equipment, Caterpillar, is using graph data science to make natural language processing of a large-scale repository of technical repair documents more effective. The problem was that there was a lot of disparate data to connect. The company recognized there was valuable data housed in more than 27 million documents and set about creating a natural language processing tool to uncover the unseen connections and trends in them. The resulting graph-based machine learning classification tool learns from the portion of the data already tagged with terms such as ‘cause’ or ‘complaint’ and applies those tags to the rest of the data. The system uses WordNet as a lexicographic dictionary to provide definitions for words, the Stanford Dependency Parser to parse the text, and graphs to find patterns and connections, build hierarchies and add ontologies. Once this is all put together, users can conduct meaningful, data science-enhanced searches.
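The step of learning from already tagged text and applying those tags to the rest can be illustrated with a toy Python sketch. It is not Caterpillar's system and uses neither WordNet nor the Stanford Dependency Parser: it simply links each tagged snippet to its words and lets untagged snippets inherit the label that reaches them through the most shared words. The repair snippets and the names `build_term_graph` and `propagate` are invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical tagged repair snippets (text -> tag).
TAGGED = {
    "engine overheats under load": "complaint",
    "operator reports engine stalls": "complaint",
    "coolant hose cracked": "cause",
    "fuel filter clogged": "cause",
}


def build_term_graph(tagged):
    """Link every term to the tags of the snippets it appears in."""
    term_labels = defaultdict(Counter)
    for text, label in tagged.items():
        for term in set(text.lower().split()):
            term_labels[term][label] += 1
    return term_labels


def propagate(text, term_labels):
    """Give an untagged snippet the tag with the most votes arriving
    through shared terms, or None if it shares no terms at all."""
    votes = Counter()
    for term in set(text.lower().split()):
        votes.update(term_labels.get(term, {}))
    return votes.most_common(1)[0][0] if votes else None


TERM_GRAPH = build_term_graph(TAGGED)

print(propagate("engine stalls under load", TERM_GRAPH))
print(propagate("hose clogged", TERM_GRAPH))
```

A real pipeline would connect snippets through parsed grammatical structure and dictionary relationships rather than raw word overlap, but the propagation of known tags across graph connections follows the same pattern.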
Another example is how NewYork–Presbyterian Hospital's analytics team is using graph data science to better track infections and take strategic action to contain them. The team says they chose this approach because it offers a flexible way to connect all the dimensions of an event: what happened, when and where. Effectively, NewYork–Presbyterian Hospital wanted to log every event, from the time a patient was admitted, through all of the tests they underwent, to their eventual release. The team first created a ‘time’ tree and then a ‘space’ tree to model all the rooms in which patients could be treated on-site. This initial model revealed a large number of inter-relationships, but that alone did not meet their goals, so an event entity was added to connect the time and location trees. The resulting data model means the analytics team is able to analyze everything that happens in its facilities. The graph dataset was fed into a community detection graph algorithm, which grouped events into various specialties such as oncology and pediatrics, validating all the modelling work.
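A rough Python sketch of that data model, with an event connecting a leaf of a time tree to a leaf of a space tree, might look like the following. The field names, the sample events and the two query functions are assumptions made for illustration, not the hospital's actual schema; the tree levels (year, month, day and building, floor, room) are likewise guesses at a plausible hierarchy.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    patient: str
    kind: str    # e.g. "admission", "test", "discharge"
    day: tuple   # path in the time tree: (year, month, day)
    room: tuple  # path in the space tree: (building, floor, room)


# Hypothetical events; each one ties a patient to a time and a place.
EVENTS = [
    Event("p1", "admission", (2020, 3, 1), ("main", 2, "201")),
    Event("p2", "test", (2020, 3, 1), ("main", 2, "201")),
    Event("p3", "test", (2020, 3, 1), ("main", 2, "202")),
    Event("p4", "test", (2020, 3, 2), ("main", 2, "201")),
]

# Index events by (day, room) and by (day, floor); the floor is simply
# the parent of the room in the space tree.
by_day_room = defaultdict(set)
by_day_floor = defaultdict(set)
for ev in EVENTS:
    by_day_room[(ev.day, ev.room)].add(ev.patient)
    by_day_floor[(ev.day, ev.room[:2])].add(ev.patient)


def same_room_same_day(patient):
    """Other patients with an event in the same room on the same day."""
    contacts = set()
    for ev in EVENTS:
        if ev.patient == patient:
            contacts |= by_day_room[(ev.day, ev.room)]
    contacts.discard(patient)
    return sorted(contacts)


def same_floor_same_day(patient):
    """Widen the search one level up the space tree, to the floor."""
    contacts = set()
    for ev in EVENTS:
        if ev.patient == patient:
            contacts |= by_day_floor[(ev.day, ev.room[:2])]
    contacts.discard(patient)
    return sorted(contacts)


print(same_room_same_day("p1"))
print(same_floor_same_day("p1"))
```

Because every event carries its full path in both trees, a question can be widened or narrowed (room to floor, day to month) by comparing a shorter or longer prefix of the path, which is one reason a tree-plus-event layout suits contact-style queries.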
Truly predictive AI
Use cases like those at the British government, Meredith, Caterpillar and NewYork–Presbyterian Hospital are the tip of the graph data science iceberg. Gartner predicts that within three years, a quarter of global Fortune 1000 companies will have built a skills base in graph technologies and will be leveraging them as part of their data and analytics initiatives.
Graph algorithms will improve your AI and machine learning initiatives. Enterprises need to investigate how to incorporate graph analytics into their analytics portfolios. The potential of graph-powered data science for truly predictive AI is huge.
Dr Alicia Frame, Lead Product Manager – Data Science, Neo4j