Today’s cloud applications are intensely multi-faceted where data management is concerned. The data flowing through these applications is complex, ever-changing, large in volume, and highly connected.
The number of data relationships, combined with the application's data distribution, scale, performance, volume, and uptime requirements, makes a relational database a poor fit. However, these requirements are addressed natively by a graph database that possesses scale-out and active-everywhere capabilities.
A graph database is used for storing, managing and querying data that is complex and highly connected. Unlike most other ways of representing data, graphs are foundationally designed to express relatedness.
A graph database’s design makes it particularly well suited for exploring data to find commonalities and anomalies among large data volumes and unlocking the value contained in the data’s relationships. This allows graph databases to uncover patterns that are difficult to detect when using traditional databases.
Figure 1 – A simple graph data model.
Because a graph database focuses on data relationships, it is natural to wonder how it differs from other popular database technologies and when it should be used.
Comparing Graph with an RDBMS
An RDBMS (Relational Database Management System) and a graph database are similar in that both handle data containing connections or relationships between data elements. From a data model perspective, their components have the following surface-level similarities:
|Description||RDBMS||Graph Database|
|An identifiable “something” or object to keep track of||Entity||Vertex|
|A connection or reference between two objects||Relationship||Edge|
|A characteristic of an object||Attribute||Property|
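The mapping in the table can be illustrated with a minimal in-memory sketch of a property graph. Everything here (the `Vertex`, `Edge`, and `PropertyGraph` names, the sample people and company) is hypothetical and chosen only for illustration; it is not how DSE Graph or any other product stores data.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Vertex:
    """An identifiable object (the RDBMS 'entity')."""
    vertex_id: str
    label: str
    # Characteristics of the object (the RDBMS 'attributes').
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    """A connection between two vertices (the RDBMS 'relationship')."""
    label: str
    out_v: str  # id of the vertex the edge leaves
    in_v: str   # id of the vertex the edge points to
    # Edges can carry their own properties, unlike a bare foreign key.
    properties: Dict[str, Any] = field(default_factory=dict)


class PropertyGraph:
    """Hypothetical toy container; real graph databases index edges."""

    def __init__(self) -> None:
        self.vertices: Dict[str, Vertex] = {}
        self.edges: List[Edge] = []

    def add_vertex(self, vertex_id: str, label: str, **properties: Any) -> Vertex:
        vertex = Vertex(vertex_id, label, dict(properties))
        self.vertices[vertex_id] = vertex
        return vertex

    def add_edge(self, label: str, out_v: str, in_v: str, **properties: Any) -> Edge:
        edge = Edge(label, out_v, in_v, dict(properties))
        self.edges.append(edge)
        return edge

    def out(self, vertex_id: str, label: str) -> List[Vertex]:
        """Follow outgoing edges with the given label from one vertex."""
        return [
            self.vertices[e.in_v]
            for e in self.edges
            if e.out_v == vertex_id and e.label == label
        ]


graph = PropertyGraph()
graph.add_vertex("p1", "person", name="Alice")
graph.add_vertex("p2", "person", name="Bob")
graph.add_vertex("c1", "company", name="Acme")
graph.add_edge("knows", "p1", "p2", since=2015)
graph.add_edge("works_at", "p2", "c1", role="engineer")

# Names of the people Alice knows, found by following 'knows' edges.
known_names = [v.properties["name"] for v in graph.out("p1", "knows")]
```

Note that the edge is a stored object with its own label and properties, rather than a column value that points at another row.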
Foundationally an RDBMS and graph database differ in the underlying engine each uses to store and access data. However, the primary difference between a graph database and an RDBMS is how relationships between entities/vertexes are prioritised and managed. While an RDBMS uses mechanisms like foreign keys to connect entities in a secondary fashion, edges (the relationships) in a graph database are of first order importance.
In other words, relationships are explicitly embedded in a graph data model. Essentially, a graph-shaped business problem is one in which the concern is more with the relationships (edges) among entities (vertexes) than with the entities in isolation. One indicator that a graph database is a better choice than an RDBMS for a target use case is that application queries consistently require large, poorly performing SQL JOIN queries.
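The JOIN indicator can be seen in a small sketch that answers the same "friends of friends" question two ways: with a self-join over a relational table (using Python's built-in `sqlite3`) and with a traversal over an adjacency map. The `knows` table, the sample names, and the `reachable_at` helper are all assumptions made for illustration, and the in-memory traversal only stands in for what a graph engine does internally.

```python
import sqlite3

# Hypothetical sample data: (src, dst) means "src knows dst".
friendships = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "dave"),
    ("alice", "erin"),
    ("erin", "carol"),
]

# Relational approach: relationships live in a table and are
# reconstructed at query time with one self-join per hop.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE knows (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO knows VALUES (?, ?)", friendships)

two_hop_sql = """
SELECT DISTINCT k2.dst
FROM knows AS k1
JOIN knows AS k2 ON k2.src = k1.dst
WHERE k1.src = ?
ORDER BY k2.dst
"""
sql_result = [row[0] for row in conn.execute(two_hop_sql, ("alice",))]

# Graph-style approach: each vertex holds its outgoing edges,
# so each hop is a direct lookup rather than a join.
adjacency = {}
for src, dst in friendships:
    adjacency.setdefault(src, []).append(dst)


def reachable_at(start, hops):
    """Return the vertices reached from `start` after exactly `hops` edge steps."""
    frontier = {start}
    for _ in range(hops):
        frontier = {nxt for v in frontier for nxt in adjacency.get(v, [])}
    return sorted(frontier)


graph_result = reachable_at("alice", 2)
```

Both approaches return the same answer here, but going one hop deeper means adding another JOIN to the SQL text, whereas the traversal only changes the `hops` argument.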
The following comparison grid can be used to help in the decision making process of whether to use an RDBMS or a scale-out, real-time graph database like DataStax Enterprise (DSE) Graph for a particular use case:
|RDBMS||Graph Database (e.g. DSE Graph)|
|Simple to moderate data complexity||Heavy data complexity|
|Hundreds of potential relationships||Hundreds of thousands to millions or billions of potential relationships|
|Moderate JOIN operations with good performance||Heavy to extreme JOIN operations required|
|Infrequent to no data model changes||Constantly changing and evolving data model|
|Static to semi-static data changes||Dynamic and constantly changing data|
|Primarily structured data||Structured and unstructured data|
|Nested or complex transactions||Simple transactions|
|Always strongly consistent||Tunable consistency (eventual to strong)|
|Moderate incoming data velocity||High incoming data velocity (e.g. IoT)|
|High availability (handled with failover)||Continuous availability (no downtime)|
|Centralised, location-dependent application (e.g. single location), especially for write operations and not just reads||Distributed, location-independent application (multiple locations involving multiple data centres and/or clouds) for both write and read operations|
|Scale up for increased performance||Scale out for increased performance|
From a language and interface perspective, Gremlin is to a graph database what SQL is to an RDBMS. Gremlin is the open source graph traversal language of the Apache TinkerPop™ graph framework and is supported by many graph databases, including DSE Graph. While different from SQL, Gremlin is an extremely flexible, expressive, and easy-to-learn language that supports both transactional and analytical operations.
Comparing Graph with Other NoSQL Databases
Figure 2 – The data model continuum represented by complexity and data connectedness
The following comparison grid can be used to help determine when a tabular data model, such as the one found in Apache Cassandra, should be used versus a graph data model:
|Cassandra Tabular Data Model||Graph Model|
|Little to no value in data object relationships||Great value in data object relationships|
|Manual data denormalisation easy||Manual data denormalisation too complex|
|Data rarely joined together. If joins occur (e.g. with Spark, etc.), performance is acceptable||Data constantly connected and used to produce end result in performant manner|
|Write/read heavy||Read heavy; write moderate|
The data requirements of cloud applications and the limitations of existing databases usually leave IT organisations with no choice but to cobble together architectures from multiple database technologies. Inevitably, these architectures not only fail to fully meet the application’s requirements, but also prove difficult to develop against, difficult to administer, and cost prohibitive.
The solution to this problem is to use a data platform capable of supporting adaptive data management, often referred to as a “multi-model database”, which allows an architect to map the multi-faceted data requirements of an application to the most appropriate data model (e.g. graph, tabular, document, etc.) and have everything persisted to a single datastore.
Such a capability makes it easy to use graph technology alongside other data models and gives architects more of a hand-in-glove fit for their distributed cloud applications.
Robin Schumacher, VP of Products, DataStax