Picking the right data model – how to make the right decision on your data management

The role of the data model is becoming more important.

Data has always been important to businesses. However, there is so much data getting created today – and so many new ways to work with it – that the role of the data model is becoming more important. The proliferation of new database technologies in recent years has made it increasingly difficult to choose the right approach for a particular application. So how do you make the right choice?

What are your database options?

Every database system has different characteristics, based on the choices made on how to deal with different issues around data management. Each of these decisions is a tradeoff based on what is most important. For example, is consistency of data more important than availability of the database? Should it be easier to scale up an instance or should the focus be on scaling out?

Alongside these considerations, there are different models for how to organise data within a database. These models work in different ways and can meet different business needs. These include:

•   Key-value - this describes databases where the emphasis is on the mapping relationship from key to value.
•   Wide Column Store - this covers support for relationships between the column-values of a single row.
•   Document- this includes databases that support nested relationships within a single document.
•   Relational - the most popular database type for the past twenty five years, these support arbitrary foreign-key relationships between relations. However, having numerous relationships within the data leads to complex schemas and JOIN-intensive queries.
•   Graph- rapidly growing in popularity, this describes databases that support arbitrary relationships between entities in a succinct representation and through direct access in the query language.

Each of these database types will have its benefits and drawbacks, and can be applied to specific use cases. However, as companies make use of more data within their operations, running multiple databases and data management platforms concurrently will become necessary. To make the right technical choice, it’s therefore important to look at how to achieve the best possible result for the business.

What should your data management model be?

Choosing the right data model can be made easier by discussing how valuable or important are the relationships in the data for our application? Answering this question means considering the following points:

•   How many relationships are there in the data?
•   How complex are those relationships, i.e. how many types of relationships are there and do those relationships have properties or annotations?
•   How frequently does the application query those relationships as part of a user request?
•   How frequently does the data produced by those users change? Are more sources being added on a regular basis?

Answering these points can provide indicators of how valuable the relationships within a data set will be. For data with lots of relationship points, the value will potentially be higher. However, these answers can be subjective ones. By looking at the complexity of the relationship - and the monetary value of that relationship to the business as well - it’s possible to make a more informed choice of which data management model to use.

So what do these relationships look like in practice?

•   A URL shortening service would use a database system to map user provided URLs to short and unique character sequences. In other words, the only relationship is between the original URL and corresponding shortened URL. This is a great fit for the key-value data model.
•   An advertising platform would need to capture various attributes from a list for each of the web properties under management. Due to the required flexibility and sparse row structure, a tabular model is a great fit.
•   A media management application would need to store meta-data about the different forms of media. For books, this would include information about chapters and sections. In other words, we are interested in the nested relationships of a book, which is a great fit for the document data model.
•   An ERP (enterprise resource planning) application would need information on various business entities such as inventory, purchases, and customers as well as some relationships between them. This is a great fit for the relational model.
•   A “Customer 360 application” involves bringing all customer information into one place from multiple channels, so that it can be analysed in context. This is a great fit for a graph data model.

Whatever application or service you are putting together; you may see opportunities to deploy an appropriate data management model. However, this will need to be supported over time. Similarly, the type of relationship that the company may want to track or create value from may change as well.

With this in mind, it can be tempting to choose a more expressive data model like relational or graph irrespective of what your application actually needs. However, a poor data model choice here can be costly, both in terms of support costs and the time to migrate over to another data model in the future.

From a software and application development point of view, the wrong choice around data models typically leads to additional development effort. However, the real challenges occur due to more complicated code being required around the data access layer.

With software maintenance costs to consider, the wrong model can lead to additional ‘technical debt’ to get things working and other dysfunctional software engineering behaviour due to work-arounds or more code implementations to get things working. Ultimately, this can result in both budget overruns and the risk of missing deadlines. Alongside this, there can be a lot of frustration and finger pointing.

As an example, you could use a relational database to integrate all your customer data into a “customer 360” application. While this can work from a technical perspective, it would require additional data modelling discussions to be carried out. It could also need the development of more code in order to implement the simplest recommendation query.  Even if you manage to create “customer 360” model, what if you wanted to change the way you view the data?

As an example, this could be used to transition from an individual customer’s record to that of the household or the location. In this example, we can make the “house” the centre of the model as this is the customer buying ongoing services like insurance and utilities through to more one-off requirements like maintenance.

In fact, this application could be supported with a document store as well. However, rather than using the database to run the data management logic and create value from the data, this would have to be located elsewhere. In this case, it would have to be held inside the application itself, which adds a tremendous development overhead over time.

By using the graph data model, the business logic can be maintained separately to the application. This would then compile a comprehensive profile of each customer including recent purchases, product views, written recommendations, shopping cart activity, and wish list entries. This data could then be used to look for patterns around who bought those products. Choosing a graph data model can therefore make it easier to maintain over time, as well as generating value for the business.

The flipside of this is that more complex data models should be discounted when they aren’t necessary. If you store a series of transaction records in a graph data model, using a graph query language like Gremlin will feel awkward and overly complicated. Simpler data models are easier to discuss from a performance perspective and often have less overhead. In addition, the mapping from the logical data model to the physical representation of the data inside the system is more straightforward.

More complex data models like relational or graph will often rely on query optimisers and a number of index structures for efficient query execution. This entails more data access paths, more configuration options, and higher complexity in the query execution. If you don’t need the extra relationship building capabilities of those data models, you could end up paying for their overhead both in terms of performance and for those with the right skills to manage the implementation.

Getting the best fit for the whole application or service

Choosing the right data model for your application or use case is important because a poor fit can be very costly. However, many applications will create data that may be used in multiple ways for the business. This can lead to situations where multiple data models may be needed for different use cases. An example here might be an application that uses IoT data and then analyses it for patterns and value.

Internet of Things devices tend to create a lot of data over time, sending updates back to the central servers on a regular basis. Get thousands of devices all providing their updates every fifteen minutes, and you have a sizable time-series database requirement. However, running analytics against this set of data may require a different set-up depending on how the data will be used.

Will the data coming in be used for “real time” analytics or only ever as part of a wider data set? Are there relationships across the whole data set that can be seen using a graph data model while the time-series data itself should be stored in a separate way?

By looking at what value can be created using the data, different data models may be required as part of the overall application or service. Running concurrent data models can therefore mean that data has to be moved or streamed to where it is needed. However, each data model and its accompanying data management platform will need to be supported, particularly for production systems.

Implementing these kinds of “multi-model” databases will become more important over time. For example, Gartner predicts that “by 2017, all leading operational DBMSs will offer multiple data models, relational and NoSQL, in a single platform," according to a research note from Nick Heudecker, Merv Adrian and Etisham Zaidi. The aim here is to make the support for production systems easier, while still picking the right data model for the application.

Looking ahead, the value of data models requires a great deal of thought to start off with. Making the wrong choice at the beginning of a project can add unnecessary cost and software development overhead. However, as applications scale up their use of data and move into production deployments based on value created by data, the need to run multiple data models concurrently will grow too. For developers, thinking in multiple ways around data will be an essential skill; multi-model data management tools will help them do this successfully.

Martin James, VP Northern Europe at DataStax
Image source: Shutterstock/alexskopje