For many companies, the ability to use data in real time is now an essential part of how they are competitive. The sheer volume of data available to companies can be a blessing or a curse, depending on how well they are able to handle that data as it is created. At the same time, the adoption of 5G should lead to more data being created.
To cope with the deluge of data that these applications both create and demand, companies are turning to new databases, and the majority of these are based on open source. According to Gartner, more than 70 per cent of applications developed will run on open source databases by 2022, while 50 per cent of existing applications will switch from traditional proprietary databases to open source equivalents.
Deciding on the right approach for data
For Internet applications and mobile apps, the number of users that these services will have to support will mean that any move will have to support potentially huge numbers of transactions. At the same time, those applications have to be continuously available for customers. So any new database implementation will have to deliver that same capability.
Alongside this, any new database implementation will have to run in a distributed environment. With customers spread across the world, centralising applications and infrastructure has a poor impact on performance. At the same time, developers are using microservices to design their applications so they can scale up their applications more easily. This leads to the supporting infrastructure needing to support this approach too.
From a database perspective, running in a distributed fashion involves a couple of design choices. The first of these is whether to look at a model based on a ‘primary and secondary’ or ‘master and slave’ architecture, or at one that is fully masterless. Primary and secondary distributed database designs would include traditional relational databases and some NoSQL databases. These implementations have one node that directs all operations across a cluster, while the secondary nodes hold the data and carry out operations. However, the primary node can act as a bottleneck and single point of failure. Scaling up these clusters can also be more quickly limited.
Masterless database designs, such as Apache Cassandra, instead let any node be responsible for any transaction that comes in, and then organises copies of the data across the cluster for data resiliency. Data on any node that fails is also held on other nodes, so customer requests can be supported; equally, the data can be spread across multiple locations (‘data centres’ in Cassandra terminology) so even losing a whole location or cloud service will not affect the application or the service being delivered to customers.
To make the right decision here, there are several questions that you should ask. Is your application centralised or decentralised? How much data do you expect to cope with? How much downtime can your application cope with? Where will data be created and come from? And finally, how long will you have to retain that data for?
For applications that are centralised, have moderate to high data availability requirements, and moderate data growth, relational databases will be more than adequate. For applications that are decentralised like web or mobile apps, have continuous availability needs, and high data growth requirements, then a non-relational database will be a better fit.
Getting your requirements right around data
Alongside picking the right database, you’ll also have to pay attention to your data model and how you design your approach. This describes the attributes of data that you will be collecting and how they are arranged.
Typically, many new application designs will look at how to use frameworks to encapsulate and represent database components as application objects. This may use tools like Hibernate or LinQ to abstract the database specifics. However, this may not be the best approach for databases like Cassandra. Instead, using a Data Access Object Design approach can work better. Using Data Access Objects to encapsulate that data manipulation logic and design this upfront as much as possible can help application design teams as they want to build out higher level functionality.
Alongside this, there are several design requirements that you will have to consider. First is Data Movement Design – how your database will support integration with other systems for Extract Transform Load (ETL) operations, Change Data Capture (CDC) and feeding data into data pipelines. Capturing this data logic clearly is essential when it comes to making any integrations work effectively. Documenting this early can help make it easier over time when it comes to running transformation logic, error handling or data normalisation tasks.
Second here is Operational Design – this covers the day-to-day tasks required to run your database over time. Example tasks here would include how new nodes are deployed, configured and upgraded within clusters, how backup and restore operations are carried out, and monitoring and repair activities. This should also provide guidance on any disaster recovery processes that are required over time. By formalising these processes in advance, you can create playbooks of requirements that can be followed across the team.
The third area will include looking at areas like search and analytics. The data held in databases will frequently be needed for analytics or search operations, so defining these needs upfront can help everyone understand their requirements. For search, looking at how to define searchable terms, returned items, tokenisers, filters and multi-document search terms can make it easier to implement search across huge volumes of data without affecting performance. Similarly, understanding which analytics components will be needed and how they will integrate into the database can make the process much simpler.
Implementing databases at scale
Once the design process has been completed, it will be time to implement. This should consist of four stages:
Infrastructure – this covers how any hardware or cloud services are specified and software installed. With many developers now looking at cloud for running their databases, this process can be much faster than traditional implementations that required a mix of server, networking and storage hardware to be in place. Taking a hybrid cloud approach where a mix of internal IT and public cloud services can be used alongside each other is one good approach here, as this can provide more flexibility for future choices. If you want to deploy across multiple cloud services simultaneously, then considering this at the beginning can make things much easier.
Configuration – once the infrastructure is in place, it’s time to look at how the components are set up, connected and configured. For distributed computing environments in particular, this can be a critical time. Where possible, you should look at how you can automate operational items as this can optimise the process and cut down on potential human errors affecting deployment success.
Testing and Pre-production – checking that any new implementation is doing what it should be doing is a necessary step that should not be avoided. Unit testing can be difficult for distributed computing deployments, as it can be hard to see some of the issues that affect these environments when single nodes or units are implemented. Instead, it is worth looking at whole test clusters to get a better feel for how the service will perform in production when working ‘at scale.’ As part of this, going through an operational readiness checklist for dealing with common operational tasks like adding nodes or dealing with data centre deployments should be included
Scaling out – Once the initial deployment is up and running, it’s time to scale out the deployment. This represents how the database should run in normal operations, and support the application’s expected performance and data growth patterns. This can be achieved through adding nodes to expand capacity for data retention.
Making the move successfully
The shift to a new database approach provides an opportunity to rethink some of your ideas around data. For today’s businesses, a NoSQL database can support far more data writes and reads than traditional databases are capable of – a key consideration when applications have to scale up to deal with thousands or millions of transactions per second. When so much of our daily lives takes place through apps, the ability to scale up and support data is fundamental to success.
Planning any migration around data will involve looking at data modelling, at operational tasks, and how supplementary tasks like analytics and search can be accomplished in advance. By looking at these tasks as prerequisites early, any migration can be more successful.
Patrick Callaghan, Enterprise Architect, DataStax