In the modern world of Big Data and the Internet of Things (IoT), the old adage that knowledge is power has perhaps never been truer. Businesses operate in an increasingly cacophonous environment, and gaining a competitive advantage in the collection and processing of data is rapidly becoming an imperative. According to Gartner, in about three years’ time, more than half of new major business processes and systems will incorporate some element of the IoT. Research from BI Intelligence predicts that more than $6 trillion will be invested in the IoT over the next five years.
To put these numbers in perspective, current estimates put the number of connected “things” at around 6.4bn, roughly comparable to the number of people on earth, yet by 2020 there will be approximately three connected devices for every person alive. In this setting, businesses that can harness the power of data and maximise its value are on course to flourish, and those that ignore the information trend do so at their own peril. Yet, for the vast majority of enterprises, the question has long since moved on from whether or not to store and process data: that is a given. The questions that arise now are more specific: what kind of data should a business be interested in; how often should this data be collected; how trustworthy is it; how can the business ensure it does not lose essential data over time; and for how long should the data be kept?
The initial temptation is to collect as much data as possible and deal with processing it when the time comes. The Data Lake phenomenon is a perfect example of this attitude, with businesses using data lakes to complement, and sometimes replace, their more traditional data warehouses. Data lakes allow you to store anything, and as much of that anything as you want, which is not necessarily a good thing. However technical the questions you ask, knowing what you are looking for (and why) is paramount when there are such huge amounts of data to sift through.
The Data Lake approach indicates poor planning at the very least, especially if the business wants to keep pace with technological developments in data analysis. At worst, it indicates that the person making the decision is unaware of just how much data is meant by Big Data.
To truly cope with the amount of data being generated, companies should view their data in terms of a supply chain, with a beginning, a middle and an end. An organised plan for how data is brought in, explored, transformed and retained allows businesses to maximise the value they extract from their records. Yet, while goodwill exists among many executives, many businesses still lack the infrastructure to handle current data flows, and those that have recently caught up face scalability problems in a future of exponential data growth. This does not mean they should take a Spartan attitude to data, merely a more targeted one.
One of the first questions to ask when deciding on the infrastructure to use is whether the data needs to come in constantly or only when a specific event takes place. The former, time series data, carries a time stamp and lets its users gain valuable insights, power digital transformations and drive more effective customer engagement. It is heavier than event-driven data and almost by definition arrives with greater frequency, so it requires more resources to manage. Of course, none of this is possible without a supporting infrastructure to collect, process and store the relevant data, even when this data is not homogeneous. Modern NoSQL databases inherently solve many of the data volume and diversity issues that come with IoT, but these too are having to evolve to provide optimum storage and retrieval for time series data.
Solving the unique challenges that come with time series data requires a database solution that is specifically optimised for the task. So, given that IoT will become an integral part of most businesses in the near future, what questions should enterprises ask when deciding which time series database solution is right for their specific data needs? The selection below is a good place to start.
- Does it have the right structure? Is the database in question designed to achieve maximum performance and availability for this specific data type? A distributed NoSQL database designed for time stamped data will provide the read and write performance, as well as the scalability and availability, that IoT applications require, while running on commodity hardware to reduce overall operating costs. A distributed design also allows businesses to keep data close to the customer, so that personalised experiences can be served from local data centres.
- Does the database optimise co-location? Time series data is usually accessed by source, location and time range. A NoSQL database must guarantee fast response times to ensure the validity of data, no matter the total volume. This can be achieved by co-locating time series data that shares the same quantum of time (a fixed time interval), source and geohash.
- How fast and easy are range queries? A NoSQL database with a SQL-like query language offers familiar semantics, so users can quickly write queries for data analysis.
- Does the database provide high write performance and availability? Time series databases often need to handle high write workloads, and staying available at high scale is not a trivial capability. Understanding how a database continues to perform under failure scenarios, and whether it meets your availability requirements, is essential. A masterless architecture is regarded as a prime choice.
- How does the database address data lifecycle? As time series data ages, the value of the data lessens and the business need for finer resolution reduces. To be able to store and retrieve the data efficiently, organisations must have the ability to roll up, compress or even phase out the data according to their needs. Consider data lifecycle management needs as part of your long term retention strategy.
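To make the co-location, range query and lifecycle points above more concrete, here is a minimal Python sketch. It is an illustration only, not the API of any particular database: an in-memory dictionary stands in for the storage layer, the 15-minute quantum is an assumed value, and every name in it (`partition_key`, `co_locate`, `range_query`, `roll_up`, the `sensor-1` source and the `u10j` geohash) is hypothetical.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import mean

QUANTUM_SECONDS = 15 * 60  # assumed quantum of time: 15 minutes


def partition_key(source, geohash, ts, quantum=QUANTUM_SECONDS):
    """Readings with the same source, geohash and time quantum share one key."""
    epoch = int(ts.timestamp())
    return (source, geohash, epoch - epoch % quantum)


def co_locate(readings, quantum=QUANTUM_SECONDS):
    """Group (source, geohash, timestamp, value) readings by partition key."""
    partitions = defaultdict(list)
    for source, geohash, ts, value in readings:
        key = partition_key(source, geohash, ts, quantum)
        partitions[key].append((ts, value))
    return partitions


def range_query(partitions, source, geohash, start, end, quantum=QUANTUM_SECONDS):
    """Return points in [start, end), touching only the overlapping partitions."""
    first = int(start.timestamp())
    first -= first % quantum
    last = math.ceil(end.timestamp())
    hits = []
    for q in range(first, last, quantum):
        for ts, value in partitions.get((source, geohash, q), []):
            if start <= ts < end:
                hits.append((ts, value))
    return sorted(hits)


def roll_up(points, bucket_seconds):
    """Reduce resolution for ageing data: average the values in each bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        epoch = int(ts.timestamp())
        buckets[epoch - epoch % bucket_seconds].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Sample data: four readings from one sensor at minutes 0, 5, 20 and 40.
t0 = datetime(2016, 1, 1, tzinfo=timezone.utc)
readings = [("sensor-1", "u10j", t0 + timedelta(minutes=m), float(m))
            for m in (0, 5, 20, 40)]
partitions = co_locate(readings)
recent = range_query(partitions, "sensor-1", "u10j",
                     t0 + timedelta(minutes=5), t0 + timedelta(minutes=30))
hourly = roll_up([(ts, v) for _, _, ts, v in readings], 3600)
```

Because the key combines source, geohash and time quantum, a range query only needs to visit the few partitions that overlap the requested window, whatever the total volume. The roll-up step shows one way older data could be kept at coarser resolution; a real deployment might instead compress or expire it.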
While these questions are by no means exhaustive, they offer a short guide to selecting a time series database for a new IoT project. Of course, it may be that you’re not looking to deploy an IoT project, and a time series database is not what your business needs right now. In that case, the question to ask is: “Will our business needs change in the highly competitive IoT environment of tomorrow?”
Image Credit: Chesky / Shutterstock
Emmanuel 'Manu' Marchal, MD, Basho