Data drives decisions. Accessing the right data at the right time unlocks innovation. And there’s the rub. Finding the right data in the right format quickly is not easy.
Ninety percent of the world’s data was created in the last two years alone. The volume of data is doubling every two years and, by 2020, it’s projected to reach 44 trillion gigabytes.
That’s a lot of data to sift through, especially if you are looking for specific, niche information to uncover a unique insight that could transform your business. It really is the proverbial needle in a haystack! And that means it takes valuable time and resources.
Indeed, we recently asked a panel of elite data scientists how true they consider the 80:20 data paradigm to be: 80% of time devoted to sourcing the data and getting it into a useable format, and 20% of time spent analysing it to generate insights. The majority of respondents (82%) either agreed or saw the situation as actually much worse.
The biggest headache of the data revolution is the time and resources spent by data scientists and business analysts in finding the right data.
Endless data bugbears
The bugbears are numerous: incomplete metadata; data in proprietary formats that needs to be converted; overly large and cumbersome data sets that are difficult to manage; unreliable or dirty data sets; licensing issues. The list goes on and on.
Some would-be insights rely on a very specific element of data, perhaps one year of a data set that spans 20 years. But to access that data you have to commit to the entire data set, along with the cost and effort of pulling out the element you need.
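To make the ‘one year out of 20’ problem concrete, here is a minimal Python sketch of pulling a single year out of a larger data set. The column names (`date`, `region`, `value`), the sample rows and the helper `rows_for_year` are all hypothetical, and the sketch assumes dates are stored as ISO strings (YYYY-MM-DD).

```python
import csv
import io
from typing import Iterable, Iterator


def rows_for_year(rows: Iterable[dict], year: int, date_field: str = "date") -> Iterator[dict]:
    """Yield only the rows whose ISO date (YYYY-MM-DD) falls in the given year."""
    prefix = f"{year:04d}-"
    for row in rows:
        # Treat a missing or empty date as "no match" rather than failing.
        if (row.get(date_field) or "").startswith(prefix):
            yield row


# Hypothetical sample standing in for a data set that spans many years.
SAMPLE_CSV = """date,region,value
1999-03-01,north,10
2005-07-15,south,12
2005-11-30,north,15
2018-01-09,south,20
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
slice_2005 = list(rows_for_year(reader, 2005))
print(len(slice_2005), "rows kept for 2005")
```

Note that the filter still has to read every row. When a provider only supplies the whole data set, the buyer pays for, stores and scans all 20 years to keep one.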
Even within standardised data, some panellists pointed out, source providers may use the same fields in different ways, making it difficult to crunch the numbers effectively.
Incomplete metadata is not the sole problem; at times the metadata itself is messy, unclear and huge. Columns get added to data sets over the years, leaving the data and metadata out of sync with each other and with previous versions.
Such delays and data barriers can have serious consequences. Some potentially life-changing insights cannot be pursued because data sets are not properly anonymised, as can happen with healthcare data. Even when the stakes are not that high, the productivity of data scientists and analysts is severely reduced because most of their time goes on data sourcing and preparation.
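As an illustration of data and metadata drifting apart, the sketch below compares the columns actually present in a data set with those described in its data dictionary. The column names, the dictionary entries and the helper `metadata_drift` are hypothetical; a real check would read both from the provider’s files.

```python
from typing import Dict, Iterable, List


def metadata_drift(columns: Iterable[str], documented: Iterable[str]) -> Dict[str, List[str]]:
    """Report columns that lack documentation and documented fields absent from the data."""
    actual = set(columns)
    described = set(documented)
    return {
        "undocumented": sorted(actual - described),
        "missing_from_data": sorted(described - actual),
    }


# Hypothetical header row of the current data file.
columns = ["id", "date", "region", "value", "value_v2"]

# Hypothetical data dictionary, written for an earlier version of the file.
data_dictionary = {
    "id": "Record identifier",
    "date": "Observation date",
    "region": "Sales region",
    "value": "Measured value",
    "unit": "Unit of measure",
}

drift = metadata_drift(columns, data_dictionary)
print(drift)
```

Here the check would flag `value_v2` as a column nobody has described and `unit` as a documented field that no longer appears in the data.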
You get the picture.
Addressing the headache
These frustrations run counter to the huge excitement about the potential offered by the data revolution. That potential can only be realised if data can be discovered easily, understood clearly, accessed seamlessly and then used effortlessly.
Time is money, and managing your data should be as easy as managing your music, photo and video libraries online using the likes of Spotify, Netflix or Amazon. Cutting the time it takes to locate and unlock specific data sets would free up time to focus on the innovative, valuable insights that inform strategic decisions.
This need is amplified as the data bandwagon gathers ever-greater speed, with not only more advocates but also more and more data coming online.
Take geospatial data, for instance. Almost every piece of data has a location and a time aspect to it, which multiplies the insights it can yield and so makes it potentially more valuable. Geospatial insights cut across industry boundaries. A health application could use them to understand how a disease spreads geographically, whereas the renewable energy sector may find specific geo data from dedicated drones hugely useful in monitoring and planning its wind farm maintenance programme. Central to all of this is the quality and accessibility of the data.
The continued rise in the development of ‘smart devices’ and in purchases of connected fitness trackers, watches, speakers, TVs, cars, alarms, kettles, fridges and more means the Internet of Things (IoT) is becoming a ubiquitous reality. Each of these devices creates more data, and most consumer or commercial applications need that data analysed faster and presented so that it takes minimal effort to interpret. So the IoT means the need for easily accessible, well-managed data will only accelerate.
It is time the industry woke up to the practical needs of those at the front-end of this dynamic industry. Data owners need to ensure the quality of their data and ensure it ‘does what it says on the tin’.
To date, there have been two main offerings on the market for closing this data gap. The older is the data provider or broker. For too long, such businesses have operated like the data equivalent of ‘oil barons’. High prices and demanding contracts have protected their income streams at the expense of transparency and flexibility for their data customers.
In more recent years, the rise of ‘open data’ has presented a second option for some data science work. A move by a number of major governments to make more of their data openly available online has created huge repositories. But, a bit like the old jibe about data warehouses becoming data graveyards, many are not easy to navigate. Given the sheer scale of the data uploaded to these sites (often with insufficient metadata), finding what you want, when you need it, can be very time-consuming.
There needs to be more flexibility in the options for buying data, and more ability for data scientists to discover and access just the ‘slice’ of data they need. Let’s simplify the process. It makes sense to invest resources in generating the insights that inform decisions rather than in searching for the relevant data in the first place. Indeed, this is driving increasing dialogue in the industry on the growing need for data engineers alongside data scientists.
If the barriers to accessing that data are broken down, the opportunities are endless and, crucially, likely to be achieved more quickly.
Steve Coates, CEO, Brainnwave
Image Credit: Alexskopje / Shutterstock