In this age of big data, it’s more important than ever not simply to gather data, but to interpret it too, providing insights that can help change how people work or live for the better. That’s something I’ve always tried to instill in my work, and it’s what David McCandless, the best-selling author of Information is Beautiful, made clear at his recent talk on big data and data visualisation at the London Science Museum.
I first became a fan of David after his 2010 TED talk, and was recently reminded of his work when I rediscovered his book on my shelf. Flicking through the pages and admiring how David summarises a whole dataset in one page led me to think about our own challenges at G-Research. Our vast and disparate datasets appear to grow exponentially each year, and so does our need to present and visualise this data – not only to researchers, but to the whole business. The obvious next step was to ask David to come and talk to us about his work and experience as a data journalist.
‘Data is the new oil’ is a phrase that we’ve all heard. David proposed a slight alteration to the expression: ‘data is the new soil’. “It’s a new form of material and matter that you can dig through and get your hands dirty in,” he said, and from it you can get insights to bloom. Data visualisations are a way of doing just that, and they do so in a vast array of forms. Over the course of an hour, David covered everything from galaxies to dog breeds to military spending.
The ubiquity of data across every topic is no surprise, of course. As I mentioned, it’s something my colleagues and I encounter every day at work. But when it comes to datasets, visualisations can tease out patterns and provide insights that wouldn’t necessarily be obvious when confronted by the raw numbers.
Information security is one such area. On the most basic level, think about passwords. They protect our valuable information, but by visualising information on the passwords themselves, we can see that we’re not as careful with them as we perhaps should be.
At his talk, David brought up a heatmap visualisation of PINs made by Nick Berry, a data scientist at Facebook. The lighter the colour, the more common the four-digit number. In an instant, obvious patterns emerged from the collation of thousands and thousands of PINs – patterns that would’ve been a lot harder to tease out from the raw data alone.
First, there was a block of light colour stretching up the y axis to 12 and along the x axis to around 30. The reason, David explained, was simple: people like to go for a birth date and month combination. Whatever day you were born, your PIN would fall in that block. With the whole range of numbers on display, what’s clear is just how small (and therefore how much more guessable) that self-limiting range is.
The same was true of the diagonal stripe stretching up the graph – representing 1010, 1111, 1212 and so on – and of the lone, pale horizontal stripe towards the end of the x axis, about a fifth of the way up the y axis. Or, to be more precise, the stripe stretching from 1950 to 1999: birth years for the vast majority of credit card users.
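Those heatmap regions translate into a handful of simple rules. As a rough illustration – the regions come from Berry’s analysis, but this checker is my own sketch, not his code – a few lines of Python can flag a PIN that falls into one of the guessable zones described above:

```python
def pin_flags(pin: str) -> list[str]:
    """Flag a four-digit PIN that falls into the guessable regions
    visible on the heatmap. Illustrative rules only, not Berry's code."""
    assert len(pin) == 4 and pin.isdigit()
    first, second = int(pin[:2]), int(pin[2:])
    flags = []
    if 1 <= first <= 12 and 1 <= second <= 31:
        flags.append("month/day")      # the birthday block
    if pin[:2] == pin[2:]:
        flags.append("repeated pair")  # the diagonal: 1010, 1111, 1212...
    if 1950 <= int(pin) <= 1999:
        flags.append("birth year")     # the 1950-1999 stripe
    return flags

print(pin_flags("1212"))  # ['month/day', 'repeated pair']
print(pin_flags("1968"))  # ['birth year']
print(pin_flags("8093"))  # []
```

A PIN that comes back with an empty list avoids these particular patterns, though of course that alone doesn’t make it strong.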
On paper, these seem like obvious pitfalls. But what the data shows is that they’re not obvious to the vast majority of people. When we choose a PIN, the priority is often that we can remember it, not that others will fail to guess it. If everyone reading this article checked their PIN right now, chances are that most would fall prey to the patterns above.
These kinds of insights can help us think about the information security problems we confront in our own industry in new ways. Data visualisations provide a means of distilling enormous or complex quantities of data into a few key, actionable points. It’s something I am constantly aspiring to achieve in my work, as I believe it’s crucial to derive real-world applications from data. Otherwise, why have it?
Many companies, however, fall down on this aspect when it comes to making the most of big data. “Big data is a noun,” David said, “But it’s also a verb. ‘To big data’.” He outlined the steps in this process: first, gathering and handling the data; then, structuring and examining it; and finally, discovering insights and delivering them, either within your own organisation or to the world at large.
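Those three steps can be sketched in miniature. The dataset, field names and question below are my own illustration, not anything from the talk, but they show the shape of the process: gather raw records, structure them, then ask a sharp question to surface an insight.

```python
from collections import Counter

# Step 1: gathered raw data (illustrative log lines, not real data).
raw = [
    "2024-01-03,login,alice",
    "2024-01-03,login,bob",
    "2024-01-04,failed_login,alice",
    "2024-01-04,failed_login,alice",
]

# Step 2: structure and examine it.
structured = [dict(zip(("date", "event", "user"), line.split(","))) for line in raw]

# Step 3: ask a sharp question of it -- which account fails to log in most often?
failures = Counter(r["user"] for r in structured if r["event"] == "failed_login")
insight = failures.most_common(1)[0]
print(insight)  # ('alice', 2)
```

The point, as David put it, is that the value appears only at the third step: the first two on their own are just accumulation.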
As I said, this last stage is the most important, and the stage that I’m always pushing for at G-Research. But David warned there could be a tendency in some companies not to embark on this final straight. It’s not enough to simply accumulate data, he said – you have to ask a question of it too. “And the smarter and more acute the question, the better the results.”
That wasn’t the only part of the talk that clicked well with the quantitative research industry. Another was David’s examination of the types of data available to be studied and utilised: not just data – the soil – but information and knowledge as well. These are often presented simply as a pyramid with data at the base, but David suggested a more nuanced breakdown of these substances.
“First, you start with the granular data, mining it and capturing it and bringing it together into structured data,” he began, noting that structured data was found in the ubiquitous spreadsheets and databases that can be found in every nook and cranny of our industry. “Once you start to identify and filter that structured data, you can create something more communicable: information.”
The internet has made it even easier to create another substance: linked information. Links and hypertexts are two ways that one chunk of information is tied to another, and then packed into compartments of knowledge. For David, “knowledge is cellular, contained. In science, if you’re an expert in one field, it can still be close to heresy to make a comment in another field. They’re sealed off.”
But you can link up these cells of knowledge, to create interconnected knowledge. This is what you use to predict and model, and this is what I have to use in my own role. It’s also something that more and more companies are going to need to embrace in order to make the most out of the limitless potential of big data. The process of turning data into interconnected knowledge is not an easy one, and nobody gets it right 100% of the time. Without it, though, the data will remain inert, and fundamentally not that useful.
At one point during the talk, David said: “We live in a connected world now. We can no longer take a single data point and claim truth or get upset. We have to contextualise it and build it into a network of meaning, so we can see a number in its particular context and understand it more.”
That’s exactly where the future of big data lies. In his TED talks, which have accumulated millions of views, Swedish statistician Hans Rosling would talk of the need to “let your dataset change your mindset”. Like David and me, he believed in tying data and design together. Big data isn’t just about hoarding as much as you can – it’s about using what you have to cast fresh light on the real world, whether that’s passwords, financial markets or the economic development of countries around the globe.
Matt Barsby, Executive Director and Software Engineering Manager, G-Research
Image Credit: Wright Studio / Shutterstock