Solving the big data problem

If any single concept could characterise the beginning of the 21st century, a major contender would be the exponential growth in data. With the rise of the Internet, and in particular the arrival of user-generated content and crowd-sourcing applications like Wikipedia, we're now all busying ourselves with the creation of new data on an hourly basis. Eric Schmidt, Google's then CEO, estimated that there were already five million terabytes of data on the Internet, and the total is growing at a phenomenal rate.

But all this data is currently scattered across disparate systems, databases and formats, and it's often hidden behind proprietary programming interfaces and structures too. The information you create about yourself on Facebook cannot easily be transferred to another social network, and the data on individual websites is organised in many different ways. Email can be a hugely valuable business information resource, but aside from being grouped into conversations by subject line or author, it is fundamentally unstructured. It's no wonder that so many of us simply delete emails, treating them as no less ephemeral than a conversation in the company hallway.

The benefits of bringing all this data together, and organising it properly, are potentially very great indeed, and not just for the security services or for those engaged in litigation. Being able to correlate, for example, footage from security cameras with data from smartphone messaging systems and social networks has clear advantages, but this is really just the beginning. Connecting information about the various parts of a holiday could be genuinely useful: imagine if your airline ticket, airport parking or taxi service, hotel reservation and car rental were all linked, so that you could keep track of them all in one place.

TripIt actually does some of this. You set up a trip, then simply forward your flight and hotel confirmation emails to the service, which parses them and adds them to the trip. TripIt knows who you are from your sending email address, and guesses the relevant trip from the dates. It will also do genuinely useful things such as provide a Google Maps route from the destination airport to your hotel. But only some airline and hotel confirmation formats are supported, and car hire and airport parking aren't supported at all. Imagine, though, if this information were linked even more directly into services such as Google Maps trip planning, or estimates of the time needed for check-in. Sensible suggestions could then be made about when to book extra services, and when to notify those services if you're going to be delayed.
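
As a rough sketch of the sort of matching this implies, the Python snippet below (with entirely invented trips and bookings) assigns a parsed confirmation to whichever existing trip's dates cover it; TripIt's real logic is, of course, considerably more sophisticated.

```python
# A minimal sketch of matching a parsed booking to a trip by date range.
# The trips, dates and booking details are entirely made up.
from datetime import date

trips = [
    {"name": "Berlin conference", "start": date(2012, 5, 14), "end": date(2012, 5, 18)},
    {"name": "Family holiday",    "start": date(2012, 8, 2),  "end": date(2012, 8, 16)},
]

# Details pulled out of a forwarded hotel confirmation email
booking = {"hotel": "Example Hotel Berlin", "check_in": date(2012, 5, 14)}

def guess_trip(booking, trips):
    """Return the first trip whose date range contains the booking's check-in date."""
    for trip in trips:
        if trip["start"] <= booking["check_in"] <= trip["end"]:
            return trip
    return None

trip = guess_trip(booking, trips)
print(f"Added {booking['hotel']} to trip: {trip['name'] if trip else 'no match'}")
```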

TripIt will take your flight and hotel confirmation messages, weave them into an itinerary, and even make route suggestions between your destination airport and accommodation.

In other words, technologies that bring data together could have a huge amount to offer in the coming decade. One contender that has been around for some time is the Resource Description Framework (RDF), a generalised metadata model developed by the World Wide Web Consortium to classify information. It was inspired by the Meta Content Framework project developed by Ramanathan V. Guha at Apple and Netscape, with contributions from Dublin Core and the Platform for Internet Content Selection.

Essentially, RDF describes data, such as the content within a Web page, as simple machine-readable statements, making it far easier to categorise. For example, when a person is mentioned in an article, their name can be explicitly marked up as a name and their job title as a title. The markup can also state that they are a person, and not a company or a brand of ice cream. This means other data about them can be made readily available: their email address, if publicly available, their Wikipedia entry, and links to other sites where the same person (or at least someone with the same name) is mentioned. All of this could, of course, be provided dynamically by parsing a document on the fly, but that inevitably leads to errors, such as suggesting the Morris motor company, or a quaint form of traditional folk dancing, when the name in the article actually refers to a seasoned technology journalist.
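
As a rough illustration of the idea, the Python sketch below uses the rdflib library to describe a made-up person as a set of RDF statements; every name, address and URI in it is invented.

```python
# A minimal sketch of marking up a person as RDF statements (triples) using the
# rdflib Python library. The person, email address and URIs are entirely made up.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF, RDF, RDFS

g = Graph()
EX = Namespace("http://example.org/vocab/")  # an illustrative custom vocabulary

# An identifier for the person mentioned in the article
person = URIRef("http://example.org/people/jane-morris")

# State that this resource is a person, not a company or a brand of ice cream
g.add((person, RDF.type, FOAF.Person))
g.add((person, FOAF.name, Literal("Jane Morris")))
g.add((person, EX.jobTitle, Literal("Technology journalist")))
g.add((person, FOAF.mbox, URIRef("mailto:jane.morris@example.org")))
g.add((person, RDFS.seeAlso, URIRef("https://en.wikipedia.org/wiki/Jane_Morris_(journalist)")))

# Serialise the statements in Turtle, one common RDF syntax
print(g.serialize(format="turtle"))
```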

You probably already use a format that owes its heritage to RDF every day. The humble RSS feed has standardised the way news articles are structured, making it possible to integrate them into a wide variety of apps and to display news from disparate sources on a single page. RSS originally stood for RDF Site Summary, although Netscape later simplified it as Rich Site Summary, and RSS 2.0 re-expanded the acronym as Really Simple Syndication. RSS 1.0 feeds are formatted in RDF, while RSS 2.0 (which isn't actually the successor to RSS 1.0, but a simplified alternative development strand) is plain XML. If you look at the raw code of an RSS feed, it breaks information down into standard tagged sections such as author, title, category, channel, description and publication date, so feed readers can display just the information needed for a particular view, for example a list of titles or the full articles.
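
To make that structure concrete, here's a minimal sketch of pulling those tagged sections out of an RSS 2.0 feed using nothing but Python's standard library; the feed itself is invented for illustration.

```python
# A minimal sketch of reading the tagged sections of an RSS 2.0 feed with
# Python's standard library. The feed content here is a made-up example.
import xml.etree.ElementTree as ET

rss = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Tech News</title>
    <description>A made-up feed for illustration</description>
    <item>
      <title>Big data keeps getting bigger</title>
      <author>reporter@example.org</author>
      <category>Data</category>
      <description>Structured feeds make articles easy to repurpose.</description>
      <pubDate>Mon, 02 Apr 2012 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

channel = ET.fromstring(rss).find("channel")
print("Feed:", channel.findtext("title"))

# A reader can pick out just the fields it needs for a given view,
# for example a list of publication dates and titles.
for item in channel.findall("item"):
    print(item.findtext("pubDate"), "-", item.findtext("title"))
```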

Zemanta uses RDF tagging to provide related article suggestions for posts you create in blogging software.

The standard method for embedding RDF metadata within HTML and other XML-based documents is called RDFa (RDF in attributes), although there are alternatives such as eRDF. RDFa lets you declare a namespace (the vocabulary of categories you will be using), and then tag content within a page with terms from that vocabulary. A service that makes extensive use of RDFa data sources is Zemanta, a blogging plug-in that suggests links and extra content based on a post's text. It can be added to popular blogging platforms such as WordPress to provide these facilities as you write a new post, or edit an old one. Wikipedia, for example, already publishes its articles in structured, machine-readable form, which is part of the reason so many third-party sites have been able to rip those articles and present them as their own content. But it also allows the information to be extracted for other purposes: Zemanta will suggest relevant Wikipedia links, along with associated Wikipedia images carrying Creative Commons licences that you can use in your own articles without fear of copyright infringement.
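
As a rough sketch of what that tagging looks like, the fragment below declares a namespace prefix and marks up an invented person with RDFa attributes; the tuples underneath spell out the statements an RDFa processor would extract from it.

```python
# A minimal sketch of RDFa: a namespace prefix is declared on the markup, and
# individual pieces of content are tagged with properties from that vocabulary.
# The page fragment and the person it describes are purely illustrative.
html_with_rdfa = """
<div prefix="foaf: http://xmlns.com/foaf/0.1/"
     about="http://example.org/people/jane-morris"
     typeof="foaf:Person">
  <span property="foaf:name">Jane Morris</span>,
  <span property="foaf:title">technology journalist</span>
</div>
"""

# The subject-predicate-object statements an RDFa processor would pull out of
# that markup, written here as plain tuples for readability.
extracted = [
    ("http://example.org/people/jane-morris", "rdf:type", "foaf:Person"),
    ("http://example.org/people/jane-morris", "foaf:name", "Jane Morris"),
    ("http://example.org/people/jane-morris", "foaf:title", "technology journalist"),
]

for subject, predicate, obj in extracted:
    print(subject, predicate, obj)
```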

Top Down versus Bottom Up

However, while providing appropriate categorisation as you create new Web content is the clearest approach (called 'bottom up'), it's an extra layer of work, and there is also a huge amount of pre-existing data that wasn't produced with this kind of structure in mind. So there are numerous opportunities for companies that provide systems for finding structure in huge, unstructured data sets (called 'top down'), and those opportunities are beginning to be taken.

The UK's DataSift (datasift.com) has carved out a very successful niche for itself as the premier analyst of Twitter feeds. Among other things, DataSift's services have been used by The Guardian for its incredible visualisations of rumours spreading across Twitter, such as during last summer's riots. DataSift offers a searchable archive of Tweets dating back to January 2010, and ingests around 250 million Tweets a day. Each Tweet can be enriched with location information, a measure of the author's social media influence based on Klout, and sentiment analysis indicating whether the Tweet was positive or negative.

The Guardian makes very inventive use of Twitter data from DataSift to produce these visualisations of rumours spreading across the social networking service through time.

Australian company Nuix, on the other hand, has created a sophisticated system for ingesting emails in common formats, along with other file types, primarily for forensic purposes. Clients are as diverse as the Royal Thai Police and various global offices of KPMG. These services make dealing with court cases and public accountability much easier, because a holistic picture of communications can be built up. That picture can then be used to apportion culpability, for example to establish whether a company director had been made aware of nefarious practices within a news publication they run. This application is usually called electronic discovery, or eDiscovery.

Kaggle has even crowdsourced data modelling and turned it into a competition, although this time with a predictive edge. A host provides data and a description of a problem, plus prize money for competitors to win. Participants then submit models, which are scored for their predictive accuracy and placed on a Kaggle leaderboard until a winner is announced. Competitions have furthered medical research and traffic forecasting, as well as forming the basis of numerous academic papers.
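
As a simplified sketch of how such a competition is scored, the Python snippet below compares two invented submissions against a host's held-out answers and ranks them on a leaderboard; real competitions use far larger data sets and a range of accuracy metrics.

```python
# A minimal sketch of competition-style scoring: each entrant's predictions are
# compared with the host's hidden test answers, and the resulting accuracy
# decides the leaderboard order. All data and team names here are made up.

# The host's hidden test labels (never shown to competitors)
actual = {"case-1": 1, "case-2": 0, "case-3": 1, "case-4": 1, "case-5": 0}

# Two hypothetical submissions: predicted labels for the same cases
submissions = {
    "team_alpha": {"case-1": 1, "case-2": 0, "case-3": 0, "case-4": 1, "case-5": 0},
    "team_beta":  {"case-1": 1, "case-2": 1, "case-3": 1, "case-4": 0, "case-5": 0},
}

def accuracy(predictions):
    """Fraction of test cases predicted correctly."""
    correct = sum(predictions[case] == label for case, label in actual.items())
    return correct / len(actual)

# Rank the teams by score, best first, to build the leaderboard
leaderboard = sorted(submissions, key=lambda team: accuracy(submissions[team]), reverse=True)
for rank, team in enumerate(leaderboard, start=1):
    print(rank, team, f"{accuracy(submissions[team]):.2f}")
```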

Kaggle has made big data analysis into a competition, with prizes for coming up with the most accurate model for a data set.

There is a very direct link between these data mining services and artificial intelligence research, which is why the third Web era, or Web 3.0, is also regularly referred to as the Semantic Web. However, it must be pointed out that in most cases the organisational work being performed is there to make data more readable by machines, not by human beings directly. Having databases that can talk to each other, or be combined easily, is a step towards making computers more aware of context, which is a key attribute of human intelligence. But it's not, on its own, going to turn the Internet into something out of an early William Gibson novel.

Nevertheless, it's clear that the sorting and categorisation of the large volumes of data surrounding our digital lives will define the next couple of decades. Successful products will be created to sift through this data and provide novel ways of combining it. Security services, and possibly jealous spouses, will be able to prove that we weren't doing what we said we were doing on Tuesday night. Although some of the implications are potentially quite scary for civil liberties, there's no denying that the era of Big Data, with Web 3.0 at its centre, is already here.