British libraries to archive all UK web content

Six libraries, including the British Library, will archive all UK web content in a wide-ranging attempt to document the country’s online presence.

The archive, which is set to kick off tomorrow, will include tweets, Facebook status updates and an estimated one billion webpages spread out across 4.8 million websites. The project is an extension of the UK Web Archive, which was launched in 2004 but has been slow in collecting online data.

An automatic web crawler will be deployed to capture the data, with most of the websites published in the UK expected to be snapped once a year. More prolific websites, like those belonging to newspapers or magazines, will be archived as often as daily.

"Stuff out there on the Web is ephemeral," said Lucie Burgess, the British Library's head of content strategy. "The average life of a web page is only 75 days, because websites change, the contents get taken down.

"If we don't capture this material, a critical piece of the jigsaw puzzle of our understanding of the 21st century will be lost."

Once captured, UK online content will be preserved on various servers, with file formats updated as necessary over the coming decades, the library has said. In addition to the British Library, the National Libraries of Scotland and Wales, the Bodleian Libraries in Oxford, the University Library, Cambridge, and the Library of Trinity College, Dublin, will participate in the scheme.

The Internet Archive, a “non-profit digital library”, offers a similar but much larger service, having stored 281 billion webpages in the nearly two decades since 1996. On the social media side, the US Library of Congress records all tweets, though the institution is still grappling with ideas on how to make best use of the data.