The Inquirer

British Library unveils web archive

IBM’s Big Sheets powers it
Thu Feb 25 2010, 16:08

THE VENERABLE British Library has unveiled a UK web archive powered by IBM technology.

According to the British Library, the UK Web Archive gives users access to thousands of UK websites for generations of researchers and "demonstrates the importance and value of the nation's digital memory."

The UK Web Archive employed IBM to develop the ability to capture content and make it available. Specifically, it uses IBM's Big Sheets software to archive online content and to improve the ways that content can be accessed. Big Sheets also lets the Library's web archiving team extract, transform and annotate web pages, and analyse them statistically and algorithmically, vastly speeding up the archival process.

"A new technology prototype, Big Sheets will essentially do for big data what spreadsheets did for personal computing," says Rod Smith, vice president of emerging Internet technologies at IBM. "We are delighted to be working with the British Library to develop the advanced software that will enable users to explore the mass of unstructured web data, and extract useful information for research."

IBM Big Sheets is based on the Apache Hadoop Java framework, and promises to process large amounts of data "quickly and efficiently". µ

Comments
UK Website Archiving

To ensure that their website content is archived for the future, organisations can automatically save daily screenshots of all their web pages, which are then kept for compliance, legal or just general-interest purposes.

Cloud Testing, a UK company, has just launched its service Website-Archive, which is available at http://www.website-archive.com/. Because this is a self-selected archive of people's and companies' own sites, it gets round the copyright issue, or does it?

We get confirmation from customers that they are permitted to archive the content they ask us to, but in these days of multiple content streams, people often don't know what is actually being delivered via their website in terms of RSS feeds, Twitter searches and feeds, adverts, news feeds and so on.

posted by : Phil Smith, 26 February 2010
ye olde archive...

You fail to name the previous web archive, archive.org.

It's sometimes useful, but it's also not 100% complete, since it's often missing files such as images on archived sites. Will this new archive be better?

posted by : Silver, 25 February 2010
Management is going to take a big sheet when they get the pun.

A programmer once told me he wrote a "queueing analysis program" that he named "QuAP" (pronounced "kwap"); when management didn't catch the pun at first, he added "on-line analysis", making the acronym "Quapola".

posted by : bigger_luddite, 25 February 2010