SCAPE & Hadoop

Libraries have to process a rapidly increasing amount of data as part of their day-to-day business, and computing tasks like file format migration, text recognition, or the validation of technical metadata require significant computing resources. Processing very large data sets is also one of the core challenges of the SCAPE project. This is exactly where Hadoop comes into play: for many of these data processing scenarios, frameworks like Apache Hadoop might become an essential part of the digital library’s ecosystem.


What is Hadoop?
The open-source Hadoop software framework allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale out from single servers to thousands of machines. Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
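To make the MapReduce idea more concrete, here is a minimal sketch in plain Python. It is not Hadoop's API and runs in a single process; it only imitates the three steps (map, shuffle, reduce) that a Hadoop job would spread across a cluster. The function names and the file paths are hypothetical and chosen purely for illustration: the toy job counts files per extension, the kind of small, independent unit of work that could be executed or re-executed on any node.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit one (extension, 1) pair for a single file path."""
    name = record.rsplit("/", 1)[-1]
    ext = name.rsplit(".", 1)[-1].lower() if "." in name else "(none)"
    yield ext, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values collected for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence within one process."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input: a handful of file paths from a digitisation batch.
files = ["scans/page1.tif", "scans/page2.TIF", "ocr/page1.xml", "README"]
counts = run_job(files)
print(counts)
```

Because each call to map_phase depends only on its own input record, the records could in principle be split across many machines; only the shuffle step needs to bring values with the same key together before they are reduced.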

Doug Cutting (who started the Hadoop project) named it after his son’s toy elephant, which also explains the logo.

The SCAPE platform provides an infrastructure specifically for executing digital preservation processes on large volumes of data. The Preservation Platform relies on Apache Hadoop as the underlying runtime environment and is used for large-scale testing and evaluation performed within the Testbeds and Planning and Watch sub-projects. One of the aims of the SCAPE Testbeds, where solutions for real-world institutional scenarios dealing with big data are developed, is to allow the automatic generation and execution of Taverna workflows, which must be migrated to a Hadoop-based execution system.

On 20 March 2013, Sven Schlarb and Clemens Neudecker presented the paper ‘The Elephant in the Library’ at the Hadoop Summit Europe in Amsterdam. An interview with them is available here, and the presentation can be viewed below:

 

You can also read more about the use of Hadoop in SCAPE in these recent blogs:

The Open Planets Foundation will organise a hackathon on Hadoop from 2 to 4 December at the Austrian National Library (ONB) in Vienna. Further details will be announced through http://www.openplanetsfoundation.org.
