Peter Murray-Rust talk on eThesis: webcast is available

Data-driven science and Digital Repositories

Monday, August 6th, 2007, 10:00am, NewMedia Classroom

by Peter Murray-Rust

Unilever Centre for Molecular Sciences Informatics, Department of Chemistry, University of Cambridge, UK

UPDATES: Video is available (9/5/2007)

For the best playback experience, viewers should be on a PC with Windows Media Player installed. More instructions to view the webcast.

For several years there has been excitement about the potential of the “data deluge”, in which much science will be practised not by doing experiments but by reusing the information we already have. Although in some subjects, such as particle physics and parts of bioscience, this is starting to happen, in many others, such as chemistry and materials science, there is little sign of an impending deluge. The data have been collected but they are not accessible.

We have shown that most raw data are never communicated outside the scientists’ laboratories and that they rapidly decay. In crystallography and analytical chemistry, between 80% and 99% of carefully collected data end up on a CD-ROM which, in a few years’ time, will be unreadable. We shall present a vision in which these data can be collected and preserved. The challenges are technical and semantic but above all social: we have to change the mindset of scientists to “preserve and share”.

We have developed a system, SPECTRa, for capturing the raw data in chemistry departments, first into an embargo repository and then into the main Institutional Repository (IR). The software is Open Source and we are seeking collaborators.

The current publishing system is often a serious impediment to sharing data. It not only emphasises “full-text” over data but also puts legal and procedural restrictions on data flow. To bypass this we have started to investigate PhD and Master's theses as a primary source of information, since these are solely under the control of academia and, in many cases, are the direct source of the data that lead to formal publications. Our system is devised to create rich metadata from theses, using RDF to represent it and SPARQL to query it.
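As a rough illustration of what querying thesis metadata as RDF-style triples might look like, here is a minimal sketch in plain Python. It is not the system described in the talk: a real deployment would use an RDF store and actual SPARQL. The thesis identifiers, the `ex:reportsCompound` predicate, the author names and the helper functions are all hypothetical, chosen only to show how a two-pattern query joins on a shared variable.

```python
# Illustrative only: identifiers, predicates and values below are hypothetical.
# Each tuple is a (subject, predicate, object) triple, as in RDF.
TRIPLES = [
    ("thesis:1", "dc:creator", "A. Student"),
    ("thesis:1", "dc:title", "Synthesis of example compounds"),
    ("thesis:1", "ex:reportsCompound", "compound:42"),
    ("thesis:2", "dc:creator", "B. Student"),
    ("thesis:2", "ex:reportsCompound", "compound:42"),
    ("thesis:2", "ex:reportsCompound", "compound:7"),
]


def is_var(term):
    """Query variables are written with a leading '?', as in SPARQL."""
    return term.startswith("?")


def match(pattern, triple, bindings):
    """Return bindings extended by this triple, or None if it does not fit."""
    new = dict(bindings)
    for p, t in zip(pattern, triple):
        if is_var(p):
            # Bind the variable, or check it agrees with an earlier binding.
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new


def query(patterns, triples):
    """Evaluate a list of triple patterns as a conjunctive (join) query."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for triple in triples
            if (extended := match(pattern, triple, bindings)) is not None
        ]
    return solutions


# Roughly equivalent in spirit to the SPARQL query:
#   SELECT ?thesis ?author WHERE {
#     ?thesis ex:reportsCompound compound:42 .
#     ?thesis dc:creator ?author .
#   }
results = query(
    [
        ("?thesis", "ex:reportsCompound", "compound:42"),
        ("?thesis", "dc:creator", "?author"),
    ],
    TRIPLES,
)
authors = sorted(b["?author"] for b in results)
print(authors)
```

The point of the sketch is the join: both patterns share `?thesis`, so the query finds every thesis that reports a given compound and then looks up its author, which is the kind of cross-thesis question that full-text alone answers poorly.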

Hopefully a number of demonstrations will be given.