International Linked Open Data in Libraries Archives and Museums Summit
June 2-3, 2011 San Francisco, CA, USA
Other useful related resources:
- LOD-LAM Google Group
- Notes from the Moving Out of the Metadata Ghetto discussion
- Notes from Scaling Provenance discussion
- Bibliographic Ontology (Bibo)
- CiTO, the Citation Typing Ontology
- W3C Library Linked Data Incubator Group Draft Report
The summit was envisioned as a working meeting with interested stakeholders from the Libraries Archives and Museums (LAM) community. The goal? “Catalyze practical, actionable approaches to publishing Linked Open Data” (LOD). We were tasked to:
- Identify the tools and techniques for publishing and working with Linked Open Data.
- Draft precedents and policy for licensing and copyright considerations regarding the publishing of library, archive, and museum metadata.
- Publish definitions and promote use cases that will give LAM staff the tools they need to advocate for Linked Open Data in their institutions.
The event was successful to a degree. In my opinion, we succeeded in articulating use cases and barriers to LOD implementation but were less successful in moving the discussion forward to action items. That’s OK. Discussions will continue via the LOD-LAM Google Group, and action items should be articulated soon.
The meeting was run as an “un-conference” using the Open Space Technology meeting format. Participants begin by brainstorming the types of sessions they want to facilitate and/or attend. Those suggestions are posted to an open schedule board. Once the schedule fills up, the discussions commence. As with regular conferences, there were multiple interesting sessions in the same time slots, so choosing was difficult. Fortunately, each session will get written up and shared with the attendees. Also, organizers Jon Voss and Kris Carpenter Negulescu will be creating a summary with action items for follow-through.
The first day of the summit was for discussing issues; the second day was for following up with action items, next steps, and/or hands-on sessions for starting to do actual work. Session ideas on the first day included: creating use cases for end users of LOD, use cases in natural history, adding annotations to LOD, rights and licensing issues in publishing data as LOD, LOD citations, and educating the LAM community on LOD.
There were three session slots available for discussion on the first day. We used the final session to do “dork shorts” — two minute briefings on projects currently in-process.
The first session I attended was on educating LAM on LOD. I’m very interested in training metadata librarians on how to provide LOD. It was run by David Weinberger (yes, that David Weinberger) from the Harvard Library Innovation Lab. People ended up discussing the history of LD, and how to define namespaces. I am familiar with this already so didn’t feel the need to stay since they didn’t seem to be getting around to articulating needs for LAM training. If they got to it, I’d grab the info from the notes.
I moved to a session on “moving out of the metadata ghetto.” We discussed potential reasons why LAM metadata is currently siloed and thus marginalized. Karen Coyle told us that LC is considering setting up some sort of governance structure for the LAM community to maintain LD vocabularies, since there are issues with the authority (in the sense of authoritativeness) of the data and with how vocabs will persist over time. This engendered a discussion of what is needed to have LD vocabs persist, both as full sets and at the individual term level. We concluded that there is a need for contextual information in addition to vocab content. Ultimately, persistence is a human problem, and it depends on a community of practice to ensure their vocabs get described and used.
We discussed the notion of a minimal viable record for context. I suggested that it would require: who made the LD, when it was made, dates covered, topics/subjects (esp. the domain where the vocab was in use), and relationships to other LD/vocabs (who could use it, stuff you have used to make yours). We need to make the implicit knowledge of communities of practice explicit for other users of the LD. We also need a way to publicize the longevity of our LD.
How will people consume the LD? Link to it? Copy it to their own servers (people tend to do that now)? There is a need for people to market their data once they make it available: find somebody to use it, make a demonstrable project. We should re-use CKAN to propagate LD, and SKOS may be useful as a vocab of vocabs.
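The minimal viable record for context that I suggested in this session could be sketched as a small data structure. This is only an illustration: the class name, field names, and sample values are my own assumptions, not something the group agreed on.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class VocabContextRecord:
    """Hypothetical minimal context record for a Linked Data vocabulary."""
    creator: str            # who made the LD
    created: str            # when it was made (ISO date string)
    dates_covered: str      # period the vocabulary covers
    subjects: List[str]     # topics/subjects
    domain: str             # domain where the vocab is in use
    related_vocabs: List[str] = field(default_factory=list)  # who could use it
    derived_from: List[str] = field(default_factory=list)    # what it was built from

    def missing_fields(self) -> List[str]:
        """Return the names of required fields that are empty."""
        required = ("creator", "created", "dates_covered", "subjects", "domain")
        return [name for name in required if not getattr(self, name)]


def to_json(record: VocabContextRecord) -> str:
    """Serialize the record so it can be published alongside the vocabulary."""
    return json.dumps(asdict(record), indent=2)


# Sample values are invented for illustration.
example = VocabContextRecord(
    creator="Example Library",
    created="2011-06-02",
    dates_covered="1900-2000",
    subjects=["natural history"],
    domain="museums",
    derived_from=["http://example.org/vocab/source"],
)
print(to_json(example))
```

A check like `missing_fields()` is one way a community of practice could enforce the "minimal" part before a vocabulary is advertised to others.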
We concluded that more discussion is needed, and will follow-up via LOD-LAM email list and Google Doc.
I attended the discussion on making citations into LOD. The question we began with: is LOD a new way of citing? We discussed different types of citing behavior. There is an ontology for this: CiTO, the Citation Typing Ontology. There is uni-directionality in traditional citations. John Wilbanks told us how Creative Commons did some experimenting with extending trackback and pingback functions from Atom feeds to track citing behaviors. They found they needed to be careful, since those tools are spam vectors. They did some labeling and directed graph work with it, but only at the whiteboard stage. Stanford SULAIR is doing some experimentation with annotation networks.
We discussed problems of content versioning within citation: which version of a document are people using? This is likely to be confounded, since pre- and post-prints aren’t behind a paywall and thus are likely to be used more. The versioning issue also exists for data, in two ways: 1) superseded data and 2) complex data interactions (e.g., using gene sequence data with gene expression/variation data and patient data, merged together in a Bayesian model), where they have to normalize/rationalize the data. We have to capture the conversations and workflows that take place around the data (make a graph?). And where will OpenURLs fit in with that?
William Gunn described how Mendeley is working with “relatedness” as a web service; annotations are just the tip of the iceberg, in their opinion. There are implicit relationships between two “buckets” of content.
Ideas on how to move forward and get “something out the door”?
- Drop in a URL on a web page, with a simple pull-down menu: “I used this data set to … ” and use that info to create triples.
- Other use cases: digitized images with canonical URLs, e.g., how NYPL made their image URLs available as embed code for web pages
- Structured reading lists/syllabi are another way of measuring the impact of work. Dan Cohen did a Syllabus Finder via a Google SOAP API, but it’s frozen in time since Google has killed support for those types of APIs.
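The pull-down menu idea above can be sketched in a few lines. This is a minimal illustration, not anything built at the summit: the menu wording and the example.org URLs are my assumptions, while the property names (`usesDataFrom`, `citesAsDataSource`, `extends`) come from CiTO.

```python
# CiTO namespace; the properties below are defined in the Citation Typing Ontology.
CITO = "http://purl.org/spar/cito/"

# Hypothetical menu wording mapped to CiTO properties.
MENU = {
    "I used data from this data set": "usesDataFrom",
    "I cite this data set as my data source": "citesAsDataSource",
    "I extended this data set": "extends",
}


def make_triple(citing_url: str, menu_choice: str, cited_url: str) -> str:
    """Turn one menu selection into a single N-Triples statement."""
    if menu_choice not in MENU:
        raise ValueError(f"Unknown menu choice: {menu_choice!r}")
    predicate = CITO + MENU[menu_choice]
    return f"<{citing_url}> <{predicate}> <{cited_url}> ."


# Placeholder URLs for illustration only.
print(make_triple(
    "http://example.org/paper/1",
    "I extended this data set",
    "http://example.org/data/9",
))
```

Because the menu constrains the user to a fixed set of relationships, the widget never has to parse free text, which keeps the generated triples consistent.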
To do list if we’re to go forward:
- continue work on ping back/track back
- continue work on providing structured data as embed code for web developers
- persuading institutions to provide persistent URLs for their LD
- convince grant funding agencies to include “use/usability experts” on research teams in order to generate better web structures for cite-ability and increase use of the end results of research
- create user interfaces for “citation generation” which put the citation in a universal storage format but generate various display formats on the fly (APA, Chicago, etc.) [my comment: need to leverage work already done, since bib databases already provide that, we’ve got it in ePrints]
- create a like button/cite button research widget
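The “citation generation” item in the list above (one stored record, many display formats) can be sketched as follows. The record layout and sample values are my assumptions, and the two renderers are simplified approximations of APA and Chicago for a journal article, not complete implementations of either style.

```python
# One stored record; field names and values are invented for illustration.
RECORD = {
    "authors": [("Doe", "Jane")],  # (family name, given name)
    "year": 2011,
    "title": "Example title",
    "journal": "Example Journal",
    "volume": 4,
    "pages": "1-10",
}


def render_apa_like(rec: dict) -> str:
    """Simplified APA-style rendering: family name plus first initial."""
    authors = ", ".join(f"{family}, {given[0]}." for family, given in rec["authors"])
    return (f"{authors} ({rec['year']}). {rec['title']}. "
            f"{rec['journal']}, {rec['volume']}, {rec['pages']}.")


def render_chicago_like(rec: dict) -> str:
    """Simplified Chicago-style rendering: full given name, quoted title."""
    authors = ", ".join(f"{family}, {given}" for family, given in rec["authors"])
    return (f"{authors}. \"{rec['title']}.\" "
            f"{rec['journal']} {rec['volume']} ({rec['year']}): {rec['pages']}.")


FORMATTERS = {"apa": render_apa_like, "chicago": render_chicago_like}


def render(rec: dict, style: str) -> str:
    """Pick a display format on the fly; the stored record never changes."""
    return FORMATTERS[style](rec)


for style in FORMATTERS:
    print(render(RECORD, style))
```

Adding another display style means adding one function to `FORMATTERS`; nothing about the stored record has to change.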
UPDATE 06/08/2011 – John Wilbanks has posted his thoughts on the session
Our group got back late from lunch, so we missed 1/2 of the session. I attended a session on scaling provenance, led by Bradley Allen from Elsevier Labs. Allen articulated a provenance problem which I’ve heard Anne Gilliland describe for the past 10 years as “pearling.” Example: 50 million articles, generating 100 LD statements and 10 to the 20 provenance statements. You would need 10 to the 12 bytes of disk space to keep the provenance information of the 50 million articles @ 100 bytes per provenance statement.
We tried to figure out if this was really a linked data problem or if it was a real problem at all (with the decline in the price of disk storage). What is the impact? It has impacts on infrastructure, tools, workflow, and best practice. What does LAM offer by way of solutions?
What is the necessary functionality? We need appropriate technology for various use cases (Solr, triple stores, MySQL statements). Decouple delivery from storage. Account for levels of granularity (document vs. document fragments). Attribution for annotations.
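As a rough back-of-envelope check on the storage figure, here is a small calculation. The figures in my notes are loose, so the ratio of provenance statements to LD statements is left as a parameter; reading "100 LD statements" as per article, and one provenance statement per LD statement, are both my assumptions rather than something Allen stated.

```python
def provenance_bytes(articles: int, ld_per_article: int,
                     prov_per_ld: int, bytes_per_prov: int) -> int:
    """Estimate total bytes needed to store provenance statements."""
    return articles * ld_per_article * prov_per_ld * bytes_per_prov


ARTICLES = 50_000_000   # from the session
LD_PER_ARTICLE = 100    # assumed to be per article
BYTES_PER_PROV = 100    # from the session

# Assumption: one provenance statement per LD statement.
one_to_one = provenance_bytes(ARTICLES, LD_PER_ARTICLE, 1, BYTES_PER_PROV)
print(f"{one_to_one:.1e} bytes")  # 5.0e+11, the same order of magnitude as 10^12

# Assumption: twenty provenance statements per LD statement.
twenty_to_one = provenance_bytes(ARTICLES, LD_PER_ARTICLE, 20, BYTES_PER_PROV)
print(f"{twenty_to_one:.1e} bytes")  # 1.0e+13
```

Under the one-to-one assumption the total is about half a terabyte, which is why we wondered whether storage cost is really the problem.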
Conclusion: Get use requirements to drive level of description. Incrementally evolve to satisfy different use cases.
Final Session – “dork shorts”
Participants described their projects. Of interest (and for me to follow up with):
- Waterloo guy going to do a workshop on modeling complexity in organizational names/identifiers (but more towards historical organizations and humanities focused)
- Roy Tennant announced that OCLC will be releasing 1 million of the most-used bib records as open data.
- Aaron Rubenstein, Simmons, created proof-of-concept called “Archival” which is a web service for augmenting names in archival collections (from EAD) with FOAF data.
- intersect.org.au, from ands.org.au: a shared db for researchers with (location, name, scholarly vocabs) research data sets and journal. It’s funded infrastructure in Australia.
- Dean Krafft pitched VIVO, which will be ingesting info from grants and HR dbs at universities as an “authoritative data source.” The VIVO core ontology extends VIAF and BIBO.
- Perrin from Balboa Park Collaborative – harvesting digital content from various museums — need to follow up with her and find out which digital asset management system they’re using to combine metadata from disparate sources. They expect to have their combined web site up in a year.
- W3C LLG has released its draft report and needs comments by August. Next steps will be implementing W3C “community groups” around LLG, and anybody with an interest in it can join.
- The Internet Archive will soon be publishing OSS tools to extract metadata from web sites and combine it with information already in the IA such as images, audio, full text books, etc. (maybe good for us for archiving CIT web sites?).
- Mendeley APIs are available. They are working on citation, use, and attention metadata, which is made available via the API. These are collected in real time. Their collection of docs is growing by 7 million per month.
- explanation and pitch for CKAN — use it! find LD to integrate into your projects!
- LOCAH is exposing EAD as linked data and amalgamating archival description into their Copac (using the Talis data store); the data is in MODS, which they put into RDF.
- Adrian Pohl’s colleague is working on the “culture graph” project to match bibliographic data with equivalence information; it will be extended into authority and geographic data (really need to find out about this? what’s the guy’s name?)
- Elsevier Labs focusing on “smart content” which is their content enhanced with Linked Data. Doing it in 3 steps. 1. coming up with standard RDF 2. building a LD repository that fits into their publishing work flow. 3. Building up new content resources with SKOS vocabularies. They are open to integrating their stuff with 3rd parties (not clear that this will have the “o” of LOD)
- Norwegian University of Science and Technology: really one to watch. NTNU.edu libraries. Not using traditional library metadata at all; they want to “use nothing but RDF.” Pulling together authority data with DBpedia, archival EAD, timelines. SIMILE widget.
- MIT/MacKenzie Smith: PI for SIMILE, which is an OSS LOD publishing framework with no need for programmers. It can be used to layer an interface over institutional repositories, or to create bibliographies.
- BiSciCol project. John Deck, Berkeley Natural History Museum. Collecting data from field work in ecology/ecosystems data spaces.
Same type of round robin to develop sessions, but focus was to be on action items & next steps.
Vocab maintenance toolkit RDF data preservation framework. Expectations: resolvability, representation types. for the persistence/preservation of RDF documents (as special case). Understand meaning of this? Deprecation used but has scale issues. Set of best practices, include RDF to help with persistence and durability. PIDs trust to address link rot.
What is best practice for scholarly publishers? LCSH-DCMI-FOAF
What partnerships between memory institutions & innovators of new vocabs? eg. LCSH as LOC role are service provider.
Action: need an alignment primer for LOD-LAM communication, with best practices for finding the best value of mapping and alignment. Vocab owners publish RDF schemas. DCMI declares as part of domain. Will do a 1-day workshop at the DCMI conference.
- improve tools for vocabulary maintainers
- leverage LOCKSS and Memento
- create best practices document for declaring and mapping alignments.
- have meeting that pulls together all the stakeholders
I once again did two sessions. First I sat with the LOD education group, then moved to the discussion on the recently announced schema.org from Google, Bing & Yahoo. The education group developed some action items to help educate LAM on LOD.
Education action items
- create FAQ for librarians
- develop model to show cost savings from using LOD? cost out how much it takes to provide LOD?
- review benefits section of W3C Library Linked Data Incubator Group (LLG) draft report
- does the big 3 microdata approach provide enough information to find LAM materials? really geared for search engines and not for re-using data. fears that it will undermine the RDFa work already done