There have been a great many updates on the LODLAM front (Linked Open Data in Libraries, Archives, & Museums). I haven’t blogged about it as those life things continue to be kicking my a**. My health hasn’t been good blah blah blah. I highly encourage those of you with an ongoing interest to follow the goings on at the 2nd International LODLAM Summit. I attended the 1st, which was fabulous btw, and learned tons. Two years later projects are further along and new projects are being launched. Take a gander at the design patterns repository Richard Urban announced, for example.
Thanks for the inquiries into my well being. Rest assured, things will right themselves eventually. Or not.
If... no, when I return to writing here, you’ll know it’s better.
It’s Monday. It’s time for a metadata movie. There’s been a lot going on lately re: Library LOD. I’d have been posting on it but, well, you know how it goes. I’m excited about the developments. I’m itching to resume work on our own faculty linked names pilot project. I’m almost caught up with my post-leave in-box and should be returning to that soon. Meanwhile, grab some popcorn and enjoy.
The Maryland Library Association has posted links to all of the presentations from the “Technical Services on the Edge” program held in December, where I spoke about Linked Data in Libraries, Archives, and Museums. The copy of my slides here contains the speaker’s notes. It may prove more helpful than the slides-only version which I posted to SlideShare.
As promised, I’m finally getting around to posting about the sessions at DLF Forum which were particularly awesome. The Linked Data: Hands on How-To workshop afforded the opportunity to bring your own data and learn how to link-if-y it. It held the promise of helping me get the Caltech faculty names linked data pilot a bit further along. I didn’t get much further in the process due to some technical glitches. Yet the session was still successful in a couple of ways.
First, I had another “a-ha!!” moment in terms of how Linked Data works. All this time I’ve been preparing our faculty names data with an eye towards exposing it as Linked Data. I realize that this build-it-and-they-will-come approach is somewhat naive, but it’s a baby step in terms of getting ourselves up-to-speed on the process. What I didn’t fully grasp was that data exposed in this fashion is just an endpoint, an object or subject others can point at but not really do much with. If it’s just an endpoint, one can’t follow their nose and link on to more information on other servers. Our data will only be truly useful once it can be used to complete the full subject-predicate-object “sentence” of a triple.
In practical terms it means rather than just putting out a URI associated with a faculty name, we should expose the faculty name URI along with other identity links and relationship links. Let’s use our favorite Caltech faculty name as an example. We mint a URI for Richard Feynman, let’s say http://library.caltech.edu/authorities/feynman. We ensure the URI is “dereferenceable” by HTTP clients, which means the client receives an HTML or RDF/XML document in response to its query. Since we only have the name identifiers in our data set, that’s all the client will receive in the returned HTML or RDF/XML document. In this case we know Feynman has a NAF identifier http://id.loc.gov/authorities/names/n50002729.htm.
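To make “dereferenceable” a little more concrete, here is a minimal sketch of the decision a server might make when a client asks for the Feynman URI: look at the HTTP Accept header and decide whether to send back HTML or RDF/XML. The function name, the list of supported formats, and the fallback to HTML are all my own assumptions for illustration, not a description of any system we run.

```python
def choose_representation(accept_header: str) -> str:
    """Pick a media type for a name URI from the client's Accept header.

    Hypothetical helper: the supported list and the HTML fallback are
    assumptions made for this sketch.
    """
    supported = ["text/html", "application/rdf+xml"]

    # Parse entries such as "text/html;q=0.8" into (quality, media_type) pairs.
    candidates = []
    for part in accept_header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        quality = 1.0  # HTTP default when no q-value is given
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        candidates.append((quality, media_type))

    # Highest quality first; the sort is stable, so ties keep the client's order.
    candidates.sort(key=lambda pair: pair[0], reverse=True)

    for quality, media_type in candidates:
        if quality <= 0:
            continue
        if media_type in supported:
            return media_type
        if media_type in ("*/*", ""):
            return supported[0]

    # Nothing matched. A stricter server might answer 406 Not Acceptable;
    # this sketch simply falls back to HTML.
    return supported[0]


if __name__ == "__main__":
    print(choose_representation("application/rdf+xml"))
    print(choose_representation("text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"))
```

A browser would typically end up with the HTML page, while a Linked Data client asking for application/rdf+xml would get the machine-readable version of the same record.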
The entity working with this exposed data would have to do all the work to create links if all we exposed was the NAF URI (and really, why wouldn’t somebody just go directly to the NAF?). Our data on this end would be much richer if we could make a few statements about it. We need to expose triples. As an example, we could create a simple triple relating our URI to the NAF URI. We connect the URIs with the OWL web ontology “same as” concept. The triple would look like this:
<http://library.caltech.edu/authorities/feynman.html> <http://www.w3.org/2002/07/owl#sameAs> <http://id.loc.gov/authorities/names/n50002729.htm>
The data we’re exposing is now looking much more like Linked Data. We could go even further and start writing triples like Feynman is the creator of “Surely You’re Joking, Mr. Feynman” using a URI for the Dublin Core vocabulary term creator and a URI for the bibliographic work. The more triples we garner, the more the machines can glean about Feynman. It was a breakthrough for me to figure out that we need full triples in our project rather than simply exposing a name URI.
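Here is a rough sketch of what a couple of those triples might look like if we wrote them out by hand in N-Triples, one statement per line. It uses only plain string formatting, no RDF library. Several things are assumptions on my part: the work URI under example.org is made up, the NAF URI is written without the .htm suffix on the assumption that the extension-less form identifies the authority itself, and I picked the Dublin Core terms namespace for creator.

```python
# Predicate URIs. owl:sameAs is the one used in the post; dcterms:creator
# is my assumed choice for the Dublin Core "creator" term.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
DC_CREATOR = "http://purl.org/dc/terms/creator"

# Subject and object URIs. The Caltech URI is the illustrative one minted
# in the post; the work URI is entirely hypothetical.
FEYNMAN = "http://library.caltech.edu/authorities/feynman"
FEYNMAN_NAF = "http://id.loc.gov/authorities/names/n50002729"
SURELY_JOKING = "http://example.org/works/surely-youre-joking"


def n_triple(subject: str, predicate: str, obj: str) -> str:
    """Serialize three URIs as one N-Triples statement."""
    return f"<{subject}> <{predicate}> <{obj}> ."


triples = [
    # Our Feynman is the same entity as the NAF's Feynman.
    n_triple(FEYNMAN, OWL_SAME_AS, FEYNMAN_NAF),
    # The book has Feynman as its creator (the work is the subject).
    n_triple(SURELY_JOKING, DC_CREATOR, FEYNMAN),
]

if __name__ == "__main__":
    for line in triples:
        print(line)
```

Note the direction of the second statement: with a creator predicate the work is the subject and the person is the object, so “Feynman is the creator of” gets flipped around when written as a triple.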
The second way the hands-on workshop was a success for me was that I had the opportunity to play with Google Refine. Google Refine is spreadsheets on steroids. It allows you to manipulate delimited data in more ways than Excel. Free Your Metadata has some videos which explain the process for using Refine with an RDF extension to prepare subject heading linked data (yes, I’m sneaking a Monday Morning Metadata Movie into this post). I was hoping to take my spreadsheet data of names and LCCN and get either VIAF or NAF URIs. That would get our faculty names linked data project to the point of having one third of a triple.
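For the NAF half of that wish, a few lines of script may be enough even without Refine. The sketch below assumes a tab-delimited file with the name heading in the first column and the LCCN in the second, and it assumes NAF URIs are simply the id.loc.gov names base followed by the LCCN with its spaces squeezed out. The sample row and the function names are illustrative, and the normalization is a simplification that only strips whitespace.

```python
import csv
import io

# Assumed base for NAF name authority URIs.
NAF_BASE = "http://id.loc.gov/authorities/names/"


def normalize_lccn(raw: str) -> str:
    """Collapse a MARC-style LCCN such as 'n  50002729 ' to 'n50002729'.

    Simplified: only whitespace is removed and the prefix is lowercased.
    """
    return "".join(raw.split()).lower()


def naf_uris(tsv_text: str) -> list:
    """Return (heading, NAF URI) pairs from tab-delimited heading/LCCN rows."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    results = []
    for row in reader:
        # Skip blank lines and rows with no LCCN (names not yet in the NAF).
        if len(row) < 2 or not row[1].strip():
            continue
        heading = row[0].strip()
        results.append((heading, NAF_BASE + normalize_lccn(row[1])))
    return results


# Illustrative sample data, not our actual file.
sample = (
    "Feynman, Richard P. (Richard Phillips), 1918-1988\tn  50002729\n"
    "Faculty, Not Yet Established\t\n"
)

if __name__ == "__main__":
    for heading, uri in naf_uris(sample):
        print(heading, uri, sep="\t")
```

Rows without an LCCN are skipped rather than guessed at, which matches where the project stands: names that are not yet in the NAF have nothing to link to.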
Unfortunately, I could not get the RDF plug-in installed on my laptop. Some of my colleagues in the hands-on did manage to get it to work. We pulled in some very knowledgeable programmers to troubleshoot and their conclusion after about an hour of tinkering was that the plug-in was buggy. Back to the drawing board it seems.
There will be another Linked Data hands-on session at CODE4LIB 2012. I anticipate that it will be as useful as the DLF hands-on. I do plan on attending and I am keeping my fingers crossed that I can make some progress with our project. There is a great list of resources on the web page for the DLF session. There are other tools there besides Google Refine that I hope to play with before CODE4LIB. Plus there are links to other hands-on tutorials. Slowly but surely I’ll get my head wrapped around how to do Linked Data. I’m grateful the DLF forum furthered my understanding.
It’s a sad day in the library development world. Rurik Greenall, the kick-ass Linked Data developer at the Norwegian University of Science & Technology Library, has announced his intention to leave libraryland and work in industry where there’s more hope of doing great things with Linked Open Data. He writes that there is no real need for Linked Data in libraries due to the if-it-ain’t-broke-why-fix-it phenomenon.
He’s absolutely correct. I’ve said it before. There’s little reason for most academic libraries to expose traditional bibliographic information as linked data. There really isn’t any reason to use Linked Data within the context of how libraries currently operate. Our systems allow us to do the job of purchasing resources, making them searchable for our customers, and circulating them to people. In harsh economic times, why spend time/energy/money to change if things are working?
He’s also incorrect. There are reasons for librarians to do Linked Data. I suspect Rurik knows this and his tongue is implanted in his cheek due to frustration with the glacial pace of change in the Library systems world. Yes, there’s no reason to change if things are working. But things won’t always work the way they do now. We’re like candle makers after electricity has been harnessed. People still use candles but not as their sole source of light. The candle makers that are still in business pursued other avenues. Other use cases for candles besides “source of light” became prominent. Think of aromatherapy (scented candles), religious worship (votive candles), or decoration. It will be the same for library catalogs. People will always use them, but not as their main source of bibliographic descriptions. The traditional catalog data will be used in other ways. In my opinion, its future job will be as a source of local holdings and shared collection management Linked Data.
It’s quite telling that when Rurik asked, “what are the objectives of linked data in libraries” prior to the LOD-LAM summit, he heard the crickets chirping. The cataloging world has failed profoundly at understanding our raison d’être. I think we’ve tied ourselves too much to Panizzi’s & Lubetzky’s purpose of the catalog (explicating the differences between different expressions/manifestations of works) and lost sight of the purpose of providing a catalog in the first place: connecting people with information. Our work should be focused on assisting others in their information seeking & use rather than focused on managing local inventories. The FRBR user tasks (find, identify, select, obtain) don’t cover the full spectrum of information behavior in the 21st century. People want to analyze, synthesize, re-use, re-mix, highlight, compare, correlate, and so on and so forth. Linked Data is the enabling technology which will allow these new types of information behavior. The use case of libraries providing catalogs of descriptive bibliographic records for discrete objects is becoming increasingly marginal.
So I’ll propose an answer to Rurik’s question. The objective of doing Linked Data in libraries is to facilitate unforeseen modes of information use. How does this translate into new use cases for how libraries operate? Perhaps it means creating better systems for information seeking (we’d better hurry though. Google is kicking our ass at this…). Perhaps, as I believe, it means focusing more on helping our customers as producers of information rather than consumers. Putting legacy library bibliographic data into a Linked Data form is but one small first step in the process. Once it’s out there in Linked Data form, it’s more amenable to the analyzing, synthesizing, re-using, re-mixing, highlighting, comparing and correlating because we can now sic the machines on it. Putting legacy bibliographic data into Linked Data form is how we’re going to learn how to do Linked Data. Rurik is right that Linked Data in libraries will not work if this is all that we do. We need to take additional steps and figure out how to do Linked Data in a way that makes the most sense for our customers.
Rurik worked in the trenches to bring Linked Data into the library world. I’ve often referred to his work as I struggle, mightily, to teach myself how to expose our Linked Data and use the Linked Data exposed by others. The library world needs more people who can help librarians bridge the gap between how we currently do business and how we need to do business if we hope to keep our jobs. I begin to feel like we’re on the Titanic when these sailors jump ship. I will seek the life-boat and continue learning the skills I need to help my library’s customers with their information seeking & use.
For the past few months I’ve been writing about our pilot Linked Open Data project to expose name identifiers for Caltech’s current faculty. So far we’ve been working on creating/obtaining the initial data set. I’m very happy to report that we’ve now got 372/412 faculty names in the LC/NAF and, by extension, the VIAF. We expect to complete the set within the next month or so, give or take our other production responsibilities. Meanwhile, I’ve been messing with the metadata so I can figure out what the heck to do next.
We have the data in full MARC21 authority records. From that I’ve made a set of MADS records (thanks MarcEdit XSLT!). I also created a tab delimited .txt file of the name heading and LCCN.
According to the basic linked data publishing pattern, it can be as simple as publishing a web page. We are able to put out the structured data under an open license and in a non-proprietary format and call ourselves done. This is what you would call 3 star linked open data. We’d like to do a bit better than that. In order to achieve 4 star Linked Open Data we need to do stuff like mint URIs and make the data available in forms that are more readily machine-processable.
This is where I get stuck.
Fortunately, there will be a hands-on linked data workshop at the DLF Forum (#dlfforum) next week. I’m really looking forward to it. I’ve volunteered to give an overview of linked data to Maryland tech services librarians in December. Getting our data out there will provide some necessary street cred.
There are many calls for the library world to get their act together and expose the bibliographic data in their catalogs as linked data. But should they?
It only makes sense if your catalog has a lot of unique data about your unique local holdings and those records aren’t represented in WorldCat. If you’re in that boat, this blog post doesn’t apply. I would bet that most academic libraries do their original cataloging with OCLC tools and then add those records to their local catalog. If your stuff is in WorldCat then let OCLC do the work. They are doing a pilot to release 1 million of the top WorldCat records as linked data. Eventually, this is going to be broadened.
Of course, there are problems in simply waiting for OCLC to do it. The biggest is rights and licensing. OCLC could make linked bib data available to its membership. But opening that linked data to the world is another matter entirely. Roy Tennant, commenting on June 2 at JOHO the blog (aka David Weinberger’s blog), says:
This is a project that is still being formed. The intent of the project is to investigate how to best release bibliographic data as linked data that will provide an opportunity for us to get feedback from other practitioners about how well we’ve done it and how useful it might be. A part of this project will include considering the policy and licensing issues. There’s not yet a firm timetable, but as that is determined it will be shared with the community. We’re encouraged by the enthusiastic response folks have shown in what we’re doing. It will help influence the shape of what we ultimately do.
You may own your records. But if they’re aggregated in WorldCat, OCLC may or may not release them depending on what licensing/usage model they come up with.
The other problem, of course, is the waiting. Nobody knows how long it’s going to be before WorldCat is a linked data resource. If you’re gunning to get your library catalog out there in the linked data ether you’re not going to be satisfied with the time lag. If you’re in an organization with limited resources, however, waiting will be to your strategic advantage. Times are tough. Budgets are small. Cataloging departments are understaffed.
Why spend resources when bibliographic records as linked data will come?
Focus instead on authority records and on enhancing your repository. This is the metadata unique to your organization. This is where you add value for your customers. The value for others is a nice side-effect. Faculty, researchers, and newly hooded PhD graduates will require identity management tools to assist them in scholarly communication. Think beyond bibliometrics for their publications. Grant makers will want to track researchers. Universities will want to track research impact. The ground is being sown. Witness the NSF grant to U. Chicago and Harvard announced today, which will be used to research the impact of ORCID on science policy (i.e. the ability to put it into use for things like FastLane).
When you hear the calls for libraries to expose their ILS as linked data, consider how your users are getting bibliographic information. Most likely from the web. WorldCat feeds into Google. OpenLibrary, LibraryThing, and other sources of bib data abound. They’re probably going to your catalog for holdings, if they’re going there at all. Bibliographic linked data is a good thing. I just don’t think it’s realistic for the academic institutions sans lots of unique stuff in WorldCat to heed the call. I’d rather that the calls for academic libraries to participate in the linked data movement get broader. It’s not about freeing the ILS. It’s about pushing out metadata that exposes the work of the people in your organization, not the metadata that exposes the library materials available to the people in your organization.
I’m back from the International Linked Open Data in Libraries, Archives, and Museums Summit, which was held June 2-3, 2011, in San Francisco, CA, USA. My brain is still digesting all that I learned. I’ve posted my rough notes in case anybody else can find them useful. I thought the event was really well done. I learned about several LOD projects which might provide tools we can use here at Caltech. I’ve got to dig in a bit more detail. The organizers will be posting a summit report with all of the action items for next steps. Of particular interest to me: there will be some work on publishing citations as linked data and there will be some materials released which can assist with educating the LAM community about LOD. I’ll probably write about each aspect as more information comes to light and as I wrap my head around it.
Re: my post yesterday saying I was unsure if the VIAF text mining approach to incorporating Wikipedia links within their records was Linked Data. There’s a good little conversation over at the LOD-LAM blog which elucidated the difference for me. They say it better than I can, so go have a look-see. The money quotation: “Linked Data favors ‘factoids’ like date-and-place-of-birth, while statistical text-mining produces (at least in this case) distributions interpretable as ‘relationship strength’.”
There’s been some conversation lately about using Wikipedia in authority work. Jonathan Rochkind recently blogged about the potential of using Wikipedia Miner to add subject authority information to catalog records, reasoning that the context and linkages provided in a Wikipedia article could provide better topical relevance. Then somebody on CODE4LIB asked for help using author information in a catalog record to look up said author in Wikipedia on-the-fly. The various approaches suggested on the list have been interesting although there hasn’t been an optimal solution. Although I couldn’t necessarily code such an application myself, it’s good to know how a programmer could go about doing such a thing. What I did learn was that Wikipedia has a way of marking up names with author identifiers. The Template:Authority Control gives an example of how to do it.
I haven’t done much authoring or editing at Wikipedia, so the existence of the “template” is news to me. I think it’s pretty nifty, so I just had to blog it. The template gets me thinking. Perhaps we’ll be able to leverage our faculty names Linked Data pilot into some sort of mash-up with Wikipedia, pushing our author identifiers into that space or pulling Wikipedia info into our work. Our group continues to make progress on getting all our current faculty represented in the National Authority File, with an eye to exposing our set of authority records as Linked Data. We haven’t figured out yet precisely what we’re going to do with the Linked Data once we make it available. Build it and they will come is nice, but we need a demonstrable benefit (i.e. a cool project) to show the value of the Library’s author services.
VIAF already provides external links to Wikipedia and WorldCat Identities with its display of an author name. Ralph Levan explained how OCLC did it, in general fashion, in the CODE4LIB conversation. Near as I understand it, they do a data dump from Wikipedia, do some text mining, run their disambiguation algorithms over it, then add the Wikipedia page if they get a match. I don’t know if this computational approach is a Linked Data type of thing or not. I need to continue working my way through chapter 5 & chapter 6 of Heath & Bizer’s Linked Data book (LOD-LAM prep!). Nonetheless, it’s a good way of showing how connections can be built between an author identity tool and another data source to enrich the final product. I have a hazy vision of morphing the Open Library’s “one web page for every book ever published” into “one web page for every Caltech author.” More likely it will be “one web page tool for every Caltech author to incorporate into their personal web site,” given the extreme individualism and independence cherished within our institutional culture. But I digress. Yes. “One web page for every Caltech author” would at least give us the (metaphorical) space to build a killer app.