DLF Linked Data Hands-on Session & 4M

Posted by laura on December 5, 2011 under Semantic web | 2 Comments to Read

As promised, I’m finally getting around to posting about the sessions at the DLF Forum which were particularly awesome.  The Linked Data: Hands-on How-To workshop afforded the opportunity to bring your own data and learn how to link-if-y it.  It held the promise of helping me get the Caltech faculty names linked data pilot a bit further along.  I didn’t get far in the process due to some technical glitches.  Yet the session was still successful in a couple of ways.

First, I had another “a-ha!!” moment in terms of how Linked Data works.  All this time I’ve been preparing our faculty names data with an eye towards exposing it as Linked Data.  I realize that this build-it-and-they-will-come approach is somewhat naive, but it’s a baby step in terms of getting ourselves up-to-speed on the process.  What I didn’t fully grasp was that data exposed in this fashion is just an endpoint, an object or subject others can point at but not really do much with.  If it’s just an endpoint, one can’t follow one’s nose and link on to more information on other servers.  Our data will only be truly useful once it can be used to complete the full subject-predicate-object “sentence” of a triple.

In practical terms it means rather than just putting out a URI associated with a faculty name, we should expose the faculty name URI along with other identity links and relationship links.  Let’s use our favorite Caltech faculty name as an example.  We mint a URI for Richard Feynman, let’s say http://library.caltech.edu/authorities/feynman.  We ensure the URI is “dereferenceable” by HTTP clients, which means the client receives an HTML or RDF/XML document in response to its query.  Since we only have the name identifiers in our data set, that’s all the client will receive in the returned HTML or RDF/XML document.  In this case we know Feynman has a NAF identifier: http://id.loc.gov/authorities/names/n50002729.

The entity working with this exposed data would have to do all the work to create links if all we exposed was the NAF URI (and really, why wouldn’t somebody just go directly to the NAF?).  Our data on this end would be much richer if we could make a few statements about it.  We need to expose triples.  As an example, we could create a simple triple relating our URI to the NAF URI.  We connect the URIs with the OWL Web Ontology Language “sameAs” property.  The triple would look like this:

<http://library.caltech.edu/authorities/feynman>  <http://www.w3.org/2002/07/owl#sameAs>  <http://id.loc.gov/authorities/names/n50002729> .

The data we’re exposing is now looking much more like Linked Data.  We could go even further and start writing triples like Feynman is the creator of “Surely You’re Joking, Mr. Feynman” using a URI for the Dublin Core vocabulary term creator and a URI for the bibliographic work.   The more triples we garner, the more the machines can glean about Feynman.      It was a breakthrough for me to figure out that we need  full triples in our project rather than simply exposing a name URI.
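To make the idea concrete, here’s a minimal sketch of how triples like these could be written out as N-Triples using plain Python. The work URI for “Surely You’re Joking, Mr. Feynman” is invented for illustration (we haven’t minted one); a real project would more likely use an RDF library such as rdflib.

```python
# Build a few subject-predicate-object triples around our minted
# Feynman URI and serialize them as N-Triples (one triple per line,
# each terminated by " .").
FEYNMAN = "http://library.caltech.edu/authorities/feynman"
NAF = "http://id.loc.gov/authorities/names/n50002729"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
DC_CREATOR = "http://purl.org/dc/terms/creator"
BOOK = "http://example.org/works/surely-youre-joking"  # hypothetical work URI

triples = [
    (FEYNMAN, OWL_SAME_AS, NAF),   # our URI is the same entity as the NAF one
    (BOOK, DC_CREATOR, FEYNMAN),   # Feynman is the creator of the work
]

def to_ntriples(triples):
    """Render (subject, predicate, object) URI triples in N-Triples syntax."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)

print(to_ntriples(triples))
```

Each line is a complete “sentence” a machine can follow its nose through: the more of them we publish, the more the machines can glean.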

The second way the hands-on workshop was a success for me was that I had the opportunity to play with Google Refine.  Google Refine is spreadsheets on steroids.  It allows you to manipulate delimited data in more ways than Excel does.  Free Your Metadata has some videos which explain the process of using Refine with an RDF extension to prepare subject heading linked data (yes, I’m sneaking a Monday Morning Metadata Movie into this post).  I was hoping to take my spreadsheet data of names and LCCNs and get either VIAF or NAF URIs.  That would get our faculty names linked data project to the point of having one-third of a triple.
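Even without the Refine plug-in, part of what that reconciliation step would produce can be sketched in a few lines: given a tab-delimited export of heading and LCCN (the column layout here is my assumption), an id.loc.gov authority URI can be derived by normalizing the LCCN and appending it to the base path.

```python
import csv
import io

# Hypothetical tab-delimited input: heading <TAB> LCCN, as it might be
# exported from the project spreadsheet.
data = "Feynman, Richard Phillips\tn 50002729\n"

def lccn_to_uri(lccn):
    """Build an id.loc.gov name authority URI from an LCCN by
    stripping the internal spaces LC prints in the number."""
    return "http://id.loc.gov/authorities/names/" + lccn.replace(" ", "")

for heading, lccn in csv.reader(io.StringIO(data), delimiter="\t"):
    print(heading, "->", lccn_to_uri(lccn))
```

Matching against VIAF would still need a lookup service, since VIAF numbers can’t be computed from LCCNs; this only covers the NAF side.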

Unfortunately, I could not get the RDF plug-in installed on my laptop.  Some of my colleagues in the hands-on did manage to get it to work.  We pulled in some very knowledgeable programmers to troubleshoot, and their conclusion after about an hour of tinkering was that the plug-in was buggy.  Back to the drawing board it seems.

There will be another Linked Data hands-on session at CODE4LIB 2012.  I anticipate that it will be as useful as the DLF hands-on.  I do plan on attending and I am keeping my fingers crossed that I can make some progress with our project.   There is a great list of resources on the web page for the DLF session.  There are other tools there besides Google Refine that I hope to play with before CODE4LIB.  Plus there are links to other hands-on tutorials.  Slowly but surely I’ll get my head wrapped around how to do Linked Data.  I’m grateful the DLF forum furthered my understanding.


Wherein I get stuck with library linked open data

Posted by laura on October 28, 2011 under Semantic web | Be the First to Comment

For the past few months I’ve been writing about our pilot Linked Open Data project to expose name identifiers for Caltech’s current faculty.  So far we’ve been working on creating/obtaining the initial data set.  I’m very happy to report that we’ve now got 372/412 faculty names in the LC/NAF and, by extension, the VIAF.   We expect to complete the set within the next month or so, give or take our other production responsibilities.  Meanwhile, I’ve been messing with the metadata so I can figure out what the heck to do next.

We have the data in full MARC21 authority records.  From that I’ve made a set of MADS records (thanks MarcEdit XSLT!).  I also created a tab delimited .txt file of the name heading and LCCN.

According to the basic linked data publishing pattern, publishing linked data can be as simple as publishing a web page.  We are able to put out the structured data under an open license and in a non-proprietary format and call ourselves done.  This is what you would call 3-star linked open data.  We’d like to do a bit better than that.  In order to achieve 4-star Linked Open Data we need to do stuff like mint URIs and make the data available in forms more readily machine-processable.
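Part of what “machine-processable” means in practice is serving either HTML or RDF at the same URI depending on what the client asks for. As a toy sketch of that decision (real content negotiation happens in the web server and weighs the q-values in the Accept header, which this ignores):

```python
def choose_representation(accept_header):
    """Pick a representation for a dereferenceable URI based on the
    HTTP Accept header: RDF/XML for RDF-aware clients, HTML otherwise.
    A naive heuristic for illustration only."""
    accept = (accept_header or "").lower()
    if "application/rdf+xml" in accept:
        return "application/rdf+xml"
    return "text/html"

print(choose_representation("application/rdf+xml"))              # a linked data client
print(choose_representation("text/html,application/xhtml+xml"))  # a web browser
```

The point is that one URI for Feynman serves both people and machines, which is where I get stuck on the how.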

This is where I get stuck.

Fortunately, there will be a hands-on linked data workshop at the DLF Forum (#dlfforum) next week.  I’m really looking forward to it.  I’ve volunteered to give an overview of linked data to Maryland tech services librarians in December.  Getting our data out there will provide some necessary street cred.


Wikipedia authority template

Posted by laura on May 24, 2011 under Semantic web | Read the First Comment

There’s been some conversation lately about using Wikipedia in authority work.  Jonathan Rochkind recently blogged about the potential of using Wikipedia Miner to add subject authority information to catalog records, reasoning that the context and linkages provided in a Wikipedia article could provide better topical relevance.  Then somebody on CODE4LIB asked for help using author information in a catalog record to look up said author in Wikipedia on-the-fly.  The various approaches suggested on the list have been interesting, although there hasn’t been an optimal solution.  Although I couldn’t necessarily code such an application myself, it’s good to know how a programmer could go about doing such a thing.  What I did learn was that Wikipedia has a way of marking up names with author identifiers.  The Template:Authority Control page gives an example of how to do it.

I haven’t done much authoring or editing at Wikipedia, so the existence of the “template” is news to me.  I think it’s pretty nifty, so I just had to blog it.  The template gets me thinking.  Perhaps we’ll be able to leverage our faculty names Linked Data pilot into some sort of mash-up with Wikipedia, pushing our author identifiers into that space or pulling Wikipedia info into our work.  Our group continues to make progress on getting all our current faculty represented in the National Authority File, with an eye to exposing our set of authority records as Linked Data.  We haven’t figured out yet precisely what we’re going to do with the Linked Data once we make it available.  Build it and they will come is nice, but we need a demonstrable benefit (i.e. a cool project) to show the value of the Library’s author services.

VIAF already provides external links to Wikipedia and WorldCat Identities with its display of an author name.  Ralph Levan explained how OCLC did it, in general fashion, in the CODE4LIB conversation.  Near as I understand it, they do a data dump from Wikipedia, do some text mining, run their disambiguation algorithms over it, then add the Wikipedia page if they get a match.  I don’t know if this computational approach is a Linked Data type of thing or not.  I need to continue working my way through chapters 5 and 6 of Heath & Bizer’s Linked Data book (LOD-LAM prep!).  Nonetheless, it’s a good way of showing how connections can be built between an author identity tool and another data source to enrich the final product.  I have a hazy vision of morphing the Open Library’s “one web page for every book ever published” into “one web page for every Caltech author.”  More likely it will be “one web page tool for every Caltech author to incorporate into their personal web site,” given the extreme individualism and independence cherished within our institutional culture.  But I digress.  Yes.  “One web page for every Caltech author” would at least give us the (metaphorical) space to build a killer app.

Next steps: faculty names as linked data

Posted by laura on April 26, 2011 under Semantic web, Standards | Be the First to Comment

I’m plugging away at getting a complete set of current Caltech faculty names into the LC NAF/VIAF.  I’ve already described what I’ve done so far to get our set of faculty names into a spreadsheet.  I had to put on my thinking cap to do the next steps.  I mentioned that we’re going to be creating the required records manually since we can’t effectively use a delimited-text-to-MARC translator to automate the process.  So how many of our names require original authority work?  We had 741 names in our spreadsheet.  Some of the names could be eliminated very quickly.  The list as-is contains adjunct faculty and staff instructors.  For our project we’re interested in tenure-track faculty only.  It was a small matter to sort the list and remove those not meeting that criterion.  This leaves 402 names for our complete initial working set.

The next step was to remove names from the list which already have authority records in the NAF.  It would have been efficient enough to check the names one-by-one while we do the authority work.  We would be searching the NAF anyway, in order to avoid conflicts when creating or editing records.  If a name on our list happened to be in the NAF, then we would move on to the next name.

Yet I chafe whenever there is any tedious one-by-one work to do.  The batch processing nerd in me wondered if there was a quick and dirty way to eliminate in-the-NAF-already names from our to-do list.  Enter OCLC Connexion batch searching mode.  I decided to run my list of names just for giggles.  This involved some jiggering of my spreadsheet so I could get a delimited text file that Connexion would take as input.  Thank you Excel concatenation formula!  I got results for 205 names, roughly half of the list.  There were some false positives (names which matched, but weren’t for a Caltech person) and some false negatives (names without matches but for famous people I know should have records already).  The false negatives were mostly due to form of name.  The heading in the authority record didn’t match the form of the name I had in the spreadsheet.

I compared my OCLC batch search results against the initial working set of names in my spreadsheet.  This wasn’t exactly an instant process.  I spent two working days reviewing it.  I’ve confirmed that 178 names are in the NAF.  I think this has saved a bit of time towards getting the project done.  We have four catalogers.  Let’s say each of them could go through five records per week in addition to their regular work.  It would take over 20 weeks to review the full list of 402 names.  By reducing the number to 224, it would take roughly 11-12 weeks.  Much better!  I’d say my work days went to good use.  Now I can estimate a reasonable date for completing the set of MARC authority records for our current faculty.  Let’s add a week or two (or four!) for the “known-unknowns.”  Production fires always pop up and keep one from working on projects.  Plus summer is a time of absences due to conference attendance and vacation.  I think it’s do-able to get all of the records done by mid-August.  I would be thrilled if our work was done by September.  We’re not NACO independent yet, so it will take a bit longer to add them to the NAF.  They need our reviewer’s stamp of approval.  Then we’ll be ready to roll with a Linked Data project.
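The back-of-the-envelope math above works out like so (the throughput figure of five records per cataloger per week is the guess from the paragraph, not a measured rate):

```python
import math

working_set = 402        # tenure-track names after the first cull
confirmed_in_naf = 178   # names the batch search confirmed are already in the NAF
remaining = working_set - confirmed_in_naf

catalogers = 4
per_week = 5             # assumed records per cataloger per week
weeks = math.ceil(remaining / (catalogers * per_week))

print(remaining, "names left;", weeks, "weeks at", catalogers * per_week, "records/week")
```

That lands at 224 names and about 12 weeks, which is where the mid-August estimate comes from before padding for the known-unknowns.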

It would be convenient if there were a way to batch search the NAF/VIAF, since the VIAF is already exposed as linked data.  I’m not aware of any such functionality, so I’ve decided we should keep a local set of the MARC authority records.  I suspect it will make things simpler in the long run if we serve the data ourselves.  I also suspect that having a spreadsheet with the names and their associated identifiers (LCCN, ARN, and, eventually, VIAF number) will be useful.  It may seem weird to keep database identifier numbers when one has the full MARC records.  I’ve learned, however, that having data in a delimited format is invaluable for batch processing.  It takes a blink of an eye to paste a number into a spreadsheet when you’re reviewing or creating a full record anyway.  Sure, I could create a spreadsheet of the ARNs and LCCNs on-the-fly by extracting the 001 and 010 fields from the MARC records.  But that’s time and energy.  If it’s painless to gather data, one should gather data.
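For what it’s worth, if the records were exported as MARCXML rather than binary MARC, that 001/010 extraction is a short script; the sample record below is invented for illustration, not one of ours.

```python
import xml.etree.ElementTree as ET

# A minimal invented MARCXML authority record for illustration.
MARCXML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">oca00012345</controlfield>
  <datafield tag="010" ind1=" " ind2=" ">
    <subfield code="a">n 50002729</subfield>
  </datafield>
</record>"""

NS = {"m": "http://www.loc.gov/MARC21/slim"}

def arn_and_lccn(record_xml):
    """Pull the 001 (record control number) and 010 $a (LCCN)
    out of a MARCXML authority record."""
    rec = ET.fromstring(record_xml)
    arn = rec.findtext("m:controlfield[@tag='001']", namespaces=NS)
    lccn = rec.findtext(
        "m:datafield[@tag='010']/m:subfield[@code='a']", namespaces=NS
    )
    return arn, lccn.strip() if lccn else lccn

print(arn_and_lccn(MARCXML))
```

Looping that over a folder of records and writing a tab-delimited line per record would rebuild the spreadsheet, but pasting the numbers as we go is still less fuss.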

We’ll be able to do something interesting with the records once we have exposed the full set as linked data (or at least know the NAF or VIAF numbers so we can point to the names as linked data).  No, I don’t know yet what that something interesting will be.  I’m getting closer to imagining the possibilities though.  I’ve mentioned before that I get dazed and confused when faced with implementing a linked data project.  Two blog posts crossed my feed yesterday which cleared my foggy head a little.  Run, don’t walk, to Karen Coyle’s piece on Visualizing Linked Data and Ed Summers’ description of DOIs as linked data (btw, nice diagram, Ed!).  I’ve got more to say about my mini-epiphany, but this post has gone on far too long already.  I’ll think aloud about it here another day.

It’s the little things

Posted by laura on April 1, 2011 under Metadata, Standards | 4 Comments to Read

I’ve mentioned that we want to get authority records for all current Caltech faculty into the National Authority File and, by extension, into the VIAF.  The first step is to ensure that we have a current and comprehensive list of all faculty working here.  I’m happy to learn that I can easily obtain the information in a manipulable form.  I was expecting that I would need to ask somebody in academic records and plead our case.  Lists can be tightly guarded by the powers-that-be.  I just figured out that you can convert HTML tables to Excel via Internet Explorer.  That’s probably old news to most of you.  I’ve done .xls-to-HTML conversion; I’ve just never had the need to go in the opposite direction.  Plus I don’t use Internet Explorer.

I was able to create a spreadsheet of the necessary data by doing a directory search limited to faculty and running the conversion.  Sweet!  Now we can divvy up the work and get cracking.  Getting the info is a small thing.  But it’s these little victories which make my days brighter.  I played around with the delimited-text-to-MARC translator in MarcEdit to auto-generate records from the spreadsheet.  It worked like a charm.  Unfortunately the name info in the spreadsheet is collated within a single cell.  Also it’s in first-name-surname order without any normalization of middle initials, middle names, or nicknames in parens.  A text-to-MARC transform can only work with the data it is given.  A bunch of records with 100 fields in the wrong order isn’t so helpful.  I messed about with the text-to-columns tool in Excel in order to parse the name data more finely, with limited success.  It worked but would require much post-split intervention to ensure the data is correct.  Might as well do that work within Connexion.
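The kind of splitting I was attempting amounts to a naive heuristic like the one below: treat the last token as the surname and invert. It breaks on compound surnames, suffixes, and parenthesized nicknames, which is exactly why the post-split cleanup wasn’t worth it.

```python
def invert_name(direct_form):
    """Naively convert 'First M. Last' to 'Last, First M.'.
    Compound surnames ('van der Berg'), suffixes ('Jr.'), and
    nicknames in parens all defeat this, hence manual review."""
    parts = direct_form.split()
    if len(parts) < 2:
        return direct_form
    return parts[-1] + ", " + " ".join(parts[:-1])

print(invert_name("Richard P. Feynman"))  # Feynman, Richard P.
```

Good enough to pre-fill a column, nowhere near good enough to trust in a 100 field.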

In fact, I’m OK with creating the authority records from scratch since we’re training to be NACO contributors.  People need the practice.  In my experience, it’s easier to do original cataloging than to edit derived records.  Editing requires a finer eye, and original work can be helped along with constant data and/or macros.  Regardless, it was fun to play with the transform and teach myself something new.  And it’s very exciting to take a step towards meeting our goal of authority/identity information and identifiers for our constituents.

Linked Open Data Libraries Archives Museums (LOD-LAM) Summit

Posted by laura on March 10, 2011 under Metadata, Semantic web, Standards | Be the First to Comment

I’m stoked.  I’ve been accepted to the International Linked Open Data in Libraries, Archives and Museums Summit.  From the about page, the summit “will convene leaders in their respective areas of expertise from the humanities and sciences to catalyze practical, actionable approaches to publishing Linked Open Data, specifically:”

  • Identify the tools and techniques for publishing and working with Linked Open Data
  • Draft precedents and policy for licensing and copyright considerations regarding the publishing of library, archive, and museum metadata
  • Publish definitions and promote use cases that will give LAM staff the tools they need to advocate for Linked Open Data in their institutions

It’s exciting because of its potential to spark real progress for library linked data.  I’m keen to be involved with projects where I can get my hands dirty.  I’m pretty much done with librarian conferences like ALA.  IMHO, ALA is an echo chamber of how-we-done-it-good presentations and yet-another-survey research.  I went to an ERM presentation at the mid-winter meeting and heard a speaker discuss workflows that I’ve seen implemented in libraries for the past 13 years.  Seriously.  ALA is good for networking with fellow librarians, to be sure, but it isn’t the place to get bleeding-edge information.  I’m ready to give my time and effort to breaking new ground.  I’m very fortunate that my boss is incredibly supportive of my LOD-LAM participation.

We want to do a linked data project with author identifiers for our faculty.  We’re a small institution.  We’ve got roughly 300 current faculty members which is a small enough number for us to create a complete set of records within a reasonable amount of time.  Our goal is to contribute our metadata to the commons and to share our experience as a use case.    I’m quite honored to be invited.  I’ve been following the work of some members of  the organizing committee for years and I’m very much looking forward to finally meeting them.