Next steps: faculty names as linked data

Posted by laura on April 26, 2011 under Semantic web, Standards

I’m plugging away at getting a complete set of current Caltech faculty names into the LC NAF/VIAF. I’ve already described what I’ve done so far to get our set of faculty names into a spreadsheet. I had to put on my thinking cap for the next steps. I mentioned that we’re going to be creating the required records manually, since we can’t effectively use a delimited-text-to-MARC translator to automate the process. So how many of our names require original authority work? We had 741 names in our spreadsheet. Some of the names could be eliminated very quickly. The list as-is contains adjunct faculty and staff instructors, but for our project we’re interested in tenure-track faculty only. It was a small matter to sort the spreadsheet and remove the names that didn’t meet that criterion (see the sketch below). This leaves 402 names for our complete initial working set.
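If you keep the list as a CSV, that filtering step can also be scripted. This is just a minimal sketch: the file names, the "rank" column, and its values are hypothetical stand-ins for whatever your spreadsheet actually uses.

```python
import csv

# Hypothetical input: faculty.csv with (at least) "name" and "rank" columns.
# The rank values here are placeholders; match them to your actual data.
TENURE_TRACK = {"Assistant Professor", "Associate Professor", "Professor"}

with open("faculty.csv", newline="") as infile, \
        open("working_set.csv", "w", newline="") as outfile:
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Keep only the tenure-track rows for the working set.
        if row["rank"] in TENURE_TRACK:
            writer.writerow(row)
```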

The next step was to remove names from the list which already have authority records in the NAF. It would have been efficient enough to check the names one by one while we do the authority work. We would be searching the NAF anyway, in order to avoid conflicts when creating or editing records. If a name on our list happened to be in the NAF already, we would simply move on to the next name.

Yet I chafe whenever there is any tedious one-by-one work to do. The batch processing nerd in me wondered if there was a quick and dirty way to eliminate in-the-NAF-already names from our to-do list. Enter OCLC Connexion batch searching mode. I decided to run my list of names just for giggles. This involved some jiggering of my spreadsheet so I could get a delimited text file that Connexion would take as input. Thank you, Excel concatenation formula! I got results for 205 names, roughly half of the list. There were some false positives (names that matched but weren’t for a Caltech person) and some false negatives (names without matches, but for famous people I know should have records already). The false negatives were mostly due to form of name: the heading in the authority record didn’t match the form of the name I had in the spreadsheet.
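For anyone who would rather script that step than wrangle CONCATENATE in Excel, here is a minimal sketch of building a batch-search input file from the spreadsheet. The file names, the column names, and the pn= index label are all assumptions; check your own data and the Connexion documentation for the query syntax your batch-search setup actually expects.

```python
import csv

# Hypothetical input: working_set.csv with "last" and "first" columns.
# Output: one search per line; "pn=" is a placeholder for whatever
# personal-name index label your Connexion profile uses.
with open("working_set.csv", newline="") as infile, \
        open("connexion_batch.txt", "w") as outfile:
    for row in csv.DictReader(infile):
        outfile.write(f"pn={row['last']}, {row['first']}\n")
```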

I compared my OCLC batch search results against the initial working set of names in my spreadsheet. This wasn’t exactly an instant process; I spent two working days reviewing it. I’ve confirmed that 178 names are in the NAF. I think this has saved a bit of time towards getting the project done. We have four catalogers. Let’s say each of them could go through five records per week in addition to their regular work. It would take just over 20 weeks to review the full list of 402 names. By reducing the number to 224, it would take roughly 11-12 weeks. Much better! I’d say my work days went to good use. Now I can estimate a reasonable date for completing the set of MARC authority records for our current faculty. Let’s add a week or two (or four!) for the “known-unknowns.” Production fires always pop up and keep one from working on projects, and summer is a time of absences due to conference attendance and vacation. I think it’s do-able to get all of the records done by mid-August; I would be thrilled if our work were done by September. We’re not NACO independent yet, so it will take a bit longer to add the records to the NAF, since they need our reviewer’s stamp of approval. Then we’ll be ready to roll with a Linked Data project.
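For the curious, here is the back-of-the-envelope math, with the throughput assumption made explicit:

```python
# Schedule estimate; the five-records-per-week figure is the
# assumption everything else hangs on.
catalogers = 4
per_cataloger_per_week = 5
throughput = catalogers * per_cataloger_per_week   # 20 records/week

full_list = 402
remaining = full_list - 178                        # 224 names still to do

print(full_list / throughput)    # 20.1 -> just over 20 weeks
print(remaining / throughput)    # 11.2 -> roughly 11-12 weeks
```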

It would be convenient if there were a way to batch search the NAF/VIAF, since the VIAF is already exposed as linked data. I’m not aware of any such functionality, so I’ve decided we should keep a local set of the MARC authority records. I suspect it will make things simpler in the long run if we serve the data ourselves. I also suspect that having a spreadsheet with the names and their associated identifiers (LCCN, ARN, and, eventually, VIAF number) will be useful. It may seem weird to keep database identifier numbers when one has the full MARC records. I’ve learned, however, that having data in a delimited format is invaluable for batch processing. It takes a blink of an eye to paste a number into a spreadsheet when you’re reviewing or creating a full record anyway. Sure, I could create a spreadsheet of the ARN and LCCN on the fly by extracting the 001 and 010 fields from the MARC records (sketched below). But that’s time and energy. If it’s painless to gather data, one should gather data.
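If we ever do need to pull those fields in bulk, something like this minimal pymarc sketch would do it. The file names are hypothetical, and it assumes the authority records are saved as binary MARC:

```python
import csv
from pymarc import MARCReader  # third-party: pip install pymarc

# Hypothetical input: authorities.mrc, a file of binary MARC authority records.
# Output: a CSV of control numbers for batch work later.
with open("authorities.mrc", "rb") as marcfile, \
        open("identifiers.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["ARN", "LCCN"])
    for record in MARCReader(marcfile):
        arn = record["001"].data if record["001"] else ""   # 001 control number
        lccn = record["010"]["a"] if record["010"] else ""  # 010 $a LC control number
        writer.writerow([arn.strip(), lccn.strip()])
```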

We’ll be able to do something interesting with the records once we have exposed the full set as linked data (or at least know the NAF or VIAF numbers so we can point to the names as linked data). No, I don’t know yet what that something interesting will be, but I’m getting closer to imagining the possibilities. I’ve mentioned before that I get dazed and confused when faced with implementing a linked data project. Two blog posts that crossed my feed yesterday cleared my foggy head a little. Run, don’t walk, to Karen Coyle’s piece on Visualizing Linked Data and Ed Summers’ description of DOIs as linked data (btw, nice diagram, Ed!). I’ve got more to say about my mini-epiphany, but this post has gone on far too long already. I’ll think aloud about it here another day.
