I’m stoked about the work I’ve been doing lately. I finally get to do some hands-on production work with metadata standards beyond AACR2/RDA/MARC, DC, and EAD. I taught myself XML in the late 90s and I’ve spent about a decade avidly following the progress of METS, MODS, and PREMIS without any practical application. Learning in a vacuum sucks. Getting neck-deep into a project gives me a firmer grasp of concepts.
We’re currently revamping our Archives systems architecture. This has involved a year of analyzing workflow and current system functions, scoping out the functional requirements of what we need our systems to do, and evaluating the various solutions available. Our integrated archival management system is a bespoke FileMaker Pro database (well, actually several databases). It ties together patron management, financial management, archival description/EAD generation, digital object management, and a web/interface layer. It worked well for us, but our system is 20 years old and won’t scale to handle more complex digital archiving. Ideally a new system for Archives would be as “integrated” as our custom-developed system. Unfortunately, no such Integrated Archival System exists.
We decided to go with ArchivesSpace for archival description, Aeon for patron/financial management, and Fedora/Islandora for our digital asset management system (DAMS). A unified interface layer wasn’t a critical component for us in the near-term. We figure we can do something after the other components are implemented. The strategic goal for us was to create an architecture with what I call the four systems virtues: extensibility, interoperability, portability, and scalability. We wanted to future-proof ourselves as much as possible.
We’re implementing Fedora/Islandora in our first phase. We’re starting by migrating our small collection of 10,000 digitized images from FileMaker Pro to Islandora, with the help of the folks from Discovery Garden. We’re in the metadata mapping stage, making decisions on schema structure and indexing/searching/display functions. We’re considering using a modified MODS schema with some local and VRA Core elements. I’ve needed to quickly climb a relatively steep learning curve. First, I don’t have a detailed knowledge of data content standards for cataloging images. It’s not a medium one typically handles in science & engineering. I’ve been reacquainting myself with Cataloging Cultural Objects (CCO) so I can help our archivists with descriptive data entry. A knowledge of data content should inform choices of data structure and format (i.e. which elements of MODS and VRA Core to have in our schema). Second, I’m learning the Fedora “digital object” model and how it relates to Islandora functionality. A digital object in this context is not a digital object in the librarian sense, where the term is a label for born-digital/digitized content. Third, I’m simultaneously considering the crosswalk between our legacy records and MODS while specifying our future image descriptive cataloging needs.
My biggest philosophical brain effort right now is figuring out how to implement best practice in image cataloging within Islandora/Fedora. Per CCO/VRA Core, there should be a clear distinction between the work and the image (analogous to FRBR work and expression/manifestation). In theory, this can be done by having two records and relating them via a metadata wrapper like METS, or by using “related item” elements within the descriptive metadata schema. In practice, I simply don’t know how Islandora can manage it. Fedora was made for this type of thing, fortunately, so I assume that it’s possible. Obviously I’ll be looking at the work of others to inform our choices in the metadata structure.
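To make the “related item” option concrete, here is a minimal sketch in Python that builds a MODS record for an image and nests a description of the depicted work inside a `relatedItem` element. It is an illustration only: the titles are made up, the helper names (`mods_el`, `image_record`) are hypothetical, and using `type="original"` to mark the work is my assumption, not something CCO, VRA Core, or Islandora prescribes.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def mods_el(parent, tag, text=None, **attrs):
    """Create a MODS-namespaced child element under parent."""
    el = ET.SubElement(parent, f"{{{MODS_NS}}}{tag}", attrs)
    if text is not None:
        el.text = text
    return el


def image_record(image_title, work_title):
    """Build a MODS record for an image that points at the work it depicts."""
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    title_info = mods_el(mods, "titleInfo")
    mods_el(title_info, "title", image_title)
    # Assumption: type="original" is used here to flag the depicted work.
    related = mods_el(mods, "relatedItem", type="original")
    related_title = mods_el(related, "titleInfo")
    mods_el(related_title, "title", work_title)
    return mods


# Hypothetical titles, for illustration only.
record = image_record("Digitized photograph (example)", "Depicted work (example)")
print(ET.tostring(record, encoding="unicode"))
```

The alternative, two separate records tied together by a wrapper such as METS, would keep the work description reusable across many images at the cost of more moving parts.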
Fortunately, I consider this fun.
I’m back from the International Linked Open Data in Libraries, Archives and Museums Summit, which was held June 2-3, 2011 in San Francisco, CA, USA. My brain is still digesting all that I learned. I’ve posted my rough notes in case anybody else can find them useful. I thought the event was really well done. I learned about several LOD projects which might provide tools we can use here at Caltech. I’ve got to dig in a bit more detail. The organizers will be posting a summit report with all of the action items for next steps. Of particular interest to me: there will be some work on publishing citations as linked data and there will be some materials released which can assist with educating the LAM community about LOD. I’ll probably write about each aspect as more information comes to light and as I wrap my head around it.
Re: my post yesterday saying I was unsure if the VIAF text mining approach to incorporating Wikipedia links within their records was Linked Data. There’s a good little conversation over at the LOD-LAM blog which elucidated the difference for me. They say it better than I can, so go have a look-see. The money quotation: “Linked Data favors ‘factoids’ like date-and-place-of-birth, while statistical text-mining produces (at least in this case) distributions interpretable as ‘relationship strength’.”
Another librarian has seen the Linked Data light. Mita Williams, the New Jack Librarian, writes about gaining a new appreciation for LOD at the recent Great Lakes THAT camp. Her take-away seems similar to my understanding: librarians already know how to create Linked Data. We need to see the application of the Linked Data in new contexts in order to comprehend the utility of exposing the data. The tricky bit IMHO is that creating applications to use the data requires a SPARQL end point. These SPARQL end points aren’t geared for humans. They are a “machine-friendly interface towards a knowledge base.”
I think the machine application layer of Linked Data is where librarians hit a barrier when getting involved with Linked Open Data (LOD). I don’t have the first clue how to set up a SPARQL end point. My technical expertise isn’t there and I’m sure there are a lot of people in the same boat (CODE4LIBers notwithstanding). Most of the stuff I’ve read about getting libraries more involved in LOD has focused on explaining how RDF is done in subject predicate object syntax then urging libraries to get their metadata transformed into RDF. I’ve seen precious little plain English instruction on building an app with Linked Data. I have seen great demos on nifty things done by people in library-land. I’ll give a shout out here to John Mark Ockerbloom and his use of id.loc.gov to enhance the Online Books Page. John Mark Ockerbloom has a PhD in computer science. How do the rest of us get there?
Personally, I’m working with the fine folks here to get our metadata in a ready-to-use Linked Data format. And I’m plowing through the jargon-laden documentation to teach myself next steps. Jon Voss, LOD-LAM summit organizer, has posted a reading list to help and is soliciting contributions. The first title I’m delving into is Heath & Bizer’s Linked Data: Evolving the Web into a Global Data Space, which has a free HTML version available. They include a groovy little diagram which outlines the steps in the process of “getting there.” I’m heartened to see that our 1st step (getting the data ready) reflects the 1st step in the diagram.
I’ve been humming the Johnny Nash song to myself ever since reading Karen Coyle’s blog post on Visualizing Linked Data and Ed Summers’ blog post on DOIs as Linked Data. Thanks to them I think I’ve finally conceptualized the “so-what” factor for Linked Data. It’s the mash-ups stupid! The key to doing something useful with Linked Data is being able to build a web page that pulls together information via various bits of linked data. Pick and choose your information according to your need!
Likening Linked Data applications to mash-ups is probably a bit oversimplistic. It’s also the pearl diving, stupid! Pearl diving is my term for how a machine could, in theory, traverse from link to link to link in order to mine information. Ed’s example of taking a citation, linking to journal information, then linking to Library of Congress shows how a piece of code could crawl and trawl. But how wide a net to cast and how deep to throw it? A bit of programming is in order to mash-up Linked Data streams effectively. I read Berners-Lee on Linked Data over and over and over and couldn’t see what the big deal was about creating chains of metadata. The chains are infrastructure. The value is in what you choose to hang on those chains. Ed’s diagrams and Karen’s explanations finally got this through my thick skull. I invite commenters to correct me if my understanding is flawed.
Building a web page from various streams of data isn’t as simple as it seems on the surface. Also, any Linked Data service one might develop wouldn’t be a single web page. It would be some sort of search and retrieval tool which created results pages on-the-fly. One has to know where there are data stores and what they contain. One has to have some code, or bot, or what-not which does the search and retrieval and plugs the information into the resulting display. And one has to have an overall vision of what data can be combined to create something which is larger than the sum of its parts. I think this is a tall order for libraries, archives, and museums with small staffs and variable technical resources. We’re used to dealing with structured data so it’s a small leap to conceive of meddling with the data a bit to expose it in the Linked Data way. It’s a big old pole vault, however, to move from I-know-some-HTML-and-CSS to programming a retrieval system that pulls information from various quarters and presents it in a meaningful interface. That’s where the programming knowledge comes in.
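Here is a rough sketch of what I mean by pearl diving, written in Python under heavy assumptions. The triples are invented stand-ins (the `ex:` names are hypothetical) for data a real program would fetch from several Linked Data sources, and `max_depth` is one simple way to answer the “how deep to throw it” question. It is a toy breadth-first walk, not how any particular LOD tool works.

```python
from collections import deque

# Hypothetical (subject, predicate, object) triples standing in for data
# that would really be retrieved from several Linked Data sources.
TRIPLES = [
    ("ex:article1", "ex:publishedIn", "ex:journal1"),
    ("ex:journal1", "ex:hasSubject", "ex:subject1"),
    ("ex:subject1", "ex:relatedTo", "ex:subject2"),
    ("ex:article1", "ex:author", "ex:person1"),
]


def neighbors(node):
    """Return (predicate, object) pairs for links leading out of node."""
    return [(p, o) for s, p, o in TRIPLES if s == node]


def pearl_dive(start, max_depth):
    """Breadth-first walk from start, following links at most max_depth hops."""
    seen = {start}
    found = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links past the depth limit
        for predicate, obj in neighbors(node):
            found.append((node, predicate, obj))
            if obj not in seen:
                seen.add(obj)
                queue.append((obj, depth + 1))
    return found


# One hop finds the journal and the author; deeper dives reach the subjects.
for triple in pearl_dive("ex:article1", max_depth=2):
    print(triple)
```

The width of the net would be a second knob, for example following only certain predicates, which this sketch leaves out.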
I’ve struggled for a long time trying to figure out where to define role boundaries between metadata librarians and programmers. My nascent understanding of Linked Data services has led me to a rough take: Metadata librarians create Linked Data or update legacy metadata to Linked Data. Programmers create or implement the search & retrieval and interface tools. Somewhere in-between is the systems analyst role. The analyst figures out which bits of linked-data would work well together to make a service that meets customer needs. Librarians and programmers probably share the systems analyst role. Things get fuzzy when the library/archive/museum is a small operation or one-man band. We’re very fortunate to have awesome programmers here. Together we can take our knowledge of our primary end-user and create a useful (and hopefully well used) service with the Linked Data we’re creating via our faculty names project.
I’m plugging away at getting a complete set of current Caltech faculty names into the LC NAF/VIAF. I’ve already described what I’ve done so far to get our set of faculty names into a spreadsheet. I had to put on my thinking cap to do the next steps. I mentioned that we’re going to be creating the required records manually since we can’t effectively use a delimited-text to MARC translator to automate the process. So how many of our names require original authority work? We had 741 names in our spreadsheet. Some of the names could be eliminated very quickly. The list as-is contains adjunct faculty and staff instructors. For our project we’re interested in tenure track faculty only. It was a small matter to sort it and remove those not meeting the parameter. This leaves 402 names for our complete initial working set.
The next step was to remove names from the list which already have authority records in the NAF. It would have been efficient enough to check the names one-by-one while we do the authority work. We would be searching the NAF anyway, in order to avoid conflicts when creating or editing records. If a name on our list happened to be in the NAF, then we would move on to the next name.
Yet I chafe whenever there is any tedious one-by-one work to do. The batch processing nerd in me wondered if there was a quick and dirty way to eliminate in-the-NAF-already names from our to-do list. Enter OCLC Connexion batch searching mode. I decided to run my list of names just for giggles. This involved some jiggering of my spreadsheet so I could get a delimited text file that Connexion would take as input. Thank you Excel concatenation formula! I got results for 205 names, roughly half of the list. There were some false positives (names which matched, but weren’t for a Caltech person) and some false negatives (names without matches but for famous people I know should have records already). The false negatives were mostly due to form of name. The heading in the authority record didn’t match the form of the name I had in the spreadsheet.
I compared my OCLC batch search results against the initial working set of names in my spreadsheet. This wasn’t exactly an instant process. I spent two working days reviewing it. I’ve confirmed that 178 names are in the NAF. I think this has saved a bit of time towards getting the project done. We have four catalogers. Let’s say each of them could go through five records per week in addition to their regular work. It would take over 20 weeks to review the full list of 402 names. By reducing the number to 224, it would take roughly 11-12 weeks. Much better! I’d say my work days went to good use. Now I can estimate a reasonable date for completing the set of MARC authority records for our current faculty. Let’s add a week or two (or four!) for the “known-unknowns.” Production fires always pop up and keep one from working on projects. Plus summer is a time of absences due to conference attendance and vacation. I think it’s do-able to get all of the records done by mid-August. I would be thrilled if our work was done by September. We’re not NACO independent yet, so it will take a bit longer to add them to the NAF. They need our reviewer’s stamp of approval. Then we’ll be ready to roll with a Linked Data project.
It would be convenient if there was a way to batch search the NAF/VIAF, since the VIAF is already exposed as linked data. I’m not aware of any such functionality so I’ve decided we should keep a local set of the MARC authority records. I suspect it will make things simpler in the long run if we serve the data ourselves. I also suspect that having a spreadsheet with the names and their associated identifiers will be useful (LCCN, ARN, and, eventually, VIAF number). It may seem weird to keep database identifier numbers when one has the full MARC records. I’ve learned, however, that having data in a delimited format is invaluable for batch processing. It takes a blink of an eye to paste a number into a spreadsheet when you’re reviewing or creating a full record anyway. Sure, I could create a spreadsheet of the ARN and LCCN on-the-fly by extracting 001 and 010 fields from the MARC records. But that’s time and energy. If it’s painless to gather data, one should gather data.
We’ll be able to do something interesting with the records once we have exposed the full set as linked data (or we at least know the NAF or VIAF numbers so we can point to the names as linked data). No, I don’t know yet what that something interesting will be. I’m getting closer to imagining the possibilities though. I’ve mentioned before that I get dazed and confused when faced with implementing a linked data project. Two blog posts crossed my feed yesterday which cleared my foggy head a little. Run, don’t walk, to Karen Coyle’s piece on Visualizing Linked Data and Ed Summers’ description of DOIs as linked data (btw, nice diagram Ed!). I’ve got more to say about my mini-epiphany, but this post has gone on far too long already. I’ll think-aloud about it here another day.
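For anyone who wants to check my back-of-the-envelope math, here it is as a small Python sketch. The numbers are the ones above (four catalogers, five records each per week, 402 names, 178 already in the NAF); the function name `weeks_needed` is just something I made up for the illustration.

```python
import math


def weeks_needed(names, catalogers=4, records_per_cataloger_per_week=5):
    """Weeks to work through a list at the team's combined weekly rate."""
    weekly_capacity = catalogers * records_per_cataloger_per_week  # 20 per week
    return names / weekly_capacity


full_list = 402
already_in_naf = 178
remaining = full_list - already_in_naf  # 224 names still need records

print(f"Full list: {weeks_needed(full_list):.1f} weeks")   # 20.1
print(f"Remaining: {weeks_needed(remaining):.1f} weeks")   # 11.2
print(f"Whole weeks to plan for: {math.ceil(weeks_needed(remaining))}")  # 12
```

The buffer for “known-unknowns” sits on top of that 12-week figure.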
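For the record, the on-the-fly extraction I’m talking about could look something like the Python sketch below. It assumes the authority records have been exported as MARCXML (that export step is my assumption, not part of our workflow yet), and the sample record holds placeholder values, not real identifiers. The function names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
NS = {"marc": MARC_NS}

# Placeholder record for illustration; the values are not real identifiers.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <controlfield tag="001">ARN-PLACEHOLDER-1</controlfield>
    <datafield tag="010" ind1=" " ind2=" ">
      <subfield code="a">LCCN-PLACEHOLDER-1</subfield>
    </datafield>
    <datafield tag="100" ind1="1" ind2=" ">
      <subfield code="a">Surname, Forename</subfield>
    </datafield>
  </record>
</collection>"""


def identifier_rows(marcxml_text):
    """Return (001, 010 $a, 100 $a) tuples, one per record in the MARCXML."""
    root = ET.fromstring(marcxml_text)
    rows = []
    for record in root.iter(f"{{{MARC_NS}}}record"):
        arn = record.findtext("marc:controlfield[@tag='001']", default="", namespaces=NS)
        lccn = record.findtext(
            "marc:datafield[@tag='010']/marc:subfield[@code='a']", default="", namespaces=NS
        )
        heading = record.findtext(
            "marc:datafield[@tag='100']/marc:subfield[@code='a']", default="", namespaces=NS
        )
        rows.append((arn.strip(), lccn.strip(), heading.strip()))
    return rows


def rows_to_csv(rows):
    """Serialize the identifier rows as delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["arn", "lccn", "heading"])
    writer.writerows(rows)
    return buffer.getvalue()


print(rows_to_csv(identifier_rows(SAMPLE)))
```

It works, but it is still more effort than pasting a number into a cell while the record is already open.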
I’ve mentioned that we want to get authority records for all current Caltech faculty into the National Authority File and by extension into the VIAF. The 1st step is to ensure that we have a current and comprehensive list of all faculty working here. I’m happy to learn that I can easily obtain the information in a manipulate-able form. I was expecting that I would need to ask somebody in academic records and plead our case. Lists can be tightly guarded by powers-that-be. I just figured out that you can convert HTML tables to Excel via Internet Explorer. That’s probably old news to most of you. I’ve done .xls to html conversion; I’ve just never had the need to go in the opposite direction. Plus I don’t use Internet Explorer.
I was able to create a spreadsheet of the necessary data by doing a directory search limited to faculty and running the conversion. Sweet! Now we can divvy up the work and get cracking. Getting the info is a small thing. But it’s these little victories which make my days brighter. I played around with the delimited text to MARC translator in MarcEdit to auto-generate records from the spreadsheet. It worked like a charm. Unfortunately the name info in the spreadsheet is collated within a single cell. Also it’s in first name surname order without any normalization of middle initials, middle names, or nicknames in parens. A text-to-MARC transform can only work with the data it is given. A bunch of records with 100 fields in the wrong order isn’t so helpful. I messed about with the text-to-columns tool in Excel in order to parse the name data more finely, to no avail. It worked but would require much post-split intervention to ensure the data is correct. Might as well do that work within Connexion.
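To show why the split needs so much hand-holding, here is a deliberately naive Python sketch of the inversion. It drops a parenthetical nickname and treats the last word as the surname. The names are invented, `invert_name` is a hypothetical helper, and the “last word is the surname” rule is an assumption that breaks on exactly the cases that need a cataloger’s eye.

```python
import re


def invert_name(display_name):
    """Naively turn 'Forename M. Surname (Nick)' into 'Surname, Forename M.'."""
    # Drop any nickname given in parentheses.
    without_nickname = re.sub(r"\s*\([^)]*\)", "", display_name)
    parts = without_nickname.split()
    if len(parts) < 2:
        return without_nickname.strip()
    # Assumption: the final word is the whole surname.
    surname = parts[-1]
    forenames = " ".join(parts[:-1])
    return f"{surname}, {forenames}"


# Invented names, for illustration only.
print(invert_name("Jane Q. Public (Janie)"))  # Public, Jane Q.
print(invert_name("Maria Garcia Lopez"))      # Lopez, Maria Garcia
```

The second result could well be wrong if “Garcia Lopez” is a compound surname, and nothing in the cell tells the code which it is. Suffixes and surname prefixes cause the same kind of trouble.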
In fact, I’m ok with creating the authority records from scratch since we’re training to be NACO contributors. People need the practice. In my experience, it’s easier to do original cataloging vs. using derived records. Editing requires a finer eye and original work can be helped along with constant data and/or macros. Regardless, it was fun to play with the transform and teach myself something new. And it’s very exciting to take a step towards meeting our goal of authority/identity information/identifiers for our constituents.
I’m stoked. I’ve been accepted to the International Linked Open Data in Libraries, Archives and Museums Summit. From the about page, the summit: will convene leaders in their respective areas of expertise from the humanities and sciences to catalyze practical, actionable approaches to publishing Linked Open Data, specifically:
- Identify the tools and techniques for publishing and working with Linked Open Data
- Draft precedents and policy for licensing and copyright considerations regarding the publishing of library, archive, and museum metadata
- Publish definitions and promote use cases that will give LAM staff the tools they need to advocate for Linked Open Data in their institutions
It’s exciting because of its potential to spark real progress for library linked data. I’m keen to be involved with projects where I can get my hands dirty. I’m pretty much done with librarian conferences like ALA. IMHO, ALA is an echo chamber of how-we-done-it-good presentations and yet-another-survey research. I went to an ERM presentation at the mid-winter meeting and heard a speaker discuss work flows that I’ve seen implemented in libraries for the past 13 years. Seriously. ALA is good for networking with fellow librarians to be sure but it isn’t the place to get bleeding edge information. I’m ready to give my time and effort to breaking new ground. I’m very fortunate that my boss is incredibly supportive of my LOD-LAM participation.
We want to do a linked data project with author identifiers for our faculty. We’re a small institution. We’ve got roughly 300 current faculty members which is a small enough number for us to create a complete set of records within a reasonable amount of time. Our goal is to contribute our metadata to the commons and to share our experience as a use case. I’m quite honored to be invited. I’ve been following the work of some members of the organizing committee for years and I’m very much looking forward to finally meeting them.
I’m looking at incentives for making our serials holdings MARC standard compliant. The MARC Format for Holdings Data (MFHD), pronounced “muffed” I’m told, isn’t supported very well within our ILS. MFHD is held within check-in records. It makes sense to a degree. One needs coverage ranges when checking in journals. The data is buried, however, in a place where most people using the ILS, customers or staff, will not see it. We would love to get it current, correct, and usable.
The biggest reason for standardizing is to make interlibrary loan work smoother. We get requests for “titles-not-owned” when OCLC indicates we own a journal but we don’t have a specific issue. This brings down our fulfillment rate. That makes us naughty players in the shared resources game. But what are the consequences of that? I’m not quite sure at the moment. Patrons beyond Caltech are important to us, absolutely. Yet they fall lower in our priority queue than Caltech faculty, staff, and students. When resources are limited we focus on projects with the biggest payoffs for our primary user group.
There are other good reasons for standardizing. Machines manipulate standardized data better. It’s a metadata truism. Let’s ignore the real-world issues with interoperability that have been demonstrated over the years. Those are really a result of human factors. We all know that standardized data is not truly standardized. See Naomi Dushay and Diane Hillmann’s excellent identification of problems encountered in sharing Dublin Core records. But let’s live in an ideal world for a minute and say that we did get our data nice and clean and in a standardized format. All of a sudden we would have the means to re-use our data outside of our ILS. Theoretically at least. Much depends on the export capacity of our ILS.
It would be lovely if we could better automate maintenance of coverage ranges within our OpenURL resolver, for example. I’m sure there are more rationales for holdings standardization that I haven’t thought about. I’ve begun reviewing the literature. We can’t make a decision to do a large conversion project based on all of these feel-good reasons, however. The business case relies upon multiple factors: the state of our current data, the capacities of our ILS, the interoperability of our ILS and OCLC, and our staffing and budgetary resources. All of these need thorough analysis. So we’re holding on holdings at present while we gather information and ask hard questions. Ultimately it comes down to answering the question, will the payoff be worth the investment? Stay tuned.