Speaking with authority

What we wanted: better false-positive filtering, plus IDs for geognames in addition to corpnames and persnames.

How to get there: VIAF API, a bit of web scraping, and some more refined fuzzy string matching. Here's how we did it:

A quick intro to the VIAF API

OCLC offers a number of programmatic access points into VIAF's data, all of which you can see and interactively explore here. Since we're essentially doing a plain-text search across the VIAF database, the SRU search API seemed to be what we were looking for. Here is what an SRU search query might look like:

http://viaf.org/viaf/search?query=[search index]+[match type]+[search query]&sortKeys=[what to sort by]&httpAccept=[data format to return]

Or, split into its parts:

http://viaf.org/viaf/search
    ?query=[search index]+[match type]+[search query]
    &sortKeys=[what to sort by]
    &httpAccept=[data format to return]

There are a number of other parameters that can be assigned; this document gives a detailed overview of what each field is and what values it can hold. Here is a much-condensed version:

  1. Search query: how and where to find the requested data. This is itself made up of three parts:
    1. Search index: what index to search through. Relevant options for our project are:
      • local.corporateNames: corporation names
      • local.geographicNames: geographic locations
      • local.personalNames: names of people
      • local.sources: which authority source to search through. "lc" for Library of Congress.
    2. Match type: how to match the items in the search query to the indicated search index, e.g. exact ("="), any of the terms in the query ("any"), all of the terms ("all"), etc.
    3. Search query: the text to search for
  2. Sort keys: what to sort the results by. At the moment, VIAF can only sort by holdings count ("holdingscount").
  3. httpAccept: what data format to return the results in. We want the XML version ("application/xml").

The neat thing about this is that the available indexes match very nicely with the controlaccess types we're trying to retrieve authority data for: corpnames, geognames, and persnames. For example, if we wanted to search for Jane Austen:

http://viaf.org/viaf/search
    ?query=local.personalNames+all+"Jane Austen"
    &sortKeys=holdingscount
    &httpAccept=application/xml

Workflow:

  1. Query VIAF with the given term
  2. If there's a match, grab the LC authority ID
  3. Use the LC web address to grab the authoritative version of the entity's name
  4. Compare the original entity string to the returned LC value. If the comparison fails, we treat the result as a false positive.

First, we wrote an interface to the VIAF API in Python, using the built-in urllib2 library to make the web requests and lxml to parse the returned XML metadata. That code looked something like this:
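
What follows is a rough stand-in rather than our original code. It assumes Python 3 (urllib.request in place of urllib2) and uses the standard library's xml.etree.ElementTree in place of lxml so that it runs without extra installs. The helper names, and the assumption that each search hit sits in an element whose local name is "record", are ours and should be checked against a real response.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def fetch_xml(url, opener=urlopen, timeout=30):
    """Request a URL and parse the response body as XML.

    `opener` is injectable so the function can be exercised offline.
    """
    with opener(url, timeout=timeout) as response:
        return ET.fromstring(response.read())


def local_name(element):
    """Return an element's tag with any "{namespace}" prefix removed."""
    return element.tag.rsplit("}", 1)[-1]


def find_records(root):
    """Collect every element named "record", whatever its namespace.

    Assumption: one such element per search hit, in result order.
    """
    return [el for el in root.iter() if local_name(el) == "record"]
```

Matching on local names sidesteps having to hard-code namespace URIs, at the cost of being less strict about where in the tree an element appears.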

With the VIAF search results in hand, we grabbed the address and metadata for the first (and presumably most relevant) result from the returned XML. All sorts of interesting stuff can be found in that data, but for our immediate purposes we were only interested in the Library of Congress ID:
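
A sketch of that extraction step, again with the standard library standing in for lxml. It assumes each record lists its contributing authority files in "source" elements whose text looks like "LC|n  12345678" (a source code, a pipe, then the ID). That layout is our recollection of the VIAF XML rather than something documented above, and the IDs in the usage lines are made up.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Return an element's tag with any "{namespace}" prefix removed."""
    return element.tag.rsplit("}", 1)[-1]


def extract_lc_id(record):
    """Return the LC ID found in a record, or None if there is none.

    Assumed layout: <source>LC|n  12345678</source>.
    """
    for el in record.iter():
        if local_name(el) != "source" or not el.text:
            continue
        scheme, _, identifier = el.text.partition("|")
        if scheme.strip().upper() == "LC" and identifier.strip():
            # Drop internal whitespace so "n  12345678" becomes "n12345678".
            return "".join(identifier.split())
    return None


# Made-up record for illustration only.
record = ET.fromstring(
    "<record><sources>"
    "<source>DNB|000000000</source>"
    "<source>LC|n  12345678</source>"
    "</sources></record>"
)
print(extract_lc_id(record))
```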

Now that we had the LC auth ID, we could query the Library of Congress site to grab the authoritative version of the term's name. Here we used BeautifulSoup:
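
As a hedged illustration of the scraping step, the sketch below uses the standard library's html.parser instead of BeautifulSoup so it has no dependencies. Two things here are assumptions on our part: that name authority pages live at http://id.loc.gov/authorities/names/[id].html, and that the page's title element carries the authorized form of the name (possibly followed by site text that would need trimming). Check both against a real page before relying on them.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first-level <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def lc_authority_url(lc_id):
    """Build the (assumed) address of an LC name authority page."""
    return "http://id.loc.gov/authorities/names/{}.html".format(lc_id)


def extract_title(html_text):
    """Return the page title with whitespace collapsed."""
    parser = TitleParser()
    parser.feed(html_text)
    return " ".join("".join(parser.title_parts).split())
```

With BeautifulSoup installed, the equivalent lookup is a one-liner on the parsed page; the hand-rolled parser is only here to keep the example self-contained.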

Now we had a CSV file with four data points for every term: our original name, an unvetted LC ID number and name, and the type of controlaccess term the item belongs to. As before, there were a number of obvious false positives, but there were far too many terms for us to check them individually. As Max hinted at earlier in his post, this was fuzzywuzzy's time to shine.
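
To give a flavor of the kind of check involved, here is a small sketch that uses difflib from the standard library as a stand-in for fuzzywuzzy. It is not our actual comparison: the normalization (lowercase, letters only, tokens sorted so that "Jane Austen" and "Austen, Jane" line up) and the 0.8 threshold are illustrative assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, keep only runs of letters, and sort the tokens."""
    tokens = re.findall(r"[^\W\d_]+", name.lower())
    return " ".join(sorted(tokens))


def similarity(a, b):
    """Similarity between 0.0 and 1.0 of two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_probable_match(original, lc_name, threshold=0.8):
    """Treat anything below the (hypothetical) threshold as a false positive."""
    return similarity(original, lc_name) >= threshold


print(is_probable_match("Jane Austen", "Austen, Jane, 1775-1817"))
print(is_probable_match("Stevie Wonder", "Jackson, Michael, 1958-2009"))
```

Dropping digits keeps life dates in the LC heading from dragging the score down, but it also means two people with the same name and different dates compare as identical, so a real check would want to handle dates separately.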

[to write]

[bit about updating persnames with new death-dates]

Weaknesses:

Banks entirely on the assumption that the first VIAF search result is the correct one. Our fuzzy comparisons do a lot to mitigate this, but since VIAF seems to sort its results by number of ____ rather than by exact match, this occasionally led to hilarious results. See: Michael Jackson for Stevie Wonder, or Turkey for Texas. The correct items do appear somewhere in the returned search results, just not at the top position.
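
One possible mitigation, sketched here as an idea rather than something from our workflow, would be to score every returned heading against the original string and keep the best one instead of the first. The sketch assumes the candidate headings have already been pulled out of the search results as plain strings; the scoring and the 0.8 cutoff are illustrative, with difflib again standing in for fuzzywuzzy.

```python
import re
from difflib import SequenceMatcher


def _key(name):
    """Lowercase, letters only, tokens sorted (order-insensitive key)."""
    return " ".join(sorted(re.findall(r"[^\W\d_]+", name.lower())))


def best_candidate(original, candidates, threshold=0.8):
    """Return the closest candidate heading, or None if none is close enough."""
    best_score, best_name = 0.0, None
    for name in candidates:
        score = SequenceMatcher(None, _key(original), _key(name)).ratio()
        if score > best_score:
            best_score, best_name = score, name
    return best_name if best_score >= threshold else None


print(best_candidate("Stevie Wonder", ["Jackson, Michael, 1958-2009", "Wonder, Stevie"]))
```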

Slow: since we don't want to inadvertently DDoS any of the sites we're querying, we needed to set a delay between requests. When running checks against more than 10,000 items, even a one-second delay adds up.
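
A minimal sketch of that kind of delay, along with the arithmetic behind "adds up". The Throttle class and the figure of two requests per item (one to VIAF, one to LC) are assumptions for illustration, not a description of our script.

```python
import time


class Throttle:
    """Ensure at least `delay_seconds` pass between successive calls to wait()."""

    def __init__(self, delay_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay_seconds
        self.clock = clock
        self.sleep = sleep
        self._last = None

    def wait(self):
        """Sleep only for whatever part of the delay has not yet elapsed."""
        if self._last is not None:
            remaining = self.delay - (self.clock() - self._last)
            if remaining > 0:
                self.sleep(remaining)
        self._last = self.clock()


def minimum_runtime_hours(items, requests_per_item=2, delay_seconds=1.0):
    """Lower bound on runtime from the delays alone, ignoring network time."""
    return items * requests_per_item * delay_seconds / 3600.0


# 10,000 items at two requests each and a one-second delay:
print(round(minimum_runtime_hours(10000), 1))  # 5.6 (hours)
```

Calling wait() immediately before each request keeps the pacing in one place, and because it only sleeps for the time remaining, slow responses don't get penalized twice.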