Digital Collections As Data

Ana Lucic, Digital Scholarship Librarian at DePaul and co-PI, writes:

Historically, circulation data and the demographic characteristics of library branch patrons have been used to analyze reading patterns and behaviors of different audiences. To date, our “Reading Chicago Reading” project has taken advantage of the circulation data we received from the Chicago Public Library and from the American Community Survey to shed light on the reception of different OBOC selections throughout diverse Chicago neighborhoods. See e.g. our earlier post.

We are also interested in the text characteristics of the chosen OBOC books: stylistic features of the OBOC choices, and those features’ relations to other books in the library system. How are OBOC selections different from or similar to other books that circulate throughout the Chicago Public Library system? This kind of question, however, is not easy to answer in isolation. Analyzing trends of this kind requires a larger collection of literature, which is why we have been keen to start working with the recently released HathiTrust Extracted Features dataset.

The HathiTrust digital library is a notable example of a digital collection amenable to computational analysis, the theme of a workshop at the upcoming DH2017 conference that will take place in Montreal, Canada from August 8-11 (where, by the way, members of the “Reading Chicago Reading” project will present). In Hathi, not only have the works of literature been scanned and made available to users but the content of the books in the digital library has been turned into data that can be used, reused, and merged with other types of data to be analyzed in aggregate.

Of particular interest to “Reading Chicago Reading” is how the OBOC selections compare to other books in the library system. The HathiTrust extracted features dataset includes features extracted from millions of books in the HathiTrust digital library, making it a perfect resource for our comparative needs. And yet, the path to fulfilling this particular information need has not been obstacle-free.

I will dwell here on one particular issue we have experienced: the nature of interaction with the extracted features dataset and the ability to establish whether a specified set of books is available in the dataset. Like more traditional library and information resources that are geared towards the known-item search, the HathiTrust extracted features dataset requires a particular HathiTrust volume id to establish whether the features from a particular book are included in the dataset. We have found, however, that establishing whether a set of works is available through the HTRC is less straightforward than such a known-item lookup.

To give a concrete example, we have recently been trying to establish whether twenty-seven fiction and non-fiction works recommended by CPL in association with the OBOC 2015 selection, Thomas Dyja’s The Third Coast, are in the HathiTrust extracted features dataset. The problem with this seemingly simple query is that any English-language edition of those twenty-seven works is of interest to us. Put differently, we are not looking for a particular edition of a given work; we are interested in any English-language book edition of that work. Although the individual differences between editions matter a lot to book historians and literary critics, when approaching books as data for our city-scale purposes, these individual differences pale in significance. People around the city will read the same text, but in any number of editions and formats.
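The "any English-language edition" criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Edition` record, the language code `"eng"`, and the volume ids below are hypothetical placeholders, not real HathiTrust identifiers, and the set of available ids is assumed to have been obtained separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edition:
    """One edition of a work (hypothetical record layout)."""
    work: str       # label for the work the edition belongs to
    language: str   # language code; "eng" is assumed to mean English
    volume_id: str  # placeholder volume id, not a real HathiTrust id


def english_coverage(editions, available_volume_ids):
    """Map each work to its English-language volume ids found in the dataset.

    A work with an empty list has no English edition in the dataset.
    """
    found = {}
    for edition in editions:
        # Register every work, even if none of its editions qualify.
        found.setdefault(edition.work, [])
        if edition.language == "eng" and edition.volume_id in available_volume_ids:
            found[edition.work].append(edition.volume_id)
    return found


# Made-up sample data for illustration only.
editions = [
    Edition("The Adventures of Augie March", "eng", "vol.0001"),
    Edition("The Adventures of Augie March", "eng", "vol.0002"),
    Edition("The Adventures of Augie March", "fre", "vol.0003"),
    Edition("The Third Coast", "eng", "vol.0004"),
]
available = {"vol.0002", "vol.0003"}

coverage = english_coverage(editions, available)
covered_works = [work for work, ids in coverage.items() if ids]
```

Here a work counts as covered as soon as one of its English-language editions appears in the set of available ids, regardless of which edition it is.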

And this brings us to the gist of the problem: verifying whether any edition of a particular work (in the sense of FRBR, the Functional Requirements for Bibliographic Records) is in the Hathi dataset implies having access to and being able to retrieve all the work ids (for example, OCLC work identifiers, or owis) and all the manifestations (again in the FRBR sense of the word) of that work, and, in addition, establishing the HathiTrust volume id for each manifestation. Granted, HathiTrust provides an API search of their extracted features dataset using author and title fields; however, this search does not guarantee that all the author name and title variants have been covered.

This is an open research question. If you have suggestions, please let us know.

On May 11, at the ALCTS Exchange, we will present a workflow that we have constructed with the help of Tami Luedtke, Technical Services Coordinator at DePaul University Library, predictive analytics graduate student Dan Aasland, and programmer Jeremy Ervin. The workflow begins with a known work, for example, The Adventures of Augie March by Saul Bellow, obtains all the OCLC owis for the work, transforms them into individual manifestations, and finally maps those manifestations to HathiTrust volume ids, which can then serve as input for querying the extracted features dataset. The PowerPoint presentation slides will be posted after the ALCTS Exchange. Stay tuned for more information.
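For readers who think in code, the shape of that workflow (work, then owis, then manifestations, then volume ids) can be sketched as a chain of lookups. This is not the project's actual implementation: the three lookup tables stand in for whatever services or files supply these mappings, and every identifier below is a made-up placeholder.

```python
def volume_ids_for_work(title, author, work_to_owis,
                        owi_to_manifestations, manifestation_to_volumes):
    """Resolve a known work to candidate volume ids via three lookups.

    All three mappings are hypothetical stand-ins:
      work_to_owis:             (title, author) -> list of work ids
      owi_to_manifestations:    work id -> list of manifestation ids
      manifestation_to_volumes: manifestation id -> list of volume ids
    """
    volume_ids = []
    seen = set()
    for owi in work_to_owis.get((title, author), []):
        for manifestation in owi_to_manifestations.get(owi, []):
            for volume_id in manifestation_to_volumes.get(manifestation, []):
                # The same volume can be reached through several work ids.
                if volume_id not in seen:
                    seen.add(volume_id)
                    volume_ids.append(volume_id)
    return volume_ids


# Made-up identifiers for illustration only.
work_to_owis = {
    ("The Adventures of Augie March", "Saul Bellow"): ["owi-1", "owi-2"],
}
owi_to_manifestations = {
    "owi-1": ["m-1", "m-2"],
    "owi-2": ["m-2", "m-3"],
}
manifestation_to_volumes = {
    "m-1": ["vol.a"],
    "m-2": ["vol.b"],
    "m-3": [],  # a manifestation with no digitized volume
}

candidates = volume_ids_for_work(
    "The Adventures of Augie March", "Saul Bellow",
    work_to_owis, owi_to_manifestations, manifestation_to_volumes,
)
```

The resulting list of candidate volume ids is what would then be checked against the extracted features dataset; a work that resolves to an empty list has no known volume to look up.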

