HathiTrust Coverage Project

Novel Coverage Project

Introduction

This project grew out of a planned document similarity study that would compare the literary texts of novelists of different races, genders, and genres (focusing on authors of general fiction or speculative fiction).  The objective was to identify which of these factors had the strongest influence on the writing, and we intended to obtain the texts for this research from the HathiTrust database. After querying the database, we noticed coverage gaps for our authors of interest: in some cases, fewer than 50 percent of the expected novels were present.  As a result, we redirected our study to compare the coverage of our selected authors’ novels across four resources: World Cat, IU Cat, the Monroe County Public Library, and HathiTrust. With this study, we want to encourage researchers to investigate the coverage of their categories of interest (such as genre, gender, or date of publication) before conducting analyses, so that they are aware of the potential impact of coverage gaps on their final results and conclusions.  Furthermore, we hope this study leads to collaborative efforts to improve the coverage available to researchers in resources such as HathiTrust.

Authors in the study

In total, forty authors were selected for this study, distributed across the following three categories: Gender (Male/Female), Race (African-American/White), and Genre (Fiction Author/Speculative Fiction Author).  MORE ON HOW AUTHORS WERE CHOSEN

African-American Fiction
Female
– Eleanor Taylor Bland
– Zora Neale Hurston
– Gloria Naylor
– Toni Morrison
– Alice Walker
Male
– James Baldwin
– Charles Chesnutt
– Ernest J. Gaines
– Ishmael Reed*
– Richard Wright
African-American Speculative Fiction
Female
– Octavia Butler
– Tananarive Due*
– Nalo Hopkinson
– N. K. Jemisin
– Nnedi Okorafor
Male
– Steven Barnes
– Samuel Delany*
– Minister Faust
– Victor LaValle*
– Walter Mosley*
White Fiction
Female
– Louisa May Alcott
– Joan Didion
– Nadine Gordimer
– Doris Lessing*
– Iris Murdoch
Male
– J.M. Coetzee
– William Faulkner
– Ernest Hemingway
– Sinclair Lewis
– John Steinbeck
White Speculative Fiction
Female
– Margaret Atwood*
– C. J. Cherryh
– Kameron Hurley
– Ursula K. Le Guin*
– Anne McCaffrey*
Male
– Isaac Asimov*
– Arthur C. Clarke*
– Philip K. Dick*
– Robert Heinlein
– Neal Stephenson*

* This author wrote both general fiction and speculative fiction, but has been classified in the category above because most of their novels and/or their most renowned works belong to that category.

Methods

Generating Lists of Novels

For each author in the study, a list of their novels was generated by querying the NoveList database; these results were supplemented by the Wikipedia bibliographic records for the authors.

Identifying Novel Genres

The genre labels for each title were obtained primarily from the Library of Congress database; if no genre designation was available there for a particular title, we searched the following sources in this order: NoveList, Goodreads, Wikipedia, Amazon, WorldCat.  After the dataset of authors and titles was prepared, the bibliographic records of each of the four databases listed below were queried and parsed to identify what percentage of the target titles each source held in print or e-book form.
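The ordered search described above can be sketched as a simple fallback chain. This is only an illustration under stated assumptions: the function name `find_genre`, the `lookups` mapping, and the toy titles and labels are all hypothetical, and each real source would need its own query code in place of the dictionary lookups used here.

```python
# Search order taken from the text: Library of Congress first, then the fallbacks.
SOURCE_ORDER = [
    "Library of Congress",
    "NoveList",
    "Goodreads",
    "Wikipedia",
    "Amazon",
    "WorldCat",
]


def find_genre(title, lookups):
    """Return (genre, source) from the first source that has a label.

    `lookups` maps a source name to a callable that takes a title and
    returns a genre label or None (hypothetical interface).
    Returns (None, None) if no source has a label.
    """
    for source in SOURCE_ORDER:
        lookup = lookups.get(source)
        if lookup is None:
            continue  # no lookup available for this source
        genre = lookup(title)
        if genre:
            return genre, source
    return None, None


# Toy stand-ins for the real sources (made-up titles and labels).
toy_lookups = {
    "Library of Congress": {"Example Title A": "speculative fiction"}.get,
    "NoveList": {"Example Title B": "general fiction"}.get,
}

for example in ["Example Title A", "Example Title B", "Example Title C"]:
    print(example, "->", find_genre(example, toy_lookups))
```

Recording the source alongside the label makes it possible to see later how many genre designations came from a fallback rather than from the Library of Congress.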

Identifying Birth/Death Dates of Authors

The first resource used for identifying the year of birth (and death, if the author is deceased) was the IU Cat database.  If this information was not available in IU Cat, Wikipedia was used.

Databases Queried

  1. World Cat; date of retrieval: 10-09-2019 (added authors: 11-13-2019)  
  2. IU Cat; date of retrieval: 10-15-2019 (added authors: 11-14-2019)
  3. Monroe County Public Library; date of retrieval: 10-02-2019
  4. HathiTrust; date of retrieval: 09-01-2019

The World Cat and IU Cat records were queried using their respective APIs (namely, https://platform.worldcat.org/api-explorer/apis/wcapi/Bib/OpenSearch and http://iucat-api.uits.iu.edu/?apikey=2f6e2b71-dae0-453d-a580-48da6f6221ca&_=1477406893999).  The Monroe County Public Library supplied its bibliographic records as XML files, and the HathiTrust collection records were obtained from the HathiFiles resource (https://www.hathitrust.org/hathifiles) offered by HathiTrust.  All of these data collections were in either XML or JSON format.  Python scripts were used to parse the data for the authors and titles in our dataset.  To verify that titles were not missed in the parsing process, the unmatched titles were manually checked in the databases.  After parsing the data from these four sources and completing the verification process, the presence or absence of each title in our study was recorded for all four sources.
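The matching and verification steps can be illustrated with a minimal sketch. It assumes the titles have already been extracted from each source's XML or JSON records; the normalization rules, the function names, and the toy titles are hypothetical and are not taken from the project's own scripts.

```python
import re
import unicodedata


def normalize_title(title):
    """Lowercase, strip accents and punctuation, and drop a leading article."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    text = re.sub(r"^(the|a|an) ", "", text.strip())
    return " ".join(text.split())


def presence_table(target_titles, holdings):
    """Map each target title to {source: True/False}.

    `holdings` maps a source name to the titles parsed from its records.
    """
    normalized = {
        source: {normalize_title(t) for t in titles}
        for source, titles in holdings.items()
    }
    return {
        title: {
            source: normalize_title(title) in titles
            for source, titles in normalized.items()
        }
        for title in target_titles
    }


def unmatched_titles(table, source):
    """Titles not found automatically in `source`; candidates for a manual check."""
    return [title for title, row in table.items() if not row[source]]


# Toy data (made-up titles and source names).
targets = ["The Example Novel", "Another Story"]
holdings = {
    "Source X": ["Example novel.", "Another Story: A Novel"],
    "Source Y": ["ANOTHER STORY"],
}
table = presence_table(targets, holdings)
print(table)
print(unmatched_titles(table, "Source X"))
```

In this toy run, "Another Story" is not matched in Source X because the record carries a subtitle. That is the kind of miss a manual check of the unmatched titles is meant to catch.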

Results

We present the results of our study with the following three tools: a filterable table of coverage, an interactive plot of coverage by author, and an animation of the title coverage by source.

Filterable Table of Coverage by Author:

This table summarizes the coverage of each author’s work with both counts and percentages of works held in each of the databases.    
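As a rough illustration of how such counts and percentages can be derived from per-title presence records, consider the sketch below. The data layout, the function name `coverage_summary`, and the toy author, titles, and sources are assumptions made for this example only.

```python
def coverage_summary(presence):
    """Summarize coverage per author and source.

    `presence` maps author -> title -> source -> bool (hypothetical layout).
    Returns author -> source -> {"held", "total", "percent"}.
    """
    summary = {}
    for author, titles in presence.items():
        total = len(titles)
        held_counts = {}
        for held_by in titles.values():
            for source, held in held_by.items():
                held_counts[source] = held_counts.get(source, 0) + int(held)
        summary[author] = {
            source: {
                "held": held,
                "total": total,
                "percent": round(100 * held / total, 1),
            }
            for source, held in held_counts.items()
        }
    return summary


# Toy data: one made-up author with four made-up novels.
toy_presence = {
    "Author A": {
        "Novel 1": {"Source X": True, "Source Y": False},
        "Novel 2": {"Source X": True, "Source Y": False},
        "Novel 3": {"Source X": True, "Source Y": True},
        "Novel 4": {"Source X": False, "Source Y": False},
    }
}
summary = coverage_summary(toy_presence)
for source, stats in summary["Author A"].items():
    print(source, stats)
```

Here Source X holds three of the four novels (75.0 percent) and Source Y holds one (25.0 percent).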

Plot of Coverage Percentage by Author:

The visualization below presents, for each author, the percentage of that author’s novels in the dataset that is held by each of the four sources: World Cat, IU Cat, Monroe County Public Library (MCPL), and HathiTrust.  Each marker represents one author. The larger the marker, the more novels the author has written. The race of the authors is represented by shape (a triangle for African-American authors and a square for white authors), and the genre designation of the authors is represented by the color of the markers.  Female authors are displayed on the left and male authors on the right.

These charts are interactive:

  1. HOVER: Hovering your mouse over a given data point will reveal the author’s name along with the total number of novels and the total number of these titles in the given database.  
  2. FILTER BY AUTHOR: You can filter the charts by author by selecting the colored box next to the author’s name in the right vertical menu; if you would like to select multiple authors at the same time, hold down the shift key (Windows) as you select.
  3. FILTER BY GENRE/RACE: You can also filter by genre and/or race by using the table to the right of the charts; once again, to select multiple categories, hold down the shift key (Windows) as you select.     
  4. NOTE: After applying either the author or genre/race filter, it is recommended that you double-click on the filter to reset it before filtering again.    

Animation of Title Coverage by Gender, Race, and Genre:

In all the slides in the animation below, each marker represents a novel; the circular markers represent general fiction novels and the v-shaped markers represent speculative fiction novels.  The novels are clustered by the author’s race and gender. The first slide in the animation shows the 667 novels in the dataset. The four subsequent slides display which of these novels are held by the four sources reviewed in this analysis, respectively; if a novel is absent from a given source, the marker appears semi-transparent.  This animation can be paused to study the coverage of any of the particular sources.

Observations

As the tools above show, the World Cat database holds all of the target titles with one exception; such high coverage is to be expected from this resource. The table below presents, for each of the four data sources, the number of authors with 50% or less coverage, broken down by gender, genre, and race.

Counts of Authors with 50% or less coverage


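 | GENDER | | GENRE | | | RACE | | TOTAL
SOURCE | Male Authors | Female Authors | Speculative Fiction Authors | General Fiction Authors | Both Genre Authors | African-American Authors | White Authors | Num. of Authors with 50% or less coverage
World Cat | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
IU Cat | 2 | 4 | 5 | 1 | 0 | 4 | 2 | 6
MCPL | 7 | 8 | 4 | 9 | 2 | 10 | 5 | 15
HathiTrust | 5 | 9 | 8 | 1 | 5 | 10 | 4 | 14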

The above table reveals that, among authors with 50% or less coverage, the Monroe County Public Library has the largest total number (15), followed by HathiTrust (14), then IU Cat (6), and World Cat (0).  Regarding gender, the table shows that the HathiTrust collection has the highest number of female authors (9) with 50% or less coverage; HathiTrust also has a greater ratio of female to male authors (9 female: 5 male) with 50% or less coverage than the Monroe County Public Library (8 female: 7 male).  With respect to genre, IU Cat has more speculative fiction authors (5) than general fiction authors (1) with 50% or less coverage; MCPL has more general fiction authors (9) than speculative fiction (4) and both genre (2) authors combined; and HathiTrust has more speculative fiction authors (8) and both genre authors (5) than general fiction authors (1).  Regarding race, the ratio of African-American to White authors with 50% or less coverage is 4:2 in IU Cat, 10:5 in MCPL, and 10:4 in HathiTrust.
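The arithmetic behind these comparisons can be checked with a few lines of Python. The counts below are transcribed from the table above; the dictionary layout and the helper names `is_consistent` and `ratio` are assumptions made only for this sketch.

```python
# Counts of authors with 50% or less coverage, transcribed from the table above.
COUNTS = {
    "World Cat":  {"male": 0, "female": 0, "spec": 0, "general": 0, "both": 0, "aa": 0, "white": 0, "total": 0},
    "IU Cat":     {"male": 2, "female": 4, "spec": 5, "general": 1, "both": 0, "aa": 4, "white": 2, "total": 6},
    "MCPL":       {"male": 7, "female": 8, "spec": 4, "general": 9, "both": 2, "aa": 10, "white": 5, "total": 15},
    "HathiTrust": {"male": 5, "female": 9, "spec": 8, "general": 1, "both": 5, "aa": 10, "white": 4, "total": 14},
}


def is_consistent(row):
    """Each breakdown (gender, genre, race) should add up to the total."""
    return (
        row["male"] + row["female"] == row["total"]
        and row["spec"] + row["general"] + row["both"] == row["total"]
        and row["aa"] + row["white"] == row["total"]
    )


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


for source, row in COUNTS.items():
    print(
        source,
        "consistent:", is_consistent(row),
        "female:male =", ratio(row["female"], row["male"]),
        "African-American:White =", ratio(row["aa"], row["white"]),
    )
```

The female-to-male ratio comes out at 1.8 for HathiTrust against roughly 1.14 for MCPL, and the African-American-to-White ratio is 2.0 for IU Cat and MCPL and 2.5 for HathiTrust.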

Conclusion

This study underscores the importance of understanding the content of a dataset extracted from a database.  Without this investigative step, research may be conducted on an unknowingly biased or incomplete dataset. Automated processes for obtaining data from a database can conveniently provide data for analysis; however, problematic assumptions of completeness often accompany mass data extractions.  Therefore, researchers must be aware of the strengths and weaknesses of their data sources, and additional collaborative efforts should be made to identify and address the current shortcomings of research databases.