HathiTrust Research Center
Core Software Components
The current architecture of the HathiTrust Research Center cyberinfrastructure consists of a number of core software components.
HTRC Ingest Service
The Ingest Service is an internal service. Its main responsibility is to periodically fetch corpus data from HathiTrust, process the data, and push it to our repository. The Ingest Service runs in three main stages: rsync, ingestion, and verification.
HTRC gets its volume data from HathiTrust in Michigan. HathiTrust runs specialized hardware with the Isilon file system and organizes the corpus data on disk using the Pairtree structure. HathiTrust also runs an rsync service through which we get the data as well as the periodic updates. Rsync updates data efficiently by computing the differences between two copies and sending only those differences.
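Because each sync transfers only differences, the rsync run itself can report which paths were added or removed. The following is a minimal sketch of reading such a report, assuming rsync was run with its `--itemize-changes` and `--delete` options; the flags and log format actually used by the Ingest Service are not stated here, and the function name and sample paths are hypothetical.

```python
def classify_changes(lines):
    """Split rsync --itemize-changes output into added and deleted paths.

    Assumption: new items carry an itemize code whose attribute flags are
    all '+', e.g. '>f+++++++++' (file) or 'cd+++++++++' (directory), and
    deletions are reported as '*deleting   <path>'.
    """
    added, deleted = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("*deleting"):
            deleted.append(line[len("*deleting"):].strip())
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # not an itemized line (e.g. a summary line)
        code, path = parts
        # Every attribute flag is '+' only for newly created items;
        # files that were merely updated are ignored in this sketch.
        if len(code) > 2 and set(code[2:]) == {"+"}:
            added.append(path)
    return added, deleted


if __name__ == "__main__":
    # Hypothetical sample output, for illustration only.
    sample = [
        ">f+++++++++ ns/pairtree_root/ab/cd/abcd/abcd.zip",
        "*deleting   ns/pairtree_root/ef/gh/efgh/efgh.zip",
    ]
    print(classify_changes(sample))
```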
Because the Pairtree structure demands specialized and highly optimized hardware, HTRC does not serve data directly from the Pairtree. The Pairtree copy is kept solely for rsync.
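To illustrate how a Pairtree lays out a volume on disk, here is a simplified sketch of the identifier-to-path mapping. It assumes volume identifiers of the form `namespace.localid` and a `pairtree_root` directory under each namespace; it applies only the three single-character substitutions from the Pairtree specification and omits the hex-escaping of rarer characters. The function name and the sample identifier are hypothetical.

```python
def pairtree_path(volume_id):
    """Map a volume identifier to a Pairtree directory path (simplified).

    Assumption: the identifier is 'namespace.localid'. Only the
    '/' -> '=', ':' -> '+', '.' -> ',' substitutions are applied;
    hex-escaping of other special characters is not handled.
    """
    namespace, local_id = volume_id.split(".", 1)
    cleaned = local_id.replace("/", "=").replace(":", "+").replace(".", ",")
    # Split the cleaned identifier into two-character directory names.
    segments = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join([namespace, "pairtree_root", *segments, cleaned])


if __name__ == "__main__":
    # Hypothetical identifier, for illustration only.
    print(pairtree_path("ns.39015012345678"))
```

The deep, many-directory layout this produces is one reason the structure is better suited to synchronization than to serving page reads.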
HTRC uses Apache Cassandra to store and serve corpus data. After the Ingest Service has synced the data from HathiTrust, it moves on to the ingestion stage. It first looks at the output generated during the rsync stage, which lists the files and directories that have been added or deleted. For those that have been deleted, the ingest code removes them from Cassandra as well. For those that are newly added, the ingest code first locates the structural metadata file, which is in METS XML format, and parses it to learn the structure of each volume. It then reads the volume data file (a ZIP file containing all pages as text files) and pushes the page content into Cassandra in the correct structural order. Finally, it updates Cassandra with critical metadata such as the copyright of the volume, the number of pages in the volume, and the size and checksum of each page.
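The ingestion of a newly added volume can be sketched roughly as follows. This is an illustration only, not the service's code: it assumes the page order has already been extracted from the METS file as a list of ZIP member names, assumes UTF-8 page text and MD5 checksums (the source names neither), and returns plain dictionaries in place of Cassandra writes. All names are hypothetical.

```python
import hashlib
import io
import zipfile


def build_page_records(zip_bytes, ordered_names):
    """Read pages from a volume ZIP in structural order.

    ordered_names: ZIP member names in the order given by the volume's
    structural metadata (assumed to be parsed from METS elsewhere).
    Returns (volume_metadata, page_records); a real service would write
    these to its data store instead of returning them.
    """
    records = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for sequence, name in enumerate(ordered_names, start=1):
            content = archive.read(name)
            records.append({
                "sequence": sequence,
                "name": name,
                "text": content.decode("utf-8"),  # assumed encoding
                "size": len(content),             # size in bytes
                "checksum": hashlib.md5(content).hexdigest(),  # assumed MD5
            })
    volume_metadata = {"page_count": len(records)}
    return volume_metadata, records


if __name__ == "__main__":
    # Build a tiny in-memory ZIP whose members are stored out of order.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("00000002.txt", "second page")
        archive.writestr("00000001.txt", "first page")
    meta, pages = build_page_records(
        buffer.getvalue(), ["00000001.txt", "00000002.txt"]
    )
    print(meta, [page["name"] for page in pages])
```

Taking the order from the structural metadata rather than from the ZIP listing matters because the archive's member order is not guaranteed to match the reading order of the volume.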
After the data has been pushed into Cassandra, the Ingest Service may optionally perform data verification to make sure the data is correct and has been ingested properly. Verification can be run on the newly added data only or on all data in Cassandra. There are three levels of verification. The first and quickest checks that each volume has the same number of pages as its metadata claims. The second checks that the size of each page matches the metadata. The third and most thorough checks that each page hashes to the correct checksum.
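The three verification levels can be sketched as a single function. This is a hedged illustration: it assumes the levels are cumulative (each level also runs the cheaper checks), assumes MD5 checksums, and works on in-memory page bytes rather than on Cassandra rows. The function name and metadata layout are hypothetical.

```python
import hashlib


def verify_volume(metadata, pages, level=1):
    """Return a list of problems found for one volume.

    metadata: {"page_count": int,
               "pages": [{"size": int, "checksum": str}, ...]}
    pages:    list of page contents as bytes, in structural order.
    level:    1 = page count, 2 = + page sizes, 3 = + page checksums.
    """
    problems = []
    # Level 1: the volume has as many pages as its metadata claims.
    if len(pages) != metadata["page_count"]:
        problems.append(
            f"expected {metadata['page_count']} pages, found {len(pages)}"
        )
    if level >= 2:
        paired = zip(metadata["pages"], pages)
        for number, (expected, content) in enumerate(paired, start=1):
            # Level 2: each page has the recorded size.
            if len(content) != expected["size"]:
                problems.append(f"page {number}: size mismatch")
            # Level 3: each page hashes to the recorded checksum.
            elif (level >= 3
                  and hashlib.md5(content).hexdigest() != expected["checksum"]):
                problems.append(f"page {number}: checksum mismatch")
    return problems
```

The ordering reflects cost: counting pages needs no page content, size checks need only lengths, and checksum checks must read and hash every byte.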
Code and Manuals
The code for the Ingest Service is available in our SVN repository on bitternut. Because it is an internal service that retrieves digitized text, some of which may be copyrighted or restricted, it is not publicly released.
The HTRC Ingest Service Manual shows how to install and configure the service, as well as how to use some of the tools.