What is big, strong, secure, and has a very long memory? If you answered 'elephant', you’re not far off. The answer is the HathiTrust Research Center (HTRC). Hathi (pronounced: hottie) is the Hindi word for elephant, and like an elephant, the HTRC is very large with an exceptional memory — and is about to get a whole lot bigger.
Partnering with nearly 100 research libraries from around the world, HathiTrust holds about 595 terabytes of digitized textual data, the equivalent of about 157 miles, or 10,000 tons, of text. In 2010, HathiTrust launched the HTRC to help researchers around the world accomplish tera-scale data mining and textual analysis. The HTRC is a collaborative effort among Indiana University; the University of Illinois, Urbana-Champaign (UIUC); and the University of Michigan.
Until recently, the HTRC had access to less than a third of the full HathiTrust repository. That all changed this year, and now the HTRC is working with the University of Michigan to enable analysis of the entire 5 billion pages of textual data in the HathiTrust repository.
“This will be the first time that a researcher could analyze, as data, a collection that is equivalent to some of the largest research libraries in the world,” says Robert McDonald, associate dean of libraries at Indiana University.
This expansion poses a new challenge for the HTRC. Most of the texts in the HathiTrust remain under copyright, so one of the chief HTRC goals is to ensure non-consumptive research access to these protected works. This requirement has led the HTRC to create the Secure HathiTrust Analytics Research Commons (SHARC), a secure framework for researcher access to restricted content.
Non-consumptive research is limited to computational analysis: looking for patterns and key passages without a person reading the actual text in question. Researchers can query the large data sets within the HTRC framework without violating copyright law.
To preserve scientific access to this data, a distinction is made between human users and proxy, or computational, users. With the security provided by SHARC, users and copyright holders alike are protected from the sting of copyright infringement: intellectual property can’t be reassembled, so it’s impossible to steal.
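The idea of analysis without reading can be illustrated with a minimal Python sketch. It is not the HTRC's or SHARC's actual implementation; the function names `page_features` and `volume_features` and the sample text are hypothetical. The sketch assumes one simple approach: reduce each page to word counts and discard word order, so a researcher receives statistics rather than readable prose.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase words, optionally with an apostrophe.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def page_features(page_text):
    """Return word counts for one page, sorted alphabetically.

    Sorting the keys discards the original word order, so the
    running text is not returned to the caller.
    """
    tokens = TOKEN_RE.findall(page_text.lower())
    return dict(sorted(Counter(tokens).items()))


def volume_features(pages):
    """Apply page_features to every page of a (hypothetical) volume."""
    return [page_features(page) for page in pages]


if __name__ == "__main__":
    # Invented sample text, used only to show the shape of the output.
    sample_volume = [
        "The elephant never forgets. The elephant remembers.",
        "A long memory is a useful thing.",
    ]
    for number, features in enumerate(volume_features(sample_volume), start=1):
        print(number, features)
```

Two pages containing the same words in a different order produce identical features, which is the point of the exercise. Counts alone are only a starting assumption here; a production system would likely need further safeguards, since very short passages can sometimes be guessed from their vocabulary.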
“We are the research arm of the HathiTrust,” says HTRC co-director J. Stephen Downie. “We provide research services, new ways of archiving these materials, and new ways of discovering what’s in the collection.” In other words, the HTRC does the computational heavy lifting for researchers, offering the software tools and secure cyberinfrastructure needed for high-performance computing (HPC) textual analysis.
For instance, Vernon Burton’s textual analysis of evolving US attitudes toward slavery — illuminating southern and non-southern perspectives as seen through decades of antebellum literature — was only possible with support from the HTRC.
Because he was exploring more than one million texts, Burton's task required the advanced computational techniques, algorithms, and XSEDE supercomputing resources the HTRC facilitated. With the HTRC’s assistance, Burton and colleagues even constructed a new tool: simply by typing in a word, a researcher can see the geographic locations of the authors most closely associated with that word’s use.
“This work would not have been possible without the HathiTrust and their expertise using the Palmetto and Blue Waters supercomputers. Because of their help, we are rethinking what we thought we knew about the American South,” says Burton, professor of history and director of the Cyberinstitute at Clemson University (US).
Perhaps the main advantage the HTRC brings is the ability to detect patterns across a large volume of textual data. Some patterns are just too large to be perceived by a human reader, says Ted Underwood, HTRC researcher at UIUC.
“You can observe a pattern in a single work, or in a single author, but when there’s a pattern only visible when you’re comparing hundreds or thousands of books, an individual reader just can’t see that. With the HTRC, we will see patterns that were invisible before.”