Feature - Many millions of manuscripts: data mining and digitized objects
James Evans, Assistant Professor in Sociology at the University of Chicago, is a Teraport regular, routinely occupying up to 30 processors at a time for his work on citation network analysis.
Crunching through hundreds of CPU hours, Evans identifies patterns of interaction between universities and the biotechnology industry, using Teraport to compare the citations of every article with those of every other article in his database—more than 25 million citations.
In work that requires even more computing power, Evans also analyzes the relationships between authors and organizations producing these documents, and the words within them, to identify the scientific subfields they address.
Distributed computing was possible at his doctoral institution, Evans said, “but the computers on the network were of different sizes, had slightly different software and sometimes different operating systems.”
Teraport offers a uniform operating system and software that, combined with other features, has in some cases saved Evans months of computing time, he said.
Ten million digital books
The growing volume of digitized books and text poses massive challenges and opportunities for analysis. The University of Chicago has joined a consortium of twelve universities working to digitize up to ten million books as part of the Google Book Search Project.
“In digital humanities we will be facing massive amounts of textual material in the next three or four years,” said Mark Olsen, Assistant Director of the Project for American and French Research on the Treasury of the French Language (ARTFL).
“There are a number of teams, including the ARTFL Project, which are ramping up to adopt machine-learning technologies on how to handle a million books.”
Olsen said that although the amount of computer power required by ARTFL’s projects is probably tiny compared to projects from the sciences, it is nevertheless critical to have access to this power.
“Even small tests on our highest-power machines would take 15 or 20 hours to run. These kinds of runs are much faster on the Teraport,” Olsen said. “It extends our capabilities quite a bit.”
- Steve Koppes, University of Chicago