Across the spectrum, data has gotten big.
"If you look at the trend, databases are getting bigger and bigger," said Dora Cai, a database architect based at the US National Center for Supercomputing Applications. While 50 gigabytes would have been considered a large database not that long ago, "now we're talking about terabytes and hundreds of terabytes and even petabytes."
The Virtual Worlds Exploratorium and an ongoing census analysis project are two examples of data-intensive research in the humanities that show how NCSA's infrastructure and staff can help researchers address the challenges of big data.
Millions of people around the world play massively multiplayer online role-playing games. And as they play, their every action—each time they fight a dragon, buy or sell armor, talk to another player—is logged by the game, creating a wealth of information about how people interact in these "virtual worlds."
Several years ago, Sony approached researcher Dmitri Williams, then at the University of Illinois and now at the University of Southern California, to see if he could use data gathered from EverQuest II to determine which players were likely to leave the game (and therefore stop paying to play). Williams was also interested in questions about whether in-game behavior correlated with behavior in the real world. Will someone with a violent, aggressive game character be more violent or aggressive in the real world, for example?
Williams teamed with co-principal investigators Marshall Scott Poole (University of Illinois), Noshir Contractor (Northwestern University), and computer scientist Jaideep Srivastava (University of Minnesota) to investigate a massive collection of game log data from EverQuest II and other games—Dragon's Nest and Chevalier's Romance, which are popular in China, and Denmark-based EVE Online. They call their collaboration the Virtual Worlds Exploratorium.
The researchers faced several challenges in working with these data: the logs were massive, messy, and needed to be stored securely.
The data is housed at NCSA because, "they have a lot of experience with large data and with making sure the data is securely handled," Contractor said. And the VWE team tapped Cai to create an organized database from the "messy" collection of log files.
If you aren't a data-focused researcher or computer scientist, you might miss the significance of that crucial step, but a collection of data isn't a useful database until it has been organized, structured, and made queryable. That's where Cai's database expertise came into play.
"We have all these great data, and we can ask loads of questions about interaction in the space," Contractor said.
Some of those questions have addressed group leadership, while others have addressed group formation: Why do people team up with one another in the game? Do groups form based on similarities, complementary differences, proximity, etc.? Leadership is one of the areas of interest to the US Army and Air Force, which have both provided funding for VWE projects.
"This might be the best training ground for the kinds of leaders we will see tomorrow," Contractor said.
The researchers have also studied "illegal" transactions in which players sell currency, items, and even high-level characters to wealthier players who want the perks without putting in hours of game play to earn them. As games try to crack down on this behavior, the illicit sellers and buyers adopt new tricks to conceal their actions. One of Contractor's students, Brian Keegan, along with fellow student Muhammad Ahmad, found that the "illegal" networks in the game employ virtually identical strategies to those used by drug traffickers. The researchers also found that people who engage in illegal conduct in the game are more likely to have real-world criminal records.
A treasure trove of US Census data is released to the public after remaining confidential for 72 years. The standard practice has been for the Census Bureau to create microfilm images of the millions of paper forms. Companies that cater to genealogy buffs, like Ancestry.com, then hire thousands of people to spend months transcribing the microfilm so the data can be searched and sorted online.
But this April the detailed information on the more than 132 million people who lived in the United States in 1940 will be released in digital format. No more microfilm.
The Census Bureau would like to provide something more usable than 3.8 million JPEG images of census forms, but manual transcription is too expensive, and optical character recognition of the handwritten entries is not accurate enough. So NCSA's Image, Spatial, and Data Analysis group, led by Kenton McHenry, has been working for the past year on a prototype framework using content-based image retrieval to allow people to search the census form images directly. The project is supported by the National Archives and Records Administration.
The framework enables a user to input a handwritten query—either using a stylus or by typing a word that will then be rendered in a handwriting font—to search a database of images of handwritten text for potential matches. The system uses a computer vision technique known as word spotting to return the top-ranked results.
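Word spotting generally works by reducing each word image to a compact descriptor and ranking candidates by descriptor distance; column-wise ink profiles are one classic choice. The sketch below is illustrative only—the feature set, bin count, and toy binary images are assumptions, not the NCSA implementation.

```python
import numpy as np

# Illustrative word-spotting descriptor (not NCSA's actual feature set):
# summarize a binarized word image by its upper and lower ink profiles and
# ink density over a fixed number of column blocks, yielding a fixed-length
# vector regardless of word width.

def profile_features(img, bins=16):
    """img: 2D array with 1 = ink, 0 = background."""
    height = img.shape[0]
    blocks = np.array_split(np.arange(img.shape[1]), bins)
    feats = []
    for cols in blocks:
        block = img[:, cols]
        ink_rows = np.where(block.any(axis=1))[0]
        upper = ink_rows[0] if len(ink_rows) else height  # first inked row
        lower = ink_rows[-1] if len(ink_rows) else 0      # last inked row
        feats.extend([upper, lower, block.mean()])
    return np.asarray(feats, dtype=float)

# A toy "word" image and a slightly shifted copy: similar words give nearby
# vectors, so candidates can be ranked by distance to the query's vector.
word = np.zeros((20, 64))
word[6:12, 8:40] = 1
query = np.roll(word, 1, axis=0)   # same word, slightly different placement

f_word, f_query = profile_features(word), profile_features(query)
distance = np.linalg.norm(f_word - f_query)
```

Because the descriptor has fixed length, any two word images can be compared directly with a Euclidean distance, regardless of their original widths.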
While not all results will be perfect matches, the system's users will help improve them over time through a passive form of crowdsourcing. For instance, after searching for "Smith," a user isn't likely to click on results that are not "Smith." The query text entered by the user can be connected to the image results the user selects, allowing the image database to be annotated gradually. Over time, these validated matches can be returned to users directly rather than relying solely on the word spotting technique.
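One simple way to realize that passive annotation is to log query–click pairs and promote a label to "validated" once it has been clicked enough times. The class, names, and threshold below are hypothetical, a minimal sketch rather than the NCSA system:

```python
from collections import defaultdict

# Hypothetical sketch of passive crowdsourcing: each time a user clicks a
# result for a query, record the (image, query) pair; once an image has
# accumulated enough clicks for one label, treat that label as validated.

class ClickAnnotator:
    def __init__(self, threshold=3):
        self.threshold = threshold                           # clicks needed to validate
        self.clicks = defaultdict(lambda: defaultdict(int))  # image -> label -> count
        self.validated = {}                                  # image -> validated label

    def record_click(self, image_id, query_text):
        """A user searched for query_text and clicked image_id."""
        self.clicks[image_id][query_text] += 1
        label, count = max(self.clicks[image_id].items(), key=lambda kv: kv[1])
        if count >= self.threshold:
            self.validated[image_id] = label

    def lookup(self, query_text):
        """Return validated matches; a real system would fall back to word spotting."""
        return [img for img, lbl in self.validated.items() if lbl == query_text]

annotator = ClickAnnotator(threshold=2)
annotator.record_click("cell_0042", "Smith")
annotator.record_click("cell_0042", "Smith")   # second click validates the label
annotator.record_click("cell_0099", "Jones")   # one click: not yet validated
print(annotator.lookup("Smith"))               # ['cell_0042']
print(annotator.lookup("Jones"))               # []
```

The threshold guards against a single stray click permanently mislabeling an image; a production system would likely weight clicks by user reliability as well.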
A significant amount of computation is required to preprocess the data for the planned word spotting and passive crowdsourcing. The first step is to split the spreadsheet-like census forms into individual data cells by finding the form lines and fitting a template over the images. Next, each extracted cell must be converted into a numerical feature vector that roughly represents the handwritten contents of that image. A word spotting technique compares the feature vector of the search query (such as a name like Smith) to the feature vectors of the many cells, looking for similarities. Searching all 70 billion cell images directly would be excessively time-consuming and computationally expensive, so a third step groups similar feature vectors and builds a hierarchy over the data to narrow the search space and return results with reasonable speed.
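The second and third steps can be sketched on synthetic data. The toy feature vectors, the use of k-means, and the one-level "hierarchy" below are illustrative assumptions; the real system derives its vectors from the cell images and its grouping details are not specified here.

```python
import numpy as np

# Sketch of narrowing a word-spotting search with a grouping step:
# instead of ranking every cell vector against the query, find the nearest
# cluster first and rank only its members.

rng = np.random.default_rng(0)
cells = rng.random((1000, 16))   # step 2 stand-in: one feature vector per cell

def kmeans(data, k=8, iters=20):
    """Step 3: group similar vectors so a query searches one group, not all cells."""
    centers = data[rng.choice(len(data), k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(data[:, None] - centers[None], axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    # final assignment against the final centers
    labels = np.argmin(np.linalg.norm(data[:, None] - centers[None], axis=2), axis=1)
    return centers, labels

centers, labels = kmeans(cells)

def word_spot(query_vec, top_n=5):
    """Rank only the members of the cluster nearest to the query."""
    c = np.argmin(np.linalg.norm(centers - query_vec, axis=1))
    members = np.where(labels == c)[0]
    dists = np.linalg.norm(cells[members] - query_vec, axis=1)
    return members[np.argsort(dists)[:top_n]]

# A query vector close to cell 123 (e.g. the same word in similar handwriting)
# should surface cell 123 near the top of the results.
query = cells[123] + rng.normal(0.0, 0.01, size=16)
results = word_spot(query)
```

With k groups, each query compares against roughly 1/k of the collection; a deeper hierarchy (clusters of clusters) shrinks the search space further, which is what makes querying 70 billion cells tractable.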
The team is using an XSEDE start-up allocation to develop its system. An XSEDE Extended Collaborative Support Services team, led by NCSA's Jay Alameda, has helped the group get optimal performance out of its code, assisting with mapping processes to hardware and with I/O issues. The team has applied through XSEDE for 2 million CPU hours to process the 1940 census records.
A version of this story first appeared on the NCSA website.