
In 1973, a fire broke out at the National Personnel Records Center in St. Louis, Missouri, US, destroying 16 to 18 million military service records dating from 1912 to 1964. Had the records been digitized, the information would have survived the fire, though it would not necessarily have been any more accessible.
Scanned portable document format (PDF) images, the product of low-cost, high-speed digitization, can easily be duplicated and stored in multiple locations. Digitally searching within them, however, is nearly impossible: locating anything of interest requires human eyes to read through the handwritten text. That makes scanning alone impractical, especially considering that the 1940 US census, for example, consists of 3.6 million PDF images.
Commercial services, like Ancestry.com, employ thousands of workers who manually extract the meaning of a small, profitable subset of images, notes Kenton McHenry of the Image and Spatial Data Analysis Division (ISDA) at the National Center for Supercomputing Applications (NCSA) in Urbana-Champaign, Illinois, US. Government agencies, on the other hand, don't have the resources necessary to make most images accessible in this way. The danger, says McHenry, is that scanned images may become a digital graveyard for many historically, culturally, and scientifically valuable documents.
NCSA research programmer Liana Diesendruck, McHenry, and colleagues have turned to a number of XSEDE resources to crack this formidable problem. Using the 1940 census as a test case, they’ve created a framework for automatically extracting meaning from the images — in essence, teaching machines to read cursive script. The team employed the Steele supercomputer at Purdue University in Indiana, US, and Ember at NCSA, for much of their initial data processing. NCSA's Mass Storage System (MSS) held the initial data set, which moved to the University of Illinois campus cluster when the MSS retired.
In their most recent work, the researchers used Blacklight at the Pittsburgh Supercomputing Center (PSC), in Pennsylvania, US. The supercomputer’s large shared memory has furthered the group’s search system alpha testing. The group has also begun storing data on PSC's Data Supercell. Throughout, XSEDE's Extended Collaborative Support Service helped optimize the performance of the various resources.
Teaching machines to read
"Before we could even think about extracting information, we had to do a lot of image processing," says Diesendruck. Misalignments, smudges, and tears in the paper records had to be cleaned up first. The difficulty of the cleanup process paled, however, in comparison to the task of getting the computer to understand the handwritten text.
It's relatively simple to get a computer to understand text that has been created electronically. It knows, for example, that an ‘A’ is an ‘A,’ and that the word ‘address’ refers to a location. By contrast, many different people with different styles of cursive script and handwriting contributed to early census entries. These entries can be difficult for humans to read, let alone machines.
Having the computer deconstruct each handwritten word, letter by letter, is impossible using today's technology. Rather than have the computer try to read the words, the investigators had it analyze them statistically. Key measurements, such as the height of a capital ‘I,’ the width of a loop in a cursive ‘d,’ and the angle at which the letters slant from vertical, all go into a 30-dimensional vector. These measurements constitute a kind of address that the computer can use to locate words it knows.
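The sketch below shows, in Python, how such a descriptor might be assembled from a binarized word image. The specific measurements here (aspect ratio, ink density, slant, and a resampled column profile) are illustrative stand-ins; the 30 features the NCSA system actually measures are not spelled out in this article.

```python
# Illustrative word descriptor (assumed features, not the NCSA system's own):
# reduce a binarized word image (ink = 1, paper = 0) to a 30-value vector.
import numpy as np

def word_descriptor(word_img):
    ys, xs = np.nonzero(word_img)
    h = ys.max() - ys.min() + 1                     # height of the word box
    w = xs.max() - xs.min() + 1                     # width of the word box
    density = word_img.sum() / (h * w)              # how much ink fills the box
    aspect = w / h                                  # wide vs. tall words
    # Estimate slant (in degrees) from the principal axis of the ink pixels.
    cov = np.cov(np.vstack([xs, ys]))
    slant = np.degrees(0.5 * np.arctan2(2 * cov[0, 1], cov[0, 0] - cov[1, 1]))
    # A column-ink profile, resampled to a fixed length, captures letter shapes.
    profile = word_img[:, xs.min():xs.max() + 1].sum(axis=0).astype(float)
    profile = np.interp(np.linspace(0, len(profile) - 1, 26),
                        np.arange(len(profile)), profile)
    profile /= profile.max() + 1e-9
    return np.concatenate([[aspect, density, slant, h], profile])  # 30 values
```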
PSC's Blacklight proved ideal for the task, McHenry says. Part of the computational challenge is in crunching data from different, largely independent entries as quickly as possible. Blacklight, while not as massively parallel as some supercomputers, has thousands of processors that do just that. More importantly, Blacklight's shared memory could handle the massive amount of data the researchers extracted from the census collection — a 30-dimensional vector for each word in each entry — and enable computation to proceed at a faster pace without many return trips to the disk.
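In practical terms, the workload resembles the hypothetical driver below: each entry is processed independently, and the resulting vectors are gathered into a single in-memory array that later searches can sweep without returning to disk. The worker setup and the faked descriptors are placeholders, not the team's actual job script.

```python
# Hypothetical driver for the embarrassingly parallel part of the job:
# every census entry is reduced to its descriptors independently, then the
# full set of vectors is held in memory for fast searching.
from multiprocessing import Pool
import numpy as np

def descriptors_for_entry(entry_id):
    # Placeholder: a real run would load the entry's image, clean it, and
    # compute a descriptor per word (see the sketches above).
    rng = np.random.default_rng(entry_id)
    return rng.random(30)

if __name__ == "__main__":
    entry_ids = range(100_000)
    with Pool() as pool:                      # one worker per available core
        vectors = np.array(pool.map(descriptors_for_entry, entry_ids))
    print(vectors.shape)                      # (100000, 30), all resident in RAM
```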
‘Good enough’ accuracy: but scalable!
The system can retrieve word matches despite the idiosyncrasies of the handwriting. Of course, as in any computer-vision-based system, it also returns incorrect results. The idea is to quickly produce a ‘good enough’ list of 10 to 20 entries that may match a person's query, rather than taking far longer to attempt an exact match.
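Conceptually, that retrieval step is a nearest-neighbor lookup over the stored word vectors. The short Python sketch below shows a brute-force version, included only for illustration; the production system's indexing and distance measure may differ.

```python
# Illustrative 'good enough' retrieval: rank stored word vectors by their
# distance to the query vector and return the closest 10-20 candidates,
# leaving the final judgment to the human searcher.
import numpy as np

def top_matches(query_vec, word_vectors, k=20):
    """word_vectors: (N, 30) array of descriptors; returns indices of the
    k entries nearest to query_vec by Euclidean distance."""
    distances = np.linalg.norm(word_vectors - query_vec, axis=1)
    return np.argsort(distances)[:k]
```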
"We get some results that aren't very good," Diesendruck says. "But the user can click on the ones he or she is looking for. It isn't perfect, but instead of looking through thousands of entries you're looking at 10 to 20 results."
Search engines like Google have made people very demanding about how long they will wait for search results. While searchers expect speed, they don't expect extreme precision; they don't mind scanning a short list of possible answers to their query. The script-search technology works much like what they're accustomed to seeing, which makes acceptance likely.
There is one other virtue to how the system works, McHenry points out. "We store what the user says was correct," using the human searcher's choices to identify the right answers and further improve the system. Such crowdsourcing allows the investigators to combine the best features of machine and human intelligence to improve the output. "It's a hybrid approach that aims to keep the human in the loop as much as possible."
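One way to picture that feedback loop is the hypothetical snippet below, which simply records each confirmed (query, entry) pair so it can be surfaced directly the next time and mined later as labeled training data. The table layout and function names are invented for illustration; the article does not describe how the team actually stores this feedback.

```python
# Hypothetical feedback store: when a searcher confirms a result, remember
# the (query word, entry id) pair so future searches can return it first
# and confirmed pairs can serve as labeled examples for later retraining.
import sqlite3

conn = sqlite3.connect("feedback.db")
conn.execute("""CREATE TABLE IF NOT EXISTS confirmed_matches (
                    query_word TEXT,
                    entry_id   INTEGER,
                    PRIMARY KEY (query_word, entry_id))""")

def record_confirmation(query_word, entry_id):
    conn.execute("INSERT OR IGNORE INTO confirmed_matches VALUES (?, ?)",
                 (query_word, entry_id))
    conn.commit()

def confirmed_entries(query_word):
    rows = conn.execute(
        "SELECT entry_id FROM confirmed_matches WHERE query_word = ?",
        (query_word,))
    return [row[0] for row in rows]
```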
Today the group is using Blacklight to carry out test searches, refining and preparing the system to search all manner of handwritten records. Their work will help keep those records alive and relevant. It will also give scholars — not only in the ‘hard’ sciences, but also in the humanities — the ability to use and analyze thousands of documents rather than just a select few.