Big data is more than just a buzzword; it’s also a tool Rice University bioengineer Amina Qutub uses to save lives. She has designed an algorithm called ‘progeny clustering’ to identify which treatments should be given to children with leukemia.
Clustering is important for its ability to reveal information in complex sets of data like medical records. The technique is used in bioinformatics — a topic of interest to Rice scientists who work closely with fellow Texas Medical Center institutions.
Details of the work appear in Nature’s online journal Scientific Reports. Paper co-authors include Steven Kornblau, a professor in the departments of Leukemia and Stem Cell Transplantation at the University of Texas MD Anderson Cancer Center, and John Slater, an assistant professor of biomedical engineering at the University of Delaware.
The US National Science Foundation, the Leukemia and Lymphoma Society, and the Howard Hughes Medical Institute Med-Into-Grad Fellowship supported the research.
“Doctors who design clinical trials need to know how to group patients so they receive the most appropriate treatment,” Qutub says. “First, they need to estimate the optimal number of clusters in their data.” The more accurate the clusters, the more personalized the treatment can be, she says.
Separating groups by a single data point, like eye color, would be easy. But separating people by the types of proteins in their bloodstreams is more difficult. “That’s the kind of data that’s become prevalent everywhere in biology, and it’s good to have,” Qutub says. “We want to know hundreds of features about a single person. The problem is identifying how to use all that data.”
The Rice algorithm provides a way to ensure the number of clusters is as accurate as possible, she says. The algorithm extracts characteristics about patients from a data set, mixing and matching them randomly to create artificial populations — the ‘progeny,’ or descendants, of the parent data. The characteristics appear in roughly the same ratios in descendants as they do among the parents.
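The mixing-and-matching step can be sketched in a few lines of Python. This is a minimal illustration based only on the description above, not the lab's published implementation: the function name make_progeny, the toy patient records, and the choice to draw every dimension independently from the whole parent set are all assumptions.

```python
import random

def make_progeny(parents, n_progeny, seed=0):
    """Build artificial records by mixing and matching parent values.

    Hypothetical helper: each dimension of a new record is copied from a
    randomly chosen parent, independently of the other dimensions.
    """
    rng = random.Random(seed)
    n_dims = len(parents[0])
    progeny = []
    for _ in range(n_progeny):
        # Pick a (possibly different) parent for every dimension.
        child = [rng.choice(parents)[d] for d in range(n_dims)]
        progeny.append(child)
    return progeny

# Toy parent data (made up): hair color, place of birth, blood cell count.
parents = [
    ["brown", "Houston", 4.1],
    ["blond", "Austin", 5.3],
    ["brown", "Dallas", 6.0],
    ["black", "Houston", 4.8],
]
children = make_progeny(parents, n_progeny=1000)
```

Because every value is copied from a randomly chosen parent, a trait carried by half the parents (brown hair in this toy data) should show up in roughly half the progeny, which matches the "same ratios" property described above.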
These characteristics, called dimensions, can be anything: as simple as hair color or place of birth, or as detailed as one’s blood cell count or the proteins expressed by tumor cells. For even a small population, each individual may have hundreds or even thousands of dimensions.
By creating progeny described by the same dimensions as the parent data, the Rice algorithm increases the size of the data set. With this additional data, distinct patterns become more apparent, allowing the algorithm to optimize the number of clusters that warrant attention from doctors and scientists.
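One way to turn "patterns become more apparent" into a number is to check whether progeny generated from the same cluster land together again when the enlarged data set is re-clustered. The sketch below scores that agreement for one candidate number of clusters. It is an illustrative assumption in the spirit of stability-based cluster evaluation, not the formula from the paper; the function name stability_score and the example labels are hypothetical.

```python
from itertools import combinations

def stability_score(origin, relabel):
    """Compare two labelings of the same progeny.

    origin[i]  -- cluster the i-th progeny was generated from
    relabel[i] -- cluster the i-th progeny falls in after re-clustering
    Returns (kept, merged): the fraction of same-origin pairs that stay
    together, and the fraction of different-origin pairs that end up together.
    """
    same_total = same_kept = diff_total = diff_merged = 0
    for i, j in combinations(range(len(origin)), 2):
        together = relabel[i] == relabel[j]
        if origin[i] == origin[j]:
            same_total += 1
            same_kept += together
        else:
            diff_total += 1
            diff_merged += together
    kept = same_kept / same_total if same_total else 0.0
    merged = diff_merged / diff_total if diff_total else 0.0
    return kept, merged

# Hypothetical labels for six progeny, three from each of two clusters.
origin = [0, 0, 0, 1, 1, 1]
stable = stability_score(origin, [1, 1, 1, 0, 0, 0])    # groups preserved
unstable = stability_score(origin, [0, 1, 0, 1, 0, 1])  # groups scrambled
```

A candidate number of clusters whose progeny give a high "kept" fraction and a low "merged" fraction is a more plausible choice than one whose groupings fall apart on re-clustering.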
Qutub and lead researcher Wendy Hu, a graduate student in her lab at Rice’s BioScience Research Collaborative, say their technique is just as reliable as state-of-the-art clustering evaluation algorithms, but at a fraction of the computational cost. In lab tests, progeny clustering compared favorably to other popular methods, they wrote, and it was the only method to successfully discover clinically meaningful groupings in an acute myeloid leukemia reverse phase protein array data set.
Progeny clustering also allows researchers to determine the ideal number of clusters in small populations, Qutub says. In fact, the algorithm is now at work in an ongoing trial involving leukemia patients at Texas Children’s Hospital. There, Qutub says, “progeny clustering allowed them to design a robust clinical trial, even though that trial did not involve a large number of children. It meant they didn’t have to wait to enroll more.”
Technologies that gather data about patients — from sophisticated hospital equipment to simple wrist-worn health monitors — are advancing rapidly. That puts a premium on tools that can decipher growing mountains of data. Although ten patients may be few in number, there may be hundreds or thousands of dimensions for each.
“Big data is just numbers, but the numbers don’t have any value if you don’t get information from them,” Hu says. “My job is to look at these numbers and use computational tools and insights from biology to generate new information. This can help us know more about diseases and come up with therapeutic solutions and diagnostic schemes and identify new drug targets.”
The lab plans to make the algorithm available for free through its website.
--Mike Williams, senior media relations specialist, Rice University's Office of Public Affairs