New computational techniques developed at Lawrence Berkeley National Laboratory (LBNL) in California, US, may help save scientists from drowning in their own data. Computational scientists at the lab have figured out how to streamline the analysis of enormous scientific datasets. The analysis uses the same techniques that make complex subway systems understandable at a glance.
A new approach
Sophisticated sensors and supercomputers are generating bigger and more complex scientific datasets than ever before. In disciplines like genomics, combustion, and climate science, these datasets can range from tens of terabytes to several petabytes in size. A petabyte of data is equivalent to storage for 13.3 years of high-definition television.
To tease out significant features for analysis, many scientists turn to a branch of mathematics called topology, which characterizes shapes of objects without considering aspects like length or angles—simplifying them the same way a subway map turns a complex maze of tunnels, trains, and stations into colored lines and dots. However, scientific data is now so massive and complex that even simplified topological representations are becoming difficult to analyze efficiently. As a result, more researchers are turning to massively parallel supercomputers to study their data.
Existing algorithms for analyzing data topologically do not take full advantage of supercomputers’ thousands of processors, so LBNL computational scientists created a new approach called distributed merge trees. This approach will let scientists make better use of next-generation supercomputers, and quickly separate significant features from “noisy” data.
“The growth of serial computational power has stalled, so data analysis is becoming increasingly dependent on massively parallel machines,” says Gunther Weber, a computational researcher in LBNL’s visualization group. “To satisfy the computational demand created by complex datasets, algorithms need to effectively use these parallel computer architectures.” Both Weber and LBNL postdoctoral researcher Dmitriy Morozov pioneered the distributed merge tree approach to topological data analysis.
If you’ve looked at a pocket map of the London Underground or New York City subway system, you’ve seen topology in action. These seemingly simple representations ignore details like distance and physical locations of stations, but still preserve important information like which line a station is on, how different lines are connected, and the overall structure of the network.
In topology, what matters most is how things are connected. If you disregard distance, the size of an object no longer matters. The object can be stretched or squeezed and still remain topologically unchanged – so a large complicated structure like London’s tube can be condensed into an easy-to-read, pocket-sized map by omitting geography and placing subway stations evenly on a line. Likewise, topology can be used to map the distribution of galaxies in the universe or burning regions in a combustion simulation.
Sifting through the noise
After generating a dataset, scientists can use the distributed merge tree algorithm to translate it into a topological map. The algorithm scans the entire dataset and tags values of interest to the scientists, as well as merge points or connections in the data. For example, if the dataset is a height map of the Himalayas, the distributed merge tree algorithm will initially scan the entire dataset and record all the peaks in the mountain range. Then, it will generate a topological map illustrating how the different mountains are connected. Morozov notes that these connections help researchers quickly differentiate between real features and noise in the data.
“A quick search for the six tallest mountains in the Himalayas will show Mount Everest, K2, Kangchenjunga, Lhotse, Makalu, and Cho Oyu – but there are many more peaks in this range,” says Morozov. “For example, the peak of Lhotse Shar is actually higher than Cho Oyu, but it doesn’t count as one of the highest mountains because Lhotse Shar’s base – where it merges into Lhotse – is almost as high as its peak.”
Thus, a researcher who is only interested in the tallest mountains might consider Lhotse Shar noise in the data. A quick search of a topological map generated by the distributed merge tree will show that this mountain merges into Lhotse, and disregard it based on the query.
“This analogy applies to many areas of science as well,” says Morozov. “In a combustion simulation, researchers can use our algorithm to create a topological map of the different fuel consumption values within a flame. This will allow scientists to quickly pick out the burning and non-burning regions from an ocean of noisy data.”
Parallelizing the search
Distributed merge trees take advantage of massively parallel computers by dividing big topological datasets into blocks, and then distributing the workload across thousands of nodes. A supercomputer like the US National Energy Research Scientific Computing Center’s (NERSC’s) Hopper contains about 6,384 nodes, each of which contains 24 processor cores. In this approach, each block of data is assigned to a single node. Additionally, each node also stores an extremely simplified global map, much like a subway map. The nodes communicate only to exchange information pertaining to a shared boundary.
“Although the individual nodes simplify non-local parts of the map, each portion of the map is still available in full resolution on some node, so the combined analysis of the map is the same as the un-simplified map,” says Morozov. “By identifying thresholds, or tags, scientists can ensure that the desired level of detail required for analysis is not disregarded as the global map is simplified.”
“Today, most researchers will have only one node keep track of the big picture, but the memory on a single compute node is often insufficient to store this information at the desired detail for analysis,” says Weber. “These problems are further exacerbated by the trend in supercomputing to add more processor cores to a chip without adding correspondingly more memory. As a result, each new generation of parallel computers is operating with less memory per core than the previous one.”
According to Weber, once a global topological representation is resident on a single node, it’s difficult to parallelize queries for features and derived quantities. This, in turn, leads to processor underutilization and slower queries. “By reducing the tree size per node while maintaining a full accurate representation of the merge tree, we speed up the topological analysis and make it applicable to larger datasets,” says Weber. “This is also an important step in making topological analysis available on massively parallel, distributed memory architectures.”