Isaac Asimov, the American science fiction and popular science writer, famously said, "The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' (I found it!) but 'That's funny...'"
In a world swimming in information, how does a scientist have such a revelation? How do they find a needle of insight in a growing digital haystack?
They use scientific visualizations, an essential yet often overlooked tool for discovery. The process of visualization converts data—from sensors, DNA sequencers, social networks, and massive high-performance computing simulations and models—into images that can be perceived by the eye and explored and interpreted by the human mind.
This aspect of discovery has always been valuable, but as our ability to simulate subatomic particles, perform high-resolution 3D scans of the body, or map the universe improves, turning that data into useful information is increasingly critical.
In November 2008, the U.S. National Science Foundation requested proposals for "TeraGrid Phase III: eXtreme Digital Resources for Science and Engineering (XD)." The grants funded the first of a new class of computing systems: two state-of-the-art resources, one at the Texas Advanced Computing Center (TACC) and one at the National Institute for Computational Sciences (NICS), that together significantly increased the visualization and data analysis capabilities of the open science community.
The NSF solicitation was motivated by an awareness that simulations on high-performance computing systems and data from new scientific instruments were producing copious amounts of information that could not be analyzed or visualized by any previous system.
"We were seeing science at a completely different scale," said Kelly Gaither, principal investigator for the XD Vis award and director of visualization at TACC. "These systems address the data deluge that we saw coming down the pipe as a result of the bigger HPC systems."
TACC's Longhorn was deployed in January 2010 and has been supporting visualization, data analysis, and general computing for a year and a half. A Dell cluster with both NVIDIA GPUs and Intel quad-core CPUs on each node, Longhorn provides unprecedented capabilities. Foremost among them is the ability to remotely visualize massive datasets in real time.
This means a research group in Topeka, Kansas, can compute and visualize their dataset on the Longhorn system in Austin, Texas, from the quiet of their offices. The researchers can move, spin, zoom, and, in some cases, animate the subject with the touch of a button.
Gaither thinks this new capability—a hands-on approach to virtual experiments—improves scientists' relationship to their data and has the potential to transform research.
"Oftentimes, researchers don't know what they're looking for. They use visualization to do debugging or to do exploratory analysis of their simulation data. In those cases, visualization is really the only way to see," Gaither said. "It's generally recognized in the vis community that interactivity is a crucial component of being able to do that analysis."
Longhorn is the largest hardware-accelerated interactive visualization cluster in the world and has supported these real-time interactions for users as remote as Saudi Arabia. Longhorn can also handle extremely large datasets, including highly detailed visualizations created to study the instabilities in a burning helium flame.
Nautilus, an SGI Altix UV1000 system, is another large computing system designed for remote visualization and analysis. It, too, has substantial computational and GPU capacity, but its architecture differs markedly from Longhorn's. Nautilus is a symmetric multiprocessor (SMP) machine, in which all of the processors share all of the available memory. Scientists see 1,024 CPU processors and 4 TB of memory as a single system. The system also contains eight GPUs for general-purpose processing and hardware-accelerated graphics.
"Graph and societal network analysis. Correlation and document clustering. There are all sorts of analyses that are not amenable to a cluster type of architecture," explained Sean Ahern, director of the Center for Remote Data Analysis and Visualization at NICS (the center that operates Nautilus), and visualization task leader at Oak Ridge National Laboratory. "We've been able to accelerate the science that researchers are already doing, taking it from weeks to hours, and we have other projects where the size of the memory means researchers can pull in entire datasets where they were never able to do so before."
Unlike the pure visualization systems that dominated in the past, these machines were built to be multipurpose, supporting interactive and batch visualization, GPGPU (general-purpose GPU) computing, traditional HPC, and new kinds of data analysis.
This composite nature lets the systems provide improved visualization resources to the academic community while staying fully utilized, which maximizes the return on the public investment.
Like all resources in the XSEDE infrastructure, Longhorn and Nautilus run 24 hours a day, 7 days a week, 365 days a year, and are supported by expert staff at the host centers. The resources are available to U.S. researchers through an XSEDE allocation from the National Science Foundation.
Over the past year and a half, 1,560 scientists have used Longhorn and Nautilus, applying the systems' speed and capabilities to wide-ranging science problems while also exploring what role GPU processing can play in science generally.
The results emerging from the systems are encouraging.
Notable successes on Longhorn include a collaboration with the National Archives and Records Administration to develop a new visualization framework for digital archivists; visualizations of the Gulf oil spill that helped the National Oceanic and Atmospheric Administration and the Coast Guard locate and contain oil slicks; record-setting molecular dynamics simulations of surfactants, which are used in detergents, manufacturing, and nanotechnology; and visualizations of the earthquake in Japan.
"With our analysis code, I get as much as 16,000 times speedup on Longhorn, which has given much insight into the physics of the protein-water interface, and allows us to understand at a more fundamental level how nature designs proteins to catalyze reactions under non-extreme conditions," said David LeBard, a postdoctoral fellow in the Institute for Computational Molecular Science at Temple University.
Simulations by LeBard and his collaborator Dmitry V. Matyushov appeared in the Journal of Physical Chemistry B and were featured on the cover of Physical Chemistry Chemical Physics in December 2010.
Nautilus has seen similar successes. Researchers on the system have performed unprecedented species modeling in the Great Smoky Mountains National Park, a biodiversity hot spot; gained new insights into the role turbulence plays in fusion; and explored how human society has evolved over the last half-century using historical sources.
"Nautilus has been a critical enabling resource for the GlobalNet project in several ways," said Kalev Leetaru, senior research scientist for content analysis at the Illinois Institute for Computing in Humanities, Arts, and Social Science (I-CHASS). "Most visibly, the ability to instantly leverage terabytes of memory in a single system image has allowed the project for the first time to move beyond small 1 to 5 percent samples to explore the dataset as a whole, leading to numerous fundamental new discoveries simply not possible without the ability to analyze the entire dataset at once."
Together, the two systems have supported 759 projects, totaling 11.4 million computing hours (the equivalent of 1,250 years on a single desktop system) in the last year and a half.
Visualization and data analysis are clearly moving into the mainstream, and with the Extreme Digital visualization grants, the NSF has given a big boost to the national science community. Gaither and Ahern believe this could be the beginning of a new paradigm.
"Seeing the visualization and interacting with the data is probably one of the great enablers that will propel science for the next generation and beyond," Gaither said. "I think in some respects, you won't even see this intermediate thing called a 'dataset.' You will interact with the simulation itself, or, if you'd prefer, with the science."
Ahern went further.
"Data without analysis is nothing," Ahern said. "If you've run a giant simulation, you've only done half the work. The real science comes from processing that data into something that people can understand. The job of science is done in the phase of analysis, and that's purely where we live."
A version of this story first appeared on the XSEDE website.