|
Can smart checkpoints and fault-resilient applications avert a Malthusian Catastrophe?
As more powerful systems encompass ever-increasing numbers of components, even a small fault rate on individual processors will generate multiple faults across the components, stopping long-running applications in their tracks.
At a workshop in March, U.S. experts met to discuss issues relating to the fault-tolerance of today’s and tomorrow’s petascale and exascale computing systems. The group explored past practices and common pitfalls, and discussed strategies to ensure that these systems and the applications they run can tolerate the inevitable faults. “It is invaluable for the systems specialists, middleware designers, and applications scientists to share their experiences and to talk about their expectations for other parts of the HPC ecosystem. This is the only way we will know what works, what doesn’t work, and what we still need to do,” said Daniel S. Katz, TeraGrid Grid Infrastructure Group Director of Science and lead organizer of the workshop.
While sharing her experiences with Kraken, TeraGrid's largest supercomputer, Patricia Kovatch of the National Institute for Computational Sciences made an analogy with Thomas Malthus' famous prediction about geometric population growth (as the population gets bigger, it grows faster) versus a constant rate of growth in agricultural output. She claims that a similar dichotomy exists between the growth in application size and system complexity, and the rate of improvement in failure mitigation techniques.
“To stave off this Malthusian Catastrophe,” she said, “we are leveraging some of the same techniques that agriculture has: concentrating resources and making large infrastructure investments, developing wider markets and better distribution networks, and implementing more efficient technologies.”
Don Lamb, a University of Chicago professor and Director of the ASC/Alliance Flash Center, presented experiences from three production runs of simulation software, called FLASH, used by scientists in fields such as cosmology and plasma physics.
“FLASH handles astronomically large ranges of values of physical quantities, and operates at the upper level of available memory,” said Lamb. “Consequently, it has walked into almost every hardware or software limitation in the high end systems.”
“A checkpoint/rollback capability is in place,” he said, referring to a feature that saves a snapshot of a job's progress from which it can be restarted at a later time. “But it is controlled by the application, which has no way of detecting imminent component failures. If a failure happens just before checkpointing, rollback can be expensive.” He suggested a solution, called Fault Tolerance Backplane, that could keep the application informed about the state of the machine and use this knowledge to write a checkpoint before an imminent failure, thereby avoiding the expensive recovery scenario.
|