iSGTW - International Science Grid This Week

iSGTW - 22 April 2009

Feature - Embrace failure!


Don Lamb addressing the Fault Tolerance for Extreme Scalability Workshop, co-sponsored by The National Science Foundation Office of Cyberinfrastructure’s Blue Waters and TeraGrid projects.

Image courtesy of TeraGrid External Relations.

Can smart checkpoints and fault-resilient applications avert a Malthusian Catastrophe?

As more powerful systems encompass ever-increasing numbers of components, even a small fault rate in each individual processor adds up to frequent faults across the system as a whole, stopping long-running applications in their tracks.
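To make the scaling concrete, here is a minimal sketch under simplifying assumptions that are not from the workshop: components fail independently at a constant rate, and any single component fault stops the job. The function names and the figures in the usage loop are hypothetical.

```python
import math


def system_mtbf_hours(component_mtbf_hours: float, n_components: int) -> float:
    """Mean time between faults for the whole machine.

    Assumes independent components with constant failure rates,
    so the individual rates simply add up.
    """
    return component_mtbf_hours / n_components


def prob_job_interrupted(component_mtbf_hours: float,
                         n_components: int,
                         job_hours: float) -> float:
    """Probability that at least one component fails while the job runs."""
    rate = n_components / component_mtbf_hours  # system faults per hour
    return 1.0 - math.exp(-rate * job_hours)


if __name__ == "__main__":
    # Hypothetical figure: each component fails about once per million hours.
    mtbf = 1_000_000.0
    for n in (1_000, 10_000, 100_000):
        print(n,
              system_mtbf_hours(mtbf, n),
              round(prob_job_interrupted(mtbf, n, job_hours=24.0), 3))
```

Under these assumptions the system-level interval between faults shrinks in direct proportion to the component count, which is why a fault rate that is negligible for one processor becomes routine at scale.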

At a workshop in March, U.S. experts met to discuss issues relating to the fault-tolerance of today’s and tomorrow’s petascale and exascale computing systems. The group explored past practices and common pitfalls, and discussed strategies to ensure that these systems and the applications they run can tolerate the inevitable faults.
 
“It is invaluable for the systems specialists, middleware designers, and applications scientists to share their experiences and to talk about their expectations for other parts of the HPC ecosystem. This is the only way we will know what works, what doesn’t work, and what we still need to do,” said Daniel S. Katz, TeraGrid Grid Infrastructure Group Director of Science and lead organizer of the workshop.

While sharing her experiences with Kraken, TeraGrid's largest supercomputer, Patricia Kovatch of the National Institute for Computational Sciences drew an analogy with Thomas Malthus' famous prediction that population grows geometrically (the bigger it gets, the faster it grows) while agricultural output grows only by a constant increment. She argued that a similar gap exists between the growth in application size and system complexity on the one hand, and the rate of improvement in failure mitigation techniques on the other.

“To stave off this Malthusian Catastrophe,” she said, “we are leveraging some of the same techniques that agriculture has: concentrating resources and making large infrastructure investments, developing wider markets and better distribution networks, and implementing more efficient technologies.”

Don Lamb, a University of Chicago professor and Director of the ASC/Alliance Flash Center, presented experiences from three production runs of FLASH, simulation software used by scientists in fields such as cosmology and plasma physics.

“FLASH handles astronomically large ranges of values of physical quantities, and operates at the upper level of available memory,” said Lamb. “Consequently, it has walked into almost every hardware or software limitation in the high-end systems.”

“A checkpoint/rollback capability is in place,” he said, referring to a feature that saves a snapshot of a job's progress from which it can be restarted at a later time. “But it is controlled by the application, which has no way of detecting imminent component failures. If a failure happens just before checkpointing, rollback can be expensive.” He suggested a solution, called the Fault Tolerance Backplane, that could keep the application informed about the state of the machine, so that the application can write a checkpoint before an imminent failure and avoid the expensive recovery scenario.
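The cost Lamb describes can be illustrated with a small sketch. It is not the Fault Tolerance Backplane interface: `lost_work` and its `warning_lead` parameter are hypothetical, and the model assumes that checkpoints are written instantly at fixed step intervals and that any warning arrives early enough for the extra checkpoint to complete.

```python
from typing import Optional


def lost_work(failure_step: int,
              checkpoint_interval: int,
              warning_lead: Optional[int] = None) -> int:
    """Steps of computation that must be redone after a failure.

    checkpoint_interval: the application checkpoints every this many steps.
    warning_lead: if given, a (hypothetical) monitoring service warns the
        application this many steps before the failure, and the application
        writes one extra checkpoint as soon as it is warned.
    """
    # Most recent scheduled checkpoint at or before the failure.
    last_checkpoint = (failure_step // checkpoint_interval) * checkpoint_interval
    if warning_lead is not None:
        warned_at = max(failure_step - warning_lead, 0)
        # The extra checkpoint only helps if it is newer than the scheduled one.
        last_checkpoint = max(last_checkpoint, warned_at)
    return failure_step - last_checkpoint


if __name__ == "__main__":
    # Failure one step before the next scheduled checkpoint: the worst case.
    print(lost_work(failure_step=999, checkpoint_interval=1000))
    # Same failure, but the application was warned 5 steps ahead.
    print(lost_work(failure_step=999, checkpoint_interval=1000, warning_lead=5))
```

Without a warning, the rollback cost depends on where the failure happens to fall within the checkpoint interval; with one, it is bounded by the warning lead time.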

John Daly, addressing the workshop.

Image courtesy of TeraGrid External Relations.

Several tool and application developers and other systems specialists shared their experiences regarding faults and resiliency, methodologies for acceptance testing, and performance metrics that recognize inevitable events such as chassis failure, boot failure, silent corruption, and more.

John Daly of the Research Directorate at the National Security Agency currently leads an effort on resilience for the Advanced Computing Systems research program. He advocates a focus shift from fault-tolerance in systems to resilience in applications.

Daly outlined three problems he sees in fault-tolerance approaches. First, as the number and density of components increase, so do system faults, and recovery-based fault-tolerance is approaching a theoretical limit. Second, redundancy-based schemes increase the share of resources dedicated to fault recovery. Third, silent failure modes, which are intolerable for many application users, reduce monitoring effectiveness, and hence both application progress and certainty of correctness.
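The article does not say which limit Daly had in mind. As one illustration of why checkpoint-and-restart stops paying off, the sketch below uses a standard first-order model (often called Young's approximation), which is an assumption introduced here rather than something presented at the workshop. With checkpoint cost c and system mean time between failures M, the fraction of time wasted at checkpoint interval t is roughly c/t + t/(2M), which is smallest at t = sqrt(2cM).

```python
import math


def young_interval(checkpoint_cost: float, system_mtbf: float) -> float:
    """First-order estimate of the checkpoint interval that minimizes waste."""
    return math.sqrt(2.0 * checkpoint_cost * system_mtbf)


def wasted_fraction(checkpoint_cost: float, system_mtbf: float) -> float:
    """Approximate fraction of machine time lost at the best interval.

    Substituting t = sqrt(2*c*M) into c/t + t/(2*M) gives sqrt(2*c/M).
    The model is only meaningful when c is much smaller than M, so the
    result is capped at 1.0 (no useful work gets done).
    """
    return min(1.0, math.sqrt(2.0 * checkpoint_cost / system_mtbf))


if __name__ == "__main__":
    cost_hours = 0.5  # hypothetical time to write one checkpoint
    for mtbf_hours in (1000.0, 100.0, 10.0, 1.0):
        print(mtbf_hours,
              round(young_interval(cost_hours, mtbf_hours), 2),
              round(wasted_fraction(cost_hours, mtbf_hours), 3))
```

In this model, as the time between failures shrinks toward the time needed to write a checkpoint, the wasted fraction approaches one, whatever interval the application chooses.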

Resilience, by contrast, is an application-centric paradigm that aims to protect applications from data corruption and Byzantine faults, Daly said. It aims to do so in a timely and efficient manner (considering tradeoffs in power, productivity and performance), even in the presence of hardware or software degradations and failures.

“Fault tolerance uses redundancy and replication to recover from failure,” he said. “Resilience offers a more integrated approach in which the system works with applications to keep them running in spite of component failure.”

Elizabeth Leake, TeraGrid, and Anne Heavey, iSGTW

A report of the proceedings will be available on the TeraGrid Web site.

 Please fill out a short questionnaire about your fault treatment mechanisms for Alexandre Duarte, a Ph.D. student in Computer Science at the Universidade Federal de Campina Grande in Brazil who is investigating this topic in the context of the EELA-2 and OurGrid projects.
