Feature - Adaptive fault tolerance for improved reliability
So there’s the good news. And then there’s the bad news.
The good news is that high performance computing systems are getting bigger.
And the bad news? As system size increases, the mean time between failures (MTBF) drops dramatically: the number of hours you can run your application before everything grinds to a halt just keeps getting smaller.
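A rough back-of-the-envelope sketch shows why. It assumes, purely for illustration, that every node fails independently at the same constant rate, so the expected time to the first failure anywhere in the system is the per-node figure divided by the node count. The function name and the 50,000-hour per-node figure are hypothetical and are not taken from the FT-Pro work.

```python
def system_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """Expected hours until the first failure among num_nodes identical nodes.

    Assumption: failures are independent and exponentially distributed,
    so failure rates add and the system MTBF is node MTBF / node count.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return node_mtbf_hours / num_nodes


if __name__ == "__main__":
    # Illustrative per-node MTBF only; real hardware figures vary widely.
    for nodes in (100, 1_000, 10_000):
        print(nodes, "nodes:", system_mtbf_hours(50_000, nodes), "hours")
```

Under these assumptions, every tenfold increase in node count cuts the expected failure-free running time by a factor of ten.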
Yawei Li and Zhiling Lan of the Illinois Institute of Technology, U.S., want to change all that.
They have developed an adaptive fault management scheme, called FT-Pro, which has already improved the robustness of several real-world applications run on the TeraGrid, including Enzo, a software package for simulating cosmological structures, and GROMACS, a molecular dynamics package for studying molecular interactions.
“Applications like these are getting larger, running for longer, and using more processors,” Lan says. “But, since just one process failure can crash your entire application, these applications are extremely vulnerable to failure.”
The usual solution, says Lan, is either to undertake regular reactive checkpointing, or to be proactive and predict potential failures before they occur. Both options are fraught with complications.
“Regular checkpointing results in substantial performance overhead, while predicting failure can be very hit-and-miss. We wanted something that could combine the best of both these approaches.”
FT-Pro is Lan’s solution. The program works in conjunction with regular failure management tools, but introduces the flexibility of adaptive decision making: FT-Pro can make runtime decisions based on a user’s fault tolerance requests.
“We would like to see FT-Pro used to help avoid anticipated failures, and to help applications tolerate unforeseeable failures, so that the impact of any failure is kept to a minimum,” explains Lan.
The system works by setting aside a couple of spare nodes, which act as an extra pair of hands for moving jobs on and off nodes where failure is predicted.
Usually kept idle, these spare nodes allow processes to migrate away from failing nodes, buying time for those nodes to be recovered or restarted and thus keeping application execution time to a minimum.
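The kind of runtime choice described above can be sketched in a few lines. This is a simplified illustration, not FT-Pro's actual algorithm: the three actions, the function name, and the `max_skips` safeguard are all assumptions made for the example. The idea is to migrate when a failure is predicted and a spare node is free, fall back to a checkpoint when no spare is available, and otherwise skip the checkpoint to avoid unnecessary overhead, while still checkpointing occasionally to guard against failures nobody predicted.

```python
from enum import Enum


class Action(Enum):
    MIGRATE = "migrate"        # move processes onto a spare node
    CHECKPOINT = "checkpoint"  # save application state
    SKIP = "skip"              # do nothing at this decision point


def choose_action(failure_predicted: bool,
                  spare_nodes_free: int,
                  intervals_since_checkpoint: int,
                  max_skips: int = 3) -> Action:
    """Pick a fault tolerance action at one decision point (hypothetical policy)."""
    if failure_predicted and spare_nodes_free > 0:
        return Action.MIGRATE
    if failure_predicted:
        # No spare node to migrate to, so protect progress with a checkpoint.
        return Action.CHECKPOINT
    if intervals_since_checkpoint >= max_skips:
        # Guard against unforeseen failures by not skipping indefinitely.
        return Action.CHECKPOINT
    return Action.SKIP


if __name__ == "__main__":
    print(choose_action(True, 2, 0))
    print(choose_action(True, 0, 0))
    print(choose_action(False, 2, 1))
```

A real scheme would also have to weigh how reliable the failure predictor is and how much each action costs, which this sketch leaves out.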
Trace-based experiments on the IA32 Linux cluster at Argonne National Laboratory (part of TeraGrid) have indicated that FT-Pro can effectively improve the performance of parallel applications in the presence of failures by avoiding anticipated failures and skipping unnecessary fault tolerance overhead.
For example, running Enzo with FT-Pro on the 96-node IA32 TeraGrid/ANL cluster reduced application completion time by up to 43% compared with relying purely on periodic checkpointing.
FT-Pro is supported in part by the U.S. National Science Foundation, an IIT startup fund, and a TeraGrid Wide-Roaming Allocation.
- Cristy Burne, iSGTW