| Zhiling Lan is working to increase Mean Time Between Failures by introducing adaptive fault tolerance. Image courtesy of Zhiling Lan | “We would like to see FT-Pro used to help avoid anticipated failures, and to help applications tolerate unforeseeable failures, so that the impact of any failure is kept to a minimum,” explains Lan. The system works by allocating a couple of spare nodes, which serve as an extra hand for juggling jobs on and off nodes where failure is predicted. Usually kept idle, these spare nodes make it possible to migrate work away from failing nodes, buying time for those nodes to recover or restart and thus reducing application execution time. Trace-based experiments on the IA32 Linux cluster at Argonne National Laboratory (part of TeraGrid) indicate that FT-Pro can effectively improve the performance of parallel applications in the presence of failures, both by avoiding anticipated failures and by skipping unnecessary fault tolerance overhead. For example, when running Enzo on the 96-node IA32 TeraGrid/ANL cluster, FT-Pro reduced application completion time by up to 43% compared with relying purely on periodic checkpointing. FT-Pro is supported in part by the U.S. National Science Foundation, an IIT startup fund, and a TeraGrid Wide-Roaming Allocation. - Cristy Burne, iSGTW |