User talk:Failure Prediction and Diagnosis at IIT

FENCE: Fault awareness ENabled Computing Environment

As the scale of high performance computing (HPC) continues to grow, fault management is becoming a critical challenge. Recent studies have pointed out that the mean time between failures (MTBF) of teraflop and petaflop machines is only on the order of 10-100 hours. This situation is likely to deteriorate in the near future, threatening the productivity of HPC systems. Checkpointing is the conventional method for fault tolerance. However, it deals with failures only after they occur, through rollback. When a single process fails, all processes, including non-faulty ones, have to be restarted from the most recently saved state. Thus, significant performance loss can be incurred from lost work and failure recovery. Proactive approaches take preventive actions (e.g. preemptive process migration) before failures occur, thereby avoiding failures at low cost. Nevertheless, their effectiveness relies on perfect fault prediction, which is hardly achievable in practice.

This project aims at building FENCE, a Fault awareness ENabled Computing Environment for HPC. FENCE is "hybrid" in that it integrates long-term and short-term support to enhance fault management in HPC. Long-term support models the likelihood of faults based on historical data and thereby facilitates fault-aware scheduling by intelligently mapping jobs to available resources; short-term (runtime) support diagnoses runtime events and triggers job rescheduling on the fly to move running jobs away from troublesome resources. FENCE is also "adaptive" in that it combines the merits of the newly emerged proactive fault-tolerance approach and the traditional checkpointing approach. Proactive actions enable applications to avoid anticipated faults where possible, whereas reactive actions aim to minimize the impact of unforeseeable failures.
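
The interplay between the long-term and short-term support described above can be illustrated with a small Python sketch. This is not FENCE's actual code or interface: the names Node, schedule, and choose_action, the failure_prob and alert fields, and the thresholds max_risk and checkpoint_every are all hypothetical, chosen only to make the idea concrete. The sketch assumes that a per-node failure probability is already available from historical data and that runtime diagnosis raises a simple per-node alert flag.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    MIGRATE = "migrate"        # proactive: move processes off a suspect node
    CHECKPOINT = "checkpoint"  # reactive safety net: save state for rollback
    CONTINUE = "continue"      # no action needed in this interval


@dataclass
class Node:
    name: str
    failure_prob: float  # hypothetical long-term estimate from historical data
    alert: bool = False  # hypothetical short-term flag from runtime diagnosis


def schedule(nodes, count, max_risk=0.2):
    """Long-term support: map a job to the `count` least failure-prone nodes."""
    eligible = sorted(
        (n for n in nodes if n.failure_prob <= max_risk),
        key=lambda n: n.failure_prob,
    )
    if len(eligible) < count:
        raise ValueError("not enough low-risk nodes for this job")
    return eligible[:count]


def choose_action(job_nodes, spare_nodes, intervals_since_checkpoint,
                  checkpoint_every=3):
    """Short-term support: pick a proactive or reactive action for one interval."""
    if any(n.alert for n in job_nodes):
        # A failure is anticipated: migrate if a healthy spare node exists,
        # otherwise fall back to saving state immediately.
        if any(not s.alert for s in spare_nodes):
            return Action.MIGRATE
        return Action.CHECKPOINT
    # No warning: keep periodic checkpoints to cover unforeseeable failures.
    if intervals_since_checkpoint >= checkpoint_every:
        return Action.CHECKPOINT
    return Action.CONTINUE


nodes = [Node("n1", 0.05), Node("n2", 0.30), Node("n3", 0.10), Node("n4", 0.15)]
job = schedule(nodes, 2)
print([n.name for n in job])  # ['n1', 'n3']
```

In this sketch, periodic checkpointing stays in place even when no alert is raised, reflecting the point made earlier that fault prediction cannot be assumed to be perfect.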

Faculty Members: Zhiling Lan, Xian-He Sun

Graduate Students: Yawei Li, Ziming Zheng, Jin Hui, Jiexing Gu, Prashasta Gujrati

Collaborators: Rajeev Thakur (ANL), John White (SDSC)

Recent Talks: Z. Lan, "Building a Fault-aware Computing Environment for High End Computing", APART'07 (in conjunction with SC07)

Recent Publications: Y. Li, Z. Lan, P. Gujrati, and X. Sun, "Fault-Aware Runtime Strategies for High Performance Computing", SCS Tech Report, submitted for journal publication, 2007.

Z. Lan and Y. Li, "Adaptive Fault Management of Parallel Applications for High Performance Computing", accepted by IEEE Trans. on Computers, 2007.

Z. Zheng, Y. Li, and Z. Lan, "Anomaly Localization in Large-scale Clusters", Proc. of IEEE Cluster'07, 2007.

M. Wu, X.-H. Sun and H. Jin, "Performance under Failure of High-End Computing", Proc. of SC'07, 2007.

P. Gujrati, Y. Li, Z. Lan, R. Thakur, and J. White, "Exploring Meta-learning to Improve Failure Prediction in Supercomputing Clusters", Proc. of ICPP'07, 2007.

Y. Li, P. Gujrati, Z. Lan, and X. Sun, "Fault-Driven Re-Scheduling for Improving System-Level Fault Resilience", Proc. of ICPP'07, 2007.

Z. Lan, Y. Li, P. Gujrati, Z. Zheng, R. Thakur, and J. White, "A Fault Diagnosis and Prognosis Service for TeraGrid Clusters", Proc. of TeraGrid'07, 2007.

Y. Li and Z. Lan, "Using Adaptive Fault Tolerance to Improve Application Robustness on the TeraGrid", Proc. of TeraGrid'07, 2007.

Y. Li and Z. Lan, "Exploit Failure Prediction for Adaptive Fault-Tolerance in Cluster Computing", Proc. of IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid06), Singapore, 2006.

Y. Li and Z. Lan, "Improving Fault Resilience of High Performance Applications", Research Poster at SC06, 2006.

C. Du and X.-H. Sun, "MPI-Mitten: Enabling Migration Technology in MPI", Proc. of CCGrid'06, 2006.

Z. Lan and Y. Li, "Failure-Aware Resource Selection for Grid Computing", Proc. of IEEE Conference on Dependable Systems and Networks (Fast Abstract), 2006.

C. Du, X.-H. Sun, and K. Chanchio, "HPCM: A Pre-compiler Aided Middleware for the Mobility of Legacy Code", Proc. of IEEE Cluster'03, 2003.

Contact: Dr. Zhiling Lan (lan AT iit DOT edu)

This work is supported by the US National Science Foundation.