User:Joao.duraes/Sandbox

THIS IS A WORK-IN-PROGRESS PAGE. If you are not with Amber, ignore this and come back in a few days. Authorship of this text belongs to the authors of D2.1/Amber, to be released soon.

Fault Injection

A critical issue in the development of a resilient computer system is the validation of its fault-handling mechanisms. Ineffective or unintended operation of these mechanisms can significantly impair the dependability of a computer system. Assessing the effectiveness and verifying the correctness of fault-handling mechanisms in computer systems is therefore of vital importance. Fault injection is an important experimental technique for assessment and verification of fault-handling mechanisms. It allows researchers and system designers to study how computer systems react and behave in the presence of faults. Fault injection is used in many contexts and can serve different purposes, such as:
 * Assess the effectiveness, i.e., fault coverage, of software- and hardware-implemented fault-handling mechanisms.
 * Study error propagation and error latency in order to guide the design of fault-handling mechanisms.
 * Test the correctness of fault-handling mechanisms.
 * Measure the time it takes for a system to detect or to recover from errors.
 * Test the correctness of fault-handling protocols in distributed systems.
 * Verify failure mode assumptions for components or subsystems.

Over the years, many researchers have addressed the problem of validating fault-handling mechanisms by fault injection. Numerous papers on the assessment and verification of fault-tolerant systems or individual mechanisms, and on fault injection tools, have been published. This chapter gives an overview of the current state of the art and of selected important historical achievements in the area of fault injection.

Fault injection can in principle be carried out in two ways: faults can be injected either in a real system or in a model of a system. By a real system we mean a physical computer system, either a prototype or a commercial product. System models for fault injection experiments can be built using two basic techniques: software simulation and hardware emulation.
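
As a rough illustration of the first purpose in the list above (assessing fault coverage), the following minimal sketch injects bit-flip faults into a toy software model and counts how many of them a fault-handling mechanism detects. Everything in it is an assumption made for illustration: the 8-bit word, the single even-parity bit used as the detection mechanism, and the function names (`encode`, `error_detected`, `inject_bit_flips`, `fault_coverage`) do not come from any tool or system discussed in this chapter. Coverage is computed simply as detected faults divided by injected faults.

```python
from itertools import combinations

WORD_BITS = 8  # hypothetical word width of the toy model


def parity(word: int) -> int:
    """Return the even-parity bit (0 or 1) of a non-negative integer."""
    return bin(word).count("1") % 2


def encode(word: int) -> tuple:
    """Store a word together with its parity bit (the detection mechanism)."""
    return word, parity(word)


def error_detected(stored_word: int, stored_parity: int) -> bool:
    """The mechanism under test: report an error on a parity mismatch."""
    return parity(stored_word) != stored_parity


def inject_bit_flips(word: int, positions) -> int:
    """Inject a fault by flipping the bits at the given positions."""
    for pos in positions:
        word ^= 1 << pos
    return word


def fault_coverage(word: int, flips_per_fault: int) -> float:
    """Exhaustively inject every fault that flips `flips_per_fault` bits
    of the data word and return the fraction detected by the parity check."""
    data, check = encode(word)
    injected = 0
    detected = 0
    for positions in combinations(range(WORD_BITS), flips_per_fault):
        faulty = inject_bit_flips(data, positions)
        injected += 1
        if error_detected(faulty, check):
            detected += 1
    return detected / injected


if __name__ == "__main__":
    print(fault_coverage(0b10110010, 1))  # single-bit faults
    print(fault_coverage(0b10110010, 2))  # double-bit faults
```

In this toy model every single-bit fault changes the parity and is detected, while every double-bit fault leaves the parity unchanged and escapes detection. The measured coverage therefore depends strongly on the fault model chosen for the experiment, which is one reason injection campaigns state their fault model explicitly.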