Recovery-oriented computing

Recovery-oriented computing (sometimes abbreviated to ROC) is an approach developed at Stanford University and the University of California, Berkeley for building reliable Internet services. Its proponents accept computer bugs as inevitable and aim to reduce their harmful effects. The National Science Foundation funds the project.

Several characteristics set recovery-oriented computing apart from other failure-handling techniques.

Isolation and redundancy
Isolation in these systems requires redundancy: should one part of the system fail, a redundant part must take its place. Isolation must hold for all types of failure, whether caused by software or by humans. One way to isolate parts of a system is to use a virtual machine monitor such as Xen. A virtual machine monitor allows many virtual machines to run on one physical machine; if one virtual machine has a problem, it can be restarted without restarting the physical machine, or it can be stopped and replaced by another.
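
The failover idea described above can be illustrated with a minimal Python sketch. It is not taken from the ROC project or from Xen; the Replica and RedundantService classes and the names "vm-a" and "vm-b" are hypothetical stand-ins for isolated virtual machines, and a real system would detect failures through health checks rather than exceptions alone.

```python
class Replica:
    """Hypothetical stand-in for one isolated component, such as a virtual machine."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} served {request}"

    def restart(self):
        # Restarting one replica leaves the other replicas untouched.
        self.healthy = True


class RedundantService:
    """Tries each replica in order, so one failure does not stop the service."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def handle(self, request):
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except RuntimeError:
                continue  # this replica failed; fall through to the next one
        raise RuntimeError("all replicas failed")


service = RedundantService([Replica("vm-a", healthy=False), Replica("vm-b")])
result = service.handle("GET /")  # served by vm-b because vm-a is down
```

In this sketch the failed replica can later be restarted on its own by calling restart(), after which it serves requests again without any change to the other replica.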

System-wide undo support
The ability to undo across different programs and time frames is necessary in this type of system because human error causes about half of system failures. A lack of undo support also limits testing on a production system, because it does not allow for trial and error.

System-wide undo support should cover all aspects of the system, including hardware and software upgrades, configuration, and application management. There are limits to what can be undone; these limits are currently being explored, tested, and rated according to their tradeoffs.
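
As a small illustration of undo applied to configuration changes, the following Python sketch keeps a log of previous values so that an operator's changes can be reversed in order. It is a sketch under simplifying assumptions, not the ROC project's undo mechanism: the UndoableConfig class and the setting names are hypothetical, and it covers only a single in-memory configuration rather than a whole system.

```python
class UndoableConfig:
    """Hypothetical configuration store that records how to reverse each change."""

    def __init__(self, initial=None):
        self.values = dict(initial or {})
        self._log = []  # stack of (key, key_existed, previous_value)

    def set(self, key, value):
        # Record the prior state before overwriting it.
        self._log.append((key, key in self.values, self.values.get(key)))
        self.values[key] = value

    def undo(self):
        """Reverse the most recent change; return False if nothing is left to undo."""
        if not self._log:
            return False
        key, key_existed, previous = self._log.pop()
        if key_existed:
            self.values[key] = previous
        else:
            del self.values[key]
        return True


config = UndoableConfig({"max_connections": 100})
config.set("max_connections", 10)  # an operator mistake
config.set("cache", "on")          # a trial change
config.undo()                      # removes "cache"
config.undo()                      # restores max_connections to 100
```

Because every change is logged before it is applied, an operator can experiment on the configuration and still return to the earlier state.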

Integrated diagnostic support
Integrated diagnostic support is another characteristic a recovery-oriented computer should have: the system should be able to identify the root cause of a failure. Once it has done so, it should either contain the failure so that it cannot affect other parts of the system, or repair it. All system components or modules should be self-testing; each should be able to detect when something is wrong with itself. As well as detecting their own problems, modules should be able to verify the behavior of the other modules on which they depend. The system must also track module, resource, and user request dependencies throughout the system, which allows failures to be contained.
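
Dependency tracking of the kind described above can be sketched in Python as a graph search: given a record of which modules depend on which, the system can work out what a failure might affect and contain it. The function affected_by and the module names are hypothetical and serve only as an illustration; a real system would also track resource and user request dependencies.

```python
from collections import deque


def affected_by(failed, depends_on):
    """Return every module that directly or transitively depends on `failed`."""
    # Invert the graph: map each module to the modules that depend on it.
    dependents = {}
    for module, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, set()).add(module)

    # Breadth-first search outward from the failed module.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for module in dependents.get(current, ()):
            if module not in affected:
                affected.add(module)
                queue.append(module)
    return affected


# Hypothetical module dependencies: each key depends on the modules in its set.
depends_on = {
    "frontend": {"auth", "catalog"},
    "auth": {"database"},
    "catalog": {"cache"},
    "cache": set(),
    "database": set(),
}
impacted = affected_by("database", depends_on)  # the modules to isolate or warn
```

Here a database failure reaches the auth module and, through it, the frontend, while the catalog and cache modules are unaffected and can keep running.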

Online verification and recovery mechanisms
Recovery mechanisms are the means by which a system recovers from failures. They should be well designed, meaning reliable, effective, and efficient. The system should proactively test and verify the behavior of its recovery mechanisms, so that when a real failure occurs the mechanisms can be trusted to do what they were designed to do and aid in the recovery of the system. These verifications should be performed even on production equipment, because that is the equipment whose uptime matters most. There are two methods for performing these tests, and both should be used. The first is the directed test, which is deliberately set up and executed. The second is the random test, which occurs without warning.
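
The two testing methods can be illustrated with a minimal fault-injection sketch in Python. It rests on simplifying assumptions: the Component class, its crash and recover methods, and the component names are hypothetical, and a real recovery mechanism would do far more than flip a flag. A seeded random generator is used here only so that the example is repeatable.

```python
import random


class Component:
    """Hypothetical component with a recovery mechanism to be verified."""

    def __init__(self, name):
        self.name = name
        self.running = True

    def crash(self):
        # Injected fault: simulate the component failing.
        self.running = False

    def recover(self):
        # Recovery mechanism under test.
        self.running = True


def directed_test(component):
    """Deliberately fail a chosen component and check that recovery works."""
    component.crash()
    component.recover()
    return component.running


def random_test(components, rng):
    """Fail a component chosen at random, as an unannounced test would."""
    victim = rng.choice(components)
    return victim.name, directed_test(victim)


components = [Component("web"), Component("queue"), Component("db")]
directed_ok = directed_test(components[0])
victim_name, random_ok = random_test(components, random.Random(0))
```

A directed test names its target in advance, whereas the random test picks one at run time, which exercises recovery paths that nobody planned to check.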

Modularity, measurability and restartability
Software aging problems are best resolved by restarting the affected component, which requires both modularity and restartability. Components should be restarted before they fail, and should be designed to make this possible or, better yet, to do it automatically. Applications should also be designed for restartability.
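
Restarting a component before it fails can be sketched in Python as follows. The example is illustrative only: AgingComponent models software aging as a hypothetical counter of leaked resources, and the limit and threshold values are arbitrary assumptions rather than figures from the ROC project.

```python
class AgingComponent:
    """Hypothetical component that leaks a resource on every request."""

    def __init__(self, leak_per_request=1, limit=10):
        self.leak_per_request = leak_per_request
        self.limit = limit      # the component fails once the leak exceeds this
        self.leaked = 0
        self.restarts = 0

    def handle(self, request):
        self.leaked += self.leak_per_request
        if self.leaked > self.limit:
            raise MemoryError("component exhausted its resources")
        return f"handled {request}"

    def restart(self):
        # A restart returns the component to a clean state.
        self.leaked = 0
        self.restarts += 1


def serve_with_rejuvenation(component, requests, threshold):
    """Restart the component whenever aging reaches the threshold, before it fails."""
    served = 0
    for request in requests:
        if component.leaked >= threshold:
            component.restart()
        component.handle(request)
        served += 1
    return served


component = AgingComponent(leak_per_request=1, limit=10)
served = serve_with_rejuvenation(component, range(25), threshold=8)
```

With the threshold set below the failure limit, all 25 requests are served and the component is restarted three times without ever failing.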

Benchmarks
These systems should undergo frequent dependability and availability benchmarking, which tracks their progress and justifies their use. The benchmarks should be reproducible and should give an impartial measure of system dependability, reliability, and availability.
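
One simple way to express such a measurement is to compute availability as the fraction of an observation period during which the system was up, together with the mean time to repair. The Python sketch below assumes a hypothetical log of outages recorded as (start, end) pairs in seconds; the figures are invented for illustration and are not results from any real benchmark.

```python
def availability(total_seconds, outages):
    """Fraction of the observation period during which the system was up."""
    downtime = sum(end - start for start, end in outages)
    return (total_seconds - downtime) / total_seconds


def mean_time_to_repair(outages):
    """Average outage duration in seconds, or 0.0 if there were no outages."""
    if not outages:
        return 0.0
    return sum(end - start for start, end in outages) / len(outages)


# Hypothetical outage log for one day (86400 seconds).
outages = [(100, 160), (5000, 5030)]
day_availability = availability(86400, outages)
mttr = mean_time_to_repair(outages)
```

Because the calculation depends only on the recorded outage log, anyone with the same log obtains the same figures, which is what makes a benchmark of this kind reproducible.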