User:Sicardg/sandbox

Mean Time Between Operational Mission Failures, Software (MTBOMF-SW)

 * MTBOMF-SW compliance is measured through engineering analysis
 * Field testing and operational usage feedback are also considered
 * A deliberate SWI&T effort focused on reliability needs to be defined in order to pass the 25-hour test (or meet MTBOMF-SW)
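
As a rough illustration of the metric, a mean-time-between-failures figure is usually estimated as total operating time divided by the number of failures counted against it. The sketch below assumes that convention applies here; the function name and the sample numbers are hypothetical and are not taken from the program's requirements.

```python
def mtbomf_hours(operating_hours: float, mission_failures: int) -> float:
    """Point estimate: operating hours per software-caused operational mission failure.

    Assumption: the estimate is total operating time divided by the failure count.
    """
    if operating_hours < 0 or mission_failures < 0:
        raise ValueError("hours and failure count must be non-negative")
    if mission_failures == 0:
        # No failures observed: the point estimate is unbounded.
        return float("inf")
    return operating_hours / mission_failures


# Hypothetical example: 100 operating hours with 4 mission failures.
estimate = mtbomf_hours(100.0, 4)
print(estimate)  # 25.0
```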

Common Critical Software Errors

 * Exception error
   * Failed processing of a message
   * If not handled, termination of the application
 * Memory leak
   * Loss of usable memory for continued processing
   * Potential termination of the application
 * Runaway application (most commonly stuck in a loop) due to bad SW logic
   * Application and processing core(s) are effectively unusable
 * Incorrect SW functionality (faulty logic)

SW Reliability Assumptions

 * Assume hardware is good
   * SW reliability is independent of HW reliability
   * Hardware failures are reported by BIT/CAL
 * The Subsystem SHALL continue to operate while recovering from a degraded condition, without requiring a reboot
 * The Subsystem SHALL be permitted to fail operations when any of the following has failed: X, Y, Z
 * Real-time operational software repair
   * Obtained through SW recovery methods
 * Offline software repair
   * Analysis of logged errors
   * New software/firmware build

Real-time SW Error Detection

 * The OS kernel detects exception errors
 * The secondary State Manager detects a failed heartbeat from the primary State Manager
 * The heartbeat monitor reports failed heartbeats to the State Manager
 * The State Manager receives a FAILED state status or an incorrect state from another CSCI
 * Any CSCI detects a timeout of an active thread
 * The operator observes non-responsiveness or invalid behavior of the Subsystem and requests a restart
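
The heartbeat-based detection above can be sketched as follows. This is a minimal illustration, not the State Manager's actual design: the class name, the component names, and the 2-second timeout are all assumptions. Timestamps are passed in explicitly so the example is deterministic; a real monitor would more likely read a monotonic clock such as `time.monotonic()`.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat per component (hypothetical sketch)."""

    timeout_s: float
    last_seen: dict[str, float] = field(default_factory=dict)

    def beat(self, name: str, now: float) -> None:
        # Record the time at which a component last reported in.
        self.last_seen[name] = now

    def failed(self, now: float) -> list[str]:
        # A component has failed its heartbeat if it has been silent
        # for longer than the timeout.
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=2.0)
monitor.beat("state_manager_primary", now=0.0)
monitor.beat("recording", now=1.5)
print(monitor.failed(now=3.0))  # ['state_manager_primary']
```

In this sketch the monitor only reports which components are late; deciding what to do about them (for example, a secondary taking over) is left to the caller.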

SW Reliability Improvement Methods

 * Offline SW reliability improvement methods
   * Design and code peer reviews / WPIs
   * Coverity static software analysis tool
   * Automated software testing
   * Full path testing on all safety code (no tool has been defined to perform this task)
   * Analyze faults and fix errors in future builds
 * Real-time SW reliability improvement methods
   * Use of Virtual Machines (VMs)
   * Validation of message data (Safety API)
   * Critical functional redundancy (Sys VRC)
   * Additional SW recoverability functionality
   * Logging of faults (Metrics)
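
To illustrate what validating message data before processing might look like, the sketch below checks a frame's message ID, declared length, and checksum, and rejects it before any application logic runs. It is not the Safety API: the frame layout (`msg_id`, payload length, CRC-32), the function names, and the set of known IDs are all hypothetical.

```python
import struct
import zlib

# Hypothetical header: message ID (uint16), payload length (uint16),
# CRC-32 of the payload (uint32), little-endian, no padding.
HEADER = struct.Struct("<HHI")


def encode(msg_id: int, payload: bytes) -> bytes:
    """Build a frame with a header describing the payload."""
    return HEADER.pack(msg_id, len(payload), zlib.crc32(payload)) + payload


def validate(frame: bytes, known_ids: frozenset[int]) -> bytes:
    """Return the payload if the frame is well-formed, else raise ValueError."""
    if len(frame) < HEADER.size:
        raise ValueError("truncated header")
    msg_id, length, crc = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:]
    if msg_id not in known_ids:
        raise ValueError(f"unknown message id {msg_id}")
    if len(payload) != length:
        raise ValueError("payload length does not match header")
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    return payload


frame = encode(7, b"abc")
print(validate(frame, frozenset({7})))  # b'abc'
```

Rejecting a malformed frame with a handled error, rather than letting it reach the processing code, is one way to keep a bad message from turning into an unhandled exception.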

Virtual Machine Use

 * Because a virtual machine is implemented as a Linux process, it leverages the standard Linux security model to provide isolation and resource controls
 * This ensures that a virtual machine's resources cannot be accessed by any other process (or virtual machine). The administrator can extend this model with fine-grained permissions, for example to group virtual machines together so that they share resources
 * Prevents lower-priority or non-critical applications from corrupting higher-priority critical applications
 * See the CSS lead for configuration and use of VMs

SW Recoverability Methods

 * Recoverability restores the system to full operational capability but may contribute to minor software downtime
 * Exception handling with recovery blocks to recover from constraint errors
 * Warm restart
   * EA subsystem external command to restart (operator request)
     * Restart EA subsystem
   * Automated restart: Sys VRC CSCI through FAILED state status and/or AMS status
     * Option 1: Selective restart of non-critical applications
       * DIOP applications
       * BIT/CAL, recording, and logging
     * Option 2: Restart EA subsystem
 * CSCI local thread processing timeout
   * Option 1: Restart non-critical thread if possible
   * Option 2: Report FAILED state
 * Critical functional redundancy of Sys VRC
   * Sys VRC secondary detects Sys VRC primary failed heartbeat
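
The restart options listed above amount to a small escalation policy, which can be sketched as below. This is an illustrative reading of the outline, not the Sys VRC implementation: the enum, the function names, and the rule that any application outside the listed non-critical set triggers a subsystem restart are assumptions.

```python
from enum import Enum
from typing import Callable


class Recovery(Enum):
    RESTART_THREAD = "restart thread"
    REPORT_FAILED = "report FAILED state"
    RESTART_APPLICATION = "restart application"
    RESTART_SUBSYSTEM = "restart EA subsystem"


# Non-critical applications named in the outline; the string labels are hypothetical.
NON_CRITICAL_APPS = frozenset({"DIOP", "BIT/CAL", "Recording", "Logging"})


def on_thread_timeout(is_critical: bool, try_restart: Callable[[], bool]) -> Recovery:
    """Option 1: restart a non-critical thread if possible; Option 2: report FAILED."""
    if not is_critical and try_restart():
        return Recovery.RESTART_THREAD
    return Recovery.REPORT_FAILED


def on_failed_state(app: str) -> Recovery:
    """Option 1: selective restart of a non-critical application; Option 2: restart the subsystem."""
    if app in NON_CRITICAL_APPS:
        return Recovery.RESTART_APPLICATION
    # Assumption: a FAILED state from any other application escalates.
    return Recovery.RESTART_SUBSYSTEM


print(on_thread_timeout(False, lambda: True).value)  # restart thread
print(on_failed_state("Recording").value)            # restart application
```

Trying the least disruptive action first keeps downtime small, at the cost of a possible second recovery step when the local restart does not succeed.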