Failure domain

In computing, a failure domain encompasses a physical or logical section of the computing environment that is negatively affected when a critical device or service experiences problems. To put it another way, failure domains are regions or components of infrastructure that could fail. Each has its own risks and challenges to architect for.

The size of a failure domain and its potential impact depends on the device or service that is malfunctioning. For example, a router potentially experiencing problems would generally create a more significant failure domain than a network switch would. Smaller failure domains reduce the risk of disruption over a large section of a network, and eases the troubleshooting process.

Redundancy within failure domains is a key approach to help mitigate the risks of failure. For example, technologies like RAID helps mitigate the risks of drive failure by creating multiple data copies. Replication helps to mitigate the risks of server or storage array failure.