Outage at UAA
With redundant hardware, a disk failure rarely results in downtime at the system level. System failures do sometimes occur, however, typically when a sequence of very rare events leads to a catastrophic failure. This case describes how a combination of hardware and firmware failures, along with human error, led to the failure of a redundant disk storage unit, which in turn affected several enterprise systems at a major public university. Subsequently, a small number of conservative and seemingly “good” decisions made while restoring the system from backups led to negative outcomes, primarily additional downtime over the course of several days. The case illustrates how even well-considered and conservative decisions may seem flawed in hindsight. An important lesson from the case is that it is difficult to justify to management the provision of sufficient backup resources to guard against very low-probability failure events.