I’m a connoisseur of failure. I love reading about engineering failures of all forms and, unsurprisingly, I’m particularly interested in data center faults. It’s not that I delight in engineering failures. My interest is driven by the belief that the more faults we all understand, the more likely we can engineer systems that don’t suffer from these weaknesses.
It’s interesting, at least to me, that even fairly poorly-engineered data centers don’t fail all that frequently, and really well-executed facilities might go many years between problems. So why am I so interested in understanding the cause of faults even in facilities where I’m not directly involved? Two big reasons: 1) the negative impact of a fault is disproportionately large, and avoiding just one failure could save millions of dollars; and 2) at extraordinary scale, even very rare faults happen surprisingly often.
Today’s example comes from a major US airline last summer and is a great illustration of “rare events happen dangerously frequently at scale.” I’m willing to bet this large airline had never before seen this particular fault and yet, operating at much higher scale, I’ve personally encountered it twice in my working life. This example is a good one because the negative impact is high, the fault mode is well understood, and, although it is a relatively rare event, there are multiple public examples of this failure mode.
Before getting into the details of what went wrong, let’s look at the impact of this failure on customers and the business. In this case, 1,000 flights were canceled on the day of the event, and the negative impact continued for two more days with 775 flights canceled the next day and 90 on the third. The Chief Financial Officer reported that $100m of revenue, roughly 2% of the airline’s worldwide monthly revenue, was lost in the fallout of this event. It’s more difficult to measure the negative impact on brand and customers’ future travel planning, but presumably there was an impact on these dimensions as well.
It’s rare that the negative impact of a data center failure will be published, but the magnitude of this particular fault isn’t surprising. Successful companies are automated and, when a systems failure brings them down, the revenue impact can be massive.
What happened? The report was “switch gear failed and locked out reserve generators.” To understand the fault, it’s best to first understand what the switch gear normally does and how faults are handled, and then dig deeper into what went wrong in this case.
In normal operation, the utility power feeding a data center flows in from the medium-voltage transformers, through the switch gear, and then to the uninterruptible power supplies (UPSs), which ultimately feed the critical load (servers, storage, and networking equipment). During normal operation, the switch gear is simply monitoring power quality.
If the utility power goes outside acceptable quality parameters or simply fails, the switch gear waits a few seconds since, in the vast majority of cases, the power will return before further action needs to be taken. If the power does not return after a predetermined number of seconds (usually less than 10), the switch gear signals the backup generators to start. The generators start, run up to operating RPM, and are usually given a very short period to stabilize. Once the generator power is within acceptable parameters, the load is switched to the generators. During the few seconds required to switch to generator power, the UPS holds the critical load, so the transition is transparent to the servers. When the utility power returns and is stable, the load is switched back to utility and the generators are shut down.
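To make the sequence concrete, here is a minimal sketch of the transfer logic described above. The `utility`, `generators`, and `transfer_switch` interfaces and both timing constants are hypothetical placeholders for illustration, not the API of any real switch gear controller, and actual timings are configured per site.

```python
import time
from enum import Enum

class Source(Enum):
    UTILITY = "utility"
    GENERATOR = "generator"

# Hypothetical timing constants; real gear is configured per site.
UTILITY_RIDE_THROUGH_SECONDS = 10  # wait for utility to return on its own
GENERATOR_STABILIZE_SECONDS = 5    # let the gensets settle at operating RPM

def handle_utility_failure(utility, generators, transfer_switch):
    """Normal failover sequence: ride through brief utility dips, start the
    generators, wait for stable output, then transfer the load. The UPS
    carries the critical load for the few seconds this sequence takes."""
    # Most utility glitches clear within seconds, so wait before acting.
    deadline = time.monotonic() + UTILITY_RIDE_THROUGH_SECONDS
    while time.monotonic() < deadline:
        if utility.power_quality_ok():
            return Source.UTILITY          # utility recovered; nothing to do
        time.sleep(0.1)

    # Utility did not return: start the backup generators.
    generators.start()
    while not generators.at_operating_rpm():
        time.sleep(0.1)
    time.sleep(GENERATOR_STABILIZE_SECONDS)  # short stabilization window

    # Generator output is within acceptable parameters; move the load over.
    transfer_switch.move_load_to(Source.GENERATOR)
    return Source.GENERATOR
```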
The utility failure sequence described above happens correctly almost every time. In fact, it occurs exactly as designed so frequently that most facilities will never see the fault mode we are looking at today. The rare failure mode that can cost $100m looks like this: when the utility power fails, the switch gear detects a voltage anomaly sufficiently large to indicate a high probability of a ground fault within the data center. A generator brought online into a direct short could be damaged. With expensive equipment possibly at risk, the switch gear locks out the generator. Five to ten minutes after that decision, the UPS will discharge and row after row of servers will start blinking out.
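For contrast, the lockout behavior in the failure mode above can be modeled as a small variant of the same sketch. The ground-fault heuristic shown is an assumption for illustration, not any vendor’s actual detection logic.

```python
def handle_utility_failure_with_lockout(utility, generators, transfer_switch):
    """Variant of the earlier sketch that models the lockout behavior: if the
    voltage anomaly looks like an in-facility ground fault, the generators
    are never brought online and nothing picks up the load."""
    if utility.voltage_anomaly_suggests_ground_fault():   # assumed heuristic
        generators.lock_out()  # protect the gensets from closing into a short
        # The UPS now carries the load alone; in five to ten minutes it
        # discharges and the critical load drops.
        return None
    return handle_utility_failure(utility, generators, transfer_switch)
```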
This same fault mode caused the 34-minute outage at the 2013 Super Bowl: The Power Failure Seen Around the World.
Backup generators run around three-quarters of a million dollars, so I understand the switch gear engineering decision to lock out and protect an expensive component. And, while I suspect some customers would want it that way, I’ve never worked for one of those customers, and the airline hit by this fault last summer certainly isn’t one of them either.
There are likely many possible causes of a power anomaly of sufficient magnitude to cause a switch gear lockout, but the two events I’ve been involved with were both caused by cars colliding with aluminum street light poles that subsequently fell across two phases of the utility power. Effectively, an excellent conductor landed across two phases of a high-voltage utility feed.
One of the two times this happened, I was within driving distance of the data center, and everyone I was with was getting massive numbers of alerts warning of a discharging UPS. We sped to the ailing facility and arrived just as servers were starting to go down as the UPSs discharged. With the help of the switch gear manufacturer and a walk through the event logs, we were able to determine what happened. What surprised me was that the switch gear manufacturer was unwilling to make the change to eliminate this lockout condition, even if we were willing to accept all equipment damage that resulted from that decision.
What happens if the generator is brought onto the load rather than locked out? In the vast majority of situations, and in 100% of those I’ve looked at, the fault is outside the building, so the lockout has no value. If there were a ground fault in the facility, the impacted branch circuit breaker would open, the rest of the facility would continue to operate on generator, and the servers downstream of the open breaker would switch to secondary power and also continue to operate normally. No customer impact. If the fault were much higher in the power distribution system without breaker protection, or the breaker failed to open, I suspect a generator might take damage, but I would rather put just under $1m at risk than be guaranteed to drop the load. If just one customer could lose $100m, saving the generator just doesn’t feel like the right priority.
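As a purely illustrative sketch of that alternative policy (and not the custom firmware described below), a no-lockout controller would record the anomaly and proceed with the transfer, relying on the downstream branch breakers to isolate any real fault inside the facility:

```python
def handle_utility_failure_no_lockout(utility, generators, transfer_switch):
    """No-lockout policy: always attempt the transfer and let downstream
    branch breakers isolate a genuine in-facility ground fault. Worst case,
    a generator is put at risk; the critical load is not dropped by design."""
    if utility.voltage_anomaly_suggests_ground_fault():  # same assumed heuristic
        # Note the anomaly for later analysis, but do not lock out.
        transfer_switch.record_event("possible ground fault detected")
    return handle_utility_failure(utility, generators, transfer_switch)
```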
I’m lucky enough to work at a high-scale operator where custom engineering to avoid even a rare fault still makes excellent economic sense, so we solved this particular fault mode some years back. In our approach, we implemented custom control firmware such that we can continue to multi-source industry switch gear, but it is our firmware that makes the load transfer decisions and, consequently, we don’t lock out.