The WMATA Single Point of Failure? Yes, But…

courtesy of ‘Chris Rief aka Spodie Odie’

This morning’s power failure at Metro’s data center took their PA system, payment processors, NextBus and other non-critical services offline. Specifically, a power distribution unit (PDU) at the data center failed, which took a bunch of Metro’s servers offline. So, what does that mean? Shouldn’t they have protection against that sort of thing? I asked these questions of Chuck Goolsbee, VP of Technical Operations at Digital Forest in Seattle. Here’s what he said:

WLDC: Can you talk to me about what kind of power usually goes into a data center, and how you guys avoid this kind of problem?

CG: Well… every facility is different, but generally an average data center has a three-phase power feed, usually 480v, but occasionally 208v, sometimes even higher. Higher voltage is more efficient at traveling distance. From the utility feed to the servers, power takes a pretty standard path. There are usually two sources: the electrical grid (utility) and a backup (generators). They come together at a common point called a transfer switch. These are usually automated, using sensors to switch from one source to another. So, when the Automated Transfer Switch (ATS) senses a loss of utility power, it sends a signal to the backup system to start, and then transfers the path over to the backup source. Backup generators do not run all the time, for obvious reasons.

WLDC: Sure, Diesel isn’t free.

CG: And they take anywhere from 5 to 15 seconds to come alive and provide full power. So, downstream from the ATS is a UPS (Uninterruptible Power Supply), which acts as a temporary source of power while the backup spools up. It can be any short-term power source: batteries, flywheels, drop-towers, whatever. They are designed to run for only as much time as the backup needs to come on. To maintain simplicity and efficiency, all of the above systems are the same voltage (our systems are all 480v). Downstream from the UPS is a device called a Power Distribution Unit, or PDU.

WLDC: That’s what failed today.

CG: Right. These are analogous to a gig/100/10 Ethernet switch. They take a 480v upstream feed and step it down to 208v or 120v or even 48v DC if you have the right machine. They can even do 480v 3-phase straight through. The point is they adapt the power to the circuit you need. They are the last stop of power distribution before it gets to the racks where the servers are. They contain transformers and breakers. From the breakers, circuits of specific phase and voltage go out to the racks. The PDU is DOWNSTREAM from the backup sources. Unless you have a UPS downstream from it, which is bad design, you cannot recover from a failure at this level.
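To make that chain concrete, here’s a rough Python sketch of the power path as Chuck describes it: two sources feeding an ATS, then a UPS, then a PDU, then the racks. The `PowerChain` class and its field names are my own invention for illustration, and the model is deliberately simplified (it ignores timing, UPS bypass modes and anything specific to Metro’s actual setup).

```python
from dataclasses import dataclass


@dataclass
class PowerChain:
    """Hypothetical model of a single (non-A/B) power path, upstream to downstream."""
    utility_up: bool = True     # electrical grid feed
    generator_up: bool = True   # backup source behind the ATS
    ats_ok: bool = True         # automated transfer switch
    ups_ok: bool = True         # bridges the gap while the generator starts
    pdu_ok: bool = True         # last stop before the racks

    def racks_powered(self) -> bool:
        # The two sources are alternatives: either one can feed the ATS.
        source_available = self.utility_up or self.generator_up
        # Everything after the sources sits in series, so any one
        # failed stage cuts power to the racks.
        return source_available and self.ats_ok and self.ups_ok and self.pdu_ok


if __name__ == "__main__":
    print(PowerChain(utility_up=False).racks_powered())  # grid outage
    print(PowerChain(pdu_ok=False).racks_powered())      # PDU failure
```

Losing the grid is survivable in this model because the generator is an alternative source. Losing the PDU is not, because nothing sits between it and the racks to take over.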

WLDC: A failure in the PDU would be a kill switch to all the machines therein.

CG: Yes. Unless, that is, you run FULL A/B power: fully redundant power paths that never meet EXCEPT at servers with multiple power supplies. But that is 4X more expensive to build and maintain. Hyper-critical systems get A/B. FULLY redundant systems are phenomenally expensive, way more than the market will bear. Very few people will pony up the cash for it because, like I said, it costs 4X more.

WLDC: This wasn’t a complete Systems Fail, People Die problem.

CG: Exactly.
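For the curious, the A/B idea Chuck mentions can be sketched the same way. This is a minimal illustration, assuming two independent paths named "A" and "B" that each have their own ATS, UPS and PDU; the function names and the way failures are represented are hypothetical, not anything from Metro or Digital Forest.

```python
# Stages that each independent power path is assumed to contain.
STAGES = ("ats", "ups", "pdu")


def path_ok(failed, path):
    """A path delivers power only if none of its own stages has failed.

    `failed` is a set of (path, stage) pairs, e.g. {("A", "pdu")}.
    """
    return all((path, stage) not in failed for stage in STAGES)


def server_powered(failed, feeds):
    """`feeds` lists the paths the server's power supplies plug into.

    A single-supply server has one feed, e.g. ("A",); a dual-supply
    server on full A/B power has two, ("A", "B"), and stays up as long
    as at least one of them is healthy.
    """
    return any(path_ok(failed, path) for path in feeds)


if __name__ == "__main__":
    failed = {("A", "pdu")}  # the kind of failure described above
    print(server_powered(failed, ("A",)))
    print(server_powered(failed, ("A", "B")))
```

The paths only meet at the server’s power supplies, so a dead PDU on path A takes down the single-feed server but not the dual-feed one. The catch, as Chuck says, is that you are building and maintaining everything twice.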

Tom Bridge
