One of the most troubling things about the information emerging about this faulty sensor is the ways in which Metro indicates they might have expected to detect it. John Catoe’s press release from July 1 described the situation somewhat vaguely. “This is not an issue that would have been easily detectable to controllers in our operations control center. What the analytical profile showed was that the track circuit would fail to detect a train only for a few seconds and then it appeared to be working again.” Why it wouldn’t be easily detectable isn’t clear from his statement, but a Washington Post piece from July 2 credited the following information to Metro’s rail chief, Dave Kubicek.
Instead of completely failing, the track circuit “fluttered” on and off so quickly that, Kubicek said, the failure would not have been obvious in Metro’s downtown operations center, where controllers monitor real-time movement of trains by watching an illuminated graphic depiction of the 106-mile railroad.
“It was happening so fast, you would just blink and miss it,” he said. “Realistically, you had to be looking at the exact area at the exact place” at the exact time.
A controller would have to be staring at something the size of “a button on a BlackBerry.”
A fair number of engineers are going to read this section of text and grind their teeth, but the underlying problem isn’t intuitive to most people. If you eavesdropped on a conversation between two grad students considering writing about this situation for a paper you might hear them say something like this:
Metro’s problem here revolves around the challenge of displaying a digital result by an analog method and the inability to detect a problem using insufficiently granular data.
That’s a complicated phrase which you can explain with a $5 table lamp.
Put in plain English, the problem is that they’re talking about trying to use a system designed to tell you one thing to determine another. That table lamp is meant to light up your room, not tell you if your power is flowing steadily. At one time or another we’ve all realized that the power is out because a lamp isn’t lit, but that’s a very general determination and Catoe and Kubicek are talking about very exact determinations.
Try a little experiment: if your lamp is one that you can turn on and off by turning a knob in a single direction, twist it as far and fast as you can. If you can do it fast enough and stop in the same on/off condition you started in you might not see a flicker.
You know the switch actually does stop the power from going to the light. Turn it just once and you’re sitting in the dark. But if the period of time where it’s off is so brief that the glowing bulb doesn’t have a chance to dim then you can’t tell it ever happened.
If you’ve got fluorescent bulbs you can see an even more extreme version of this. Turn the power on and off quickly enough and these slower-to-start bulbs might never get time to put out any light. The power was on. You just couldn’t tell because the indicator you were using – light coming out of the bulb – couldn’t alter its appearance as quickly as the power could be turned on and off again. There was power, but your signaling mechanism didn’t respond quickly enough for you to be able to tell.
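If you’d rather see the lamp experiment in numbers, here is a minimal sketch in Python. It treats the bulb as an indicator that closes only a fraction of the gap to its target brightness on each time step. The function name, the `response` value and the step counts are all made up for illustration; this is not a physical model of any real bulb, and it says nothing about how Metro’s display panel actually behaves.

```python
def simulate_brightness(power, response=0.1, start=1.0):
    """Return the brightness trace of a lamp that reacts slowly to power.

    power    -- sequence of 0/1 samples, one per time step
    response -- fraction of the gap to the target closed each step
                (an assumed, illustrative value)
    start    -- brightness before the first sample
    """
    brightness = start
    trace = []
    for p in power:
        # Move part of the way toward "fully on" (1) or "fully off" (0).
        brightness += response * (p - brightness)
        trace.append(brightness)
    return trace


# Power is on, drops out for two steps, then comes back.
short_dropout = [1] * 5 + [0] * 2 + [1] * 5
lowest = min(simulate_brightness(short_dropout))
print(f"lowest brightness during the dropout: {lowest:.2f}")  # 0.81

# The same lamp during a long outage does go dark.
long_outage = [1] * 5 + [0] * 60
print(f"brightness after a long outage: {simulate_brightness(long_outage)[-1]:.2f}")
```

With these assumed numbers the power really was off for two steps, but the lamp never dropped below about four-fifths of full brightness. Anyone judging the power by the lamp alone would have a hard time noticing.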
For the most part this isn’t a problem. Your lamp isn’t meant to be an instant indicator that power is on or off, it’s meant to give out light. It needs to power up reasonably quickly when you flip the switch but a difference in tenths of seconds isn’t too significant. Similarly, the Metro system uses those lights on their panel to indicate a train is present or not. Perhaps it takes a tenth of a second for the light to indicate a train has entered the segment but WMATA knows how long the delay is before a sensor result is accurate. Presumably they create procedures that build in that delay, like standard train following distances.
You probably do this in your own life. Perhaps you tell your spouse “call me before you leave the office so I can start dinner” because you know it will take them thirty minutes to get there and forty to put dinner on the table. You wouldn’t say that to someone commuting by train from New York City, however, because it doesn’t give you an indication of where they’ll be in the amount of time you care about. In that situation you might say “call me when you pull into Union Station.”
The troubling thing in this case is that Catoe and Kubicek are indicating that the problem couldn’t be detected because the intermittent sensor result happened too quickly for the change to be detected in the signal lights. If that’s the only way this result could have been noticed then that is not a failure in the signaling device, that’s a problem in their error-detection systems.
The lights on that panel do what they’re supposed to do, and have some recognition of complete sensor failure. The problem described here is one of a sensor doing what it’s supposed to do – indicate that a train is or is not present – but doing it at a speed that makes no sense. It’s not plausible for a train to be above the sensor one second, not there the next, then back again a second later. It’s perfectly reasonable for a train to be present, gone two minutes later, then present again two minutes after that.
Not only is it not reasonable to try to detect this kind of problem with that display system, but a flat panel showing where trains are at this moment doesn’t accurately duplicate what happens with trains and how they interact with the same data.
Here’s a section from a 1996 NTSB document regarding a train collision in Shady Grove [pdf]. On page 16 it describes how these WEE-Z systems behave and operate.
Tuned impedance devices known as WEE-Z bonds provide block separation. These WEE-Z bonds inject into the track coded AF signals that detect the presence of a train in the block and automatically transmit limiting and regulated speeds to passing trains. There is generally one track circuit per block, with WEE-Z bonds located at each end of each track circuit.
The pertinent point there is that it signals information to passing trains, meaning that once the train has passed the sensor it no longer matters what the sensor says. The controllers downtown are seeing information from now, minus whatever delay exists in the system. A passing train’s information is from a particular instant and its age is dependent on how long ago it passed over that segment.
As before, this isn’t necessarily a problem; each system sees what it needs to see to accomplish what needs to be done. Controllers need to see the status of all segments in as close to real time as possible. A train is only concerned with the condition of the track segment immediately in front of it. However, if the train sees that information at the instant it is inconsistent, it doesn’t matter what it said before and after, so the observations of the controllers aren’t useful unless they look at the same instant the train did, as well as before and after.
To be useful in spotting this problem, a person or system has to watch the output from this sensor over the course of several seconds. The failure is the impossible condition of a train quickly being there, then not there, then there again. That’s not something you can indicate with a single light that glows on or off; it’s a change over time.
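Watching for a change over time is the kind of job software does well. Here is a minimal sketch in Python of what such a check might look like. It assumes the sensor output is available as a list of timestamped occupied/vacant readings, and it flags any vacant stretch that ends sooner than a train could plausibly have left and been replaced. The function name, the data layout and the 30-second threshold are all hypothetical; nothing here describes how WMATA’s systems actually record or process track circuit data.

```python
def find_flutter(readings, min_gap_s=30.0):
    """Return (start, end) times of implausibly short vacant spans.

    readings  -- list of (time_in_seconds, occupied) pairs sorted by time
    min_gap_s -- shortest vacancy between two trains we are willing to
                 believe (an assumed threshold, not a Metro figure)
    """
    suspicious = []
    vacant_since = None    # when the current vacant span began, if any
    seen_occupied = False  # ignore vacancy before the first train shows up
    for t, occupied in readings:
        if occupied:
            # A train "reappeared": was it gone long enough to be believable?
            if vacant_since is not None and t - vacant_since < min_gap_s:
                suspicious.append((vacant_since, t))
            vacant_since = None
            seen_occupied = True
        elif seen_occupied and vacant_since is None:
            vacant_since = t
    return suspicious


readings = [
    (0, True), (1, True),
    (2, False),              # train "vanishes"...
    (3, True), (4, True),    # ...and is back one second later
    (130, False),            # train really leaves
    (250, True),             # next train arrives two minutes later
]
print(find_flutter(readings))  # [(2, 3)]
```

The one-second disappearance is flagged and the two-minute gap is not, which is the distinction a single light on a panel can’t show.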
It’s too soon to know exactly how this sensor played into the crash, or if it did at all. However, the claim that the situation couldn’t reasonably be noticed by controllers downtown is worrisome. Expecting those systems to serve a secondary, unintended purpose by requiring humans to notice unusual patterns in lights isn’t good practice. If the system’s signals can be detected and used to drive a display, then other systems can monitor those same signals for that impossible result and flag this kind of failure.
It’s possible this is an unknown failure mode for those sensors – a way that nobody has seen them malfunction before – in which case it’s understandable that WMATA didn’t have systems watching for it. However, if the controllers could be expected to see an odd blinking behavior with longer intervals and report it… that implies this is something they knew they needed to worry about. I hope for everyone’s sake that’s not the case.