A typical story in Charles Perrow’s Normal Accidents: Living with High-Risk Technologies runs like this.
We start with a plant, airplane, ship, biology laboratory, or other setting with a lot of components (parts, procedures, operators). Then we need two or more failures among components that interact in some unexpected way. No one dreamed that when X failed, Y would also be out of order and the two failures would interact so as to both start a fire and silence the fire alarm. Furthermore, no one can figure out the interaction at the time and thus know what to do. The problem is just something that never occurred to the designers. Next time they will put in an extra alarm system and a fire suppressor, but who knows, that might just allow three more unexpected interactions among inevitable failures. This interacting tendency is a characteristic of a system, not of a part or an operator; we will call it the “interactive complexity” of the system.
For some systems that have this kind of complexity, … the accident will not spread and be serious because there is a lot of slack available, and time to spare, and other ways to get things done. But suppose the system is also “tightly coupled,” that is, processes happen very fast and can’t be turned off, the failed parts cannot be isolated from other parts, or there is no other way to keep the production going safely. Then recovery from the initial disturbance is not possible; it will spread quickly and irretrievably for at least some time. Indeed, operator action or the safety systems may make it worse, since for a time it is not known what the problem really is.
Take this example:
A commercial airplane … was flying at 35,000 feet over Iowa at night when a cabin fire broke out. It was caused by chafing on a bundle of wire. Normally this would cause nothing worse than a short between two wires whose insulations rubbed off, and there are fuses to take care of that. But it just so happened that the chafing took place where the wire bundle passed behind a coffee maker, in the service area in which the attendants have meals and drinks stored. One of the wires shorted to the coffee maker, introducing a much larger current into the system, enough to burn the material that wrapped the whole bundle of wires, burning the insulation off several of the wires. Multiple shorts occurred in the wires. This should have triggered a remote-control circuit breaker in the aft luggage compartment, where some of these wires terminated. However, the circuit breaker inexplicably did not operate, even though in subsequent tests it was found to be functional. … The wiring contained communication wiring and “accessory distribution wiring” that went to the cockpit.
As a result:
Warning lights did not come on, and no circuit breaker opened. The fire was extinguished but reignited twice during the descent and landing. Because fuel could not be dumped, an overweight (21,000 pounds), night, emergency landing was accomplished. Landing flaps and thrust reversing were unavailable, the antiskid was inoperative, and because heavy breaking was used, the brakes caught fire and subsequently failed. As a result, the aircraft overran the runway and stopped beyond the end where the passengers and crew disembarked.
As Perrow notes, there is nothing complicated in putting a coffee maker on a commercial aircraft. But in a complex interactive system, simple additions can have large consequences.
Accidents of this type in complex, tightly coupled systems are what Perrow calls a “normal accident”. When Perrow uses the word “normal”, he does not mean these accidents are expected or predictable. Many of these accidents are baffling. Rather, it is an inherent property of the system to experience an interaction of this kind from time to time.
While it is fashionable to talk of culture as a solution to organisational failures, in complex and tightly coupled systems even the best culture is not enough. There is no improvement to culture, organisation or management that will eliminate the risk. That we continue to have accidents in industries with mature processes, good management and decent incentives not to blow up suggests there might be something intrinsic about the system behind these accidents.
Perrow’s message on how we should deal with systems prone to normal accidents is that we should stop trying to fix them in ways that only make them riskier. Adding more complexity is unlikely to work. We should focus instead on reducing the potential for catastrophe when there is failure.
In some cases, Perrow argues that the potential scale of the catastrophe is such that the systems should be banned. He argues nuclear weapons and nuclear energy are both out on this count. In other systems, the benefit is such that we should continue tinkering to reduce the chance of accidents, but accept they will occur despite our best efforts.
One possible approach to complex, tightly coupled systems is to reduce the coupling, although Perrow does not dwell deeply on this. He suggests that the aviation industry has done this to an extent through measures such as corridors that exclude certain types of flights. But in most of the systems he examines, decoupling appears difficult.
Despite Perrow’s thesis being that accidents are normal in some systems, and that no organisational improvement will eliminate them, he dedicates a considerable effort to critiquing management error, production pressures and general incompetence. The book could have been half the length with a more focused approach, but it does suggest that despite the inability to eliminate normal accidents, many complex, tightly coupled systems could be made safer through better incentives, competent management and the like.
Other interesting threads:
Normal Accidents was published in 1984, but the edition I read had an afterword written in 1999 in which Perrow examined new domains to which normal accident theory might be applied. Foreshadowing how I first came across the concept, he points to financial markets as a new domain for application. I first heard of “normal accidents” in Tim Harford’s discussion financial markets in Adapt. Perrow’s analysis of the upcoming Y2K bug under his framework seems slightly overblown in hindsight.
The maritime accident chapter introduced (to me) the concepts of radar assisted collisions and non collision course collisions. Radar assisted collisions are a great example of the Peltzman effect, whereby vessels that would have once remained stationary or crawled through fog now speed through. The first vessels with radar were comforted that they could see all the stationary or slow-moving obstacles as dots on their radar screen. But as the number of vessels with radars increased and those other dots also start moving with speed, we have more radar assisted collisions. On non collision course collisions, Perrow notes that most collisions involve two (or more) ships that were not on a collision course, but on becoming aware of each other managed to change course to effect a collision. Coordination failures are rife.
Perrow argues that nuclear weapon systems are so complex and prone to failure that there is inherent protection against catastrophic accident. Not enough pieces are likely to work to give us the catastrophe. Of course, this gives reason for concern about whether they will work when we actually need them (again, maybe a positive). Perrow even asks if complexity and coupling can be so problematic that the system ceases to exist.
Perrow spends some time critiquing hindsight bias in assessing accidents. He gives one example of a Union Carbide plant that received a glowing report from a US government department. Following an accidental gas release some months later, that same government department described the plant as accident waiting to happen. I recommend Phil Rosenzweig’s The Halo Effect for a great analysis of this problem in assessing the factors behind business performance after the fact.