
In the story he points out that there was no kill switch.

And, as has been found in other disasters in other industries, kill switches are hard to test.



Long ago in a previous life, I worked in a factory that made PVC products, including plastic house siding. One of my co-workers got his arm caught in a pinch roller while trying to start a siding line by himself. There was a kill switch on the pinch roller - six feet away and to his left, when his left arm was the one that was caught. Broke every bone in his arm, right up to his collarbone.

He screamed for help, but no one could hear him over the other noisy machinery. Welcome to the land of kill switches.


Yikes. That reminds me of The Machinist.


It feels more and more like the only responsible way to engineer systems is with a built-in, always-on-in-production chaos monkey that is constantly killing various parts of them. Normally this is done to ensure that random component failure results in no visible service interruption, but in this situation, you'd also be able to reuse the same "apoptosis" signal the chaos monkey sends to just kill everything at once.
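
To make the "reuse the same signal" idea concrete, here is a toy sketch, not a real chaos-monkey implementation. The names Worker, Fleet, chaos_step, and kill_switch are made up for illustration, and "killing" a worker is just flipping a flag. The only point is that the kill switch goes through exactly the code path the chaos monkey exercises all day, so it is never an untested path.

```python
import random


class Worker:
    """Hypothetical component; in a real system this would be a process or host."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def apoptosis(self):
        # The single kill path, shared by the chaos monkey and the kill switch.
        self.alive = False


class Fleet:
    def __init__(self, names, rng=None):
        self.workers = [Worker(n) for n in names]
        self.rng = rng or random.Random()

    def chaos_step(self):
        """Kill one randomly chosen live worker; return its name, or None if none are live."""
        live = [w for w in self.workers if w.alive]
        if not live:
            return None
        victim = self.rng.choice(live)
        victim.apoptosis()
        return victim.name

    def kill_switch(self):
        """Send the same apoptosis signal to every worker at once."""
        for w in self.workers:
            w.apoptosis()
```

In production the chaos step would run on a schedule and a supervisor would restart the victims; both are left out here to keep the sketch short.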


Everything should be written crash-only[1]. That way nobody has to worry about pulling the plug at any time.

[1] http://en.wikipedia.org/wiki/Crash-only_software


Crash-only is nice and all but you can't crash the other side of a socket...

Like you couldn't crash a steel mill controller and expect the process equipment to be magically free of solidified metal. It only means the servers will come back up with a consistent state.


"Crash-only engineering" is a method of systems engineering, not device engineering; it only works if you get to design both sides of the socket.

In the case of a system that needs hard-realtime input once it gets going (like milling equipment), the "crash-only" suggestion would be for it to have a watchdog timer to detect disconnections, and automatically switch from a "do what the socket says" state to a "safe auto-clean and shutdown" state.

In other words, crash-only systems act in concert to push the consequences of failure away from the site of the failure (the server) and back to whoever requested the invalid operation (the client). If the milling controller crashes, the result would be a mess of waste metal ejected from the temporarily-locked-up-and-ignoring-commands process equipment. The equipment would be fine; the output product (and the work area, and maybe the operators if they hadn't been trained for the failure case) would not be.
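
A rough sketch of the equipment-side watchdog described above, under some assumptions: the class name, the two mode names, and the timeout value are all hypothetical, and the actual "auto-clean" work is left out entirely. The clock is injectable only so the behaviour can be checked without sleeping.

```python
import time
from enum import Enum


class Mode(Enum):
    FOLLOW_SOCKET = "do what the socket says"
    SAFE_SHUTDOWN = "safe auto-clean and shutdown"


class WatchdogController:
    """Hypothetical equipment-side state machine; all names are illustrative."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.mode = Mode.FOLLOW_SOCKET
        self.last_heard = clock()

    def poll(self):
        """Call periodically from the equipment's own loop; returns the current mode."""
        silent_for = self.clock() - self.last_heard
        if self.mode is Mode.FOLLOW_SOCKET and silent_for > self.timeout_s:
            # The controller went quiet: stop trusting the socket.
            self.mode = Mode.SAFE_SHUTDOWN
        return self.mode

    def on_command(self, command):
        """Call for each message from the controller; returns True if it was accepted."""
        if self.poll() is Mode.SAFE_SHUTDOWN:
            return False  # locked up and ignoring commands while shutting down
        self.last_heard = self.clock()
        return True
```

Once the watchdog trips, the equipment ignores the socket even if the controller comes back mid-shutdown. That matches the "locked-up-and-ignoring-commands" behaviour described above; when and how to leave that state is a separate design decision not covered here.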


At first I thought you had written "rocket" instead of "socket", which would also make a lot of sense.


And as a general rule of thumb, what the other end of the rocket should be doing is pointing down.


A bit trivial, but actually a rocket flies sideways, not straight up.

If you go straight up, you just fall back down. To get into orbit you need to fly sideways.


Nah. Sometimes you want it to crash too.


The story also points out that there were no emergency procedures. While not as instantaneous as a kill switch, known good procedures could have significantly reduced the final effect.



