
I have gained such respect for the FAA and the NTSB: every crash is followed up with a root cause analysis that all other industries should be in awe of.

When this is all over, we'll know exactly what went wrong. We'll see the FAA issue guidance or rule changes that will ensure it doesn't happen again. And then we'll somehow see an even further reduction in airline crashes.

I hope that in a few more years, as more companies join in, we'll see a similar pattern start to evolve in the space launch industry.



Not root cause analysis. Complex systems fail in complex ways. If you want to read more about how this is done, try Normal Accidents (Perrow), Human Error (Reason)... but you're going to land on Engineering a Safer World (Leveson) eventually. You can get a free copy from the author's web site: http://sunnyday.mit.edu/safer-world/index.html

There are tech shops that use these sorts of techniques for every major incident, and for collections of related near-misses. If you want to improve past 4 9s of reliability, or if you're considering taking traffic that could wreck a life if its availability or secrecy fails, that only seems appropriate.
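
(For context, "4 9s" is 99.99% availability. A back-of-the-envelope sketch of what each extra nine means in allowed downtime, my own numbers rather than anything from the parent:)

    # Allowed downtime per year for N nines of availability.
    SECONDS_PER_YEAR = 365 * 24 * 3600

    for nines in range(2, 6):
        unavailability = 10 ** -nines              # e.g. 4 nines -> 0.0001
        downtime_min = SECONDS_PER_YEAR * unavailability / 60
        print(f"{nines} nines: ~{downtime_min:.0f} minutes of downtime per year")

    # 2 nines: ~5256 min (about 3.7 days), 3 nines: ~526, 4 nines: ~53, 5 nines: ~5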


The Navy likes to keep things simple and calls it the Swiss cheese model.[1] It's rarely a single factor, but a combination of factors. The holes line up and an accident happens.
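
(A toy way to see why layered defenses work, assuming the layers fail roughly independently; this is my own illustration, not the Navy's math:)

    # Swiss cheese model, toy version: an accident requires every layer's
    # "hole" to line up at once. For independent layers, the combined
    # probability is the product of the per-layer failure probabilities.
    layer_failure_probs = [0.05, 0.02, 0.10, 0.01]   # four hypothetical defenses

    p_accident = 1.0
    for p in layer_failure_probs:
        p_accident *= p

    print(p_accident)   # ~1e-06: several mediocre layers beat one very good one

The flip side is the interesting part: when the same schedule pressure or culture problem punches holes in several layers at once, they stop being independent and the multiplication stops working.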

I think it’s also important to emphasize that getting to this level of safety is so much more cultural than technical. The ability to be open about failure and for people to feel safe in communicating to investigators is critical. The Navy would run two parallel investigations into an incident, one focused on Safety, where anything said was confidential and couldn’t be used in the administrative investigation (FNAEB) that could result in career ending consequences.[2]

Though I worry a bit with the constant wars and punishing deployments this culture may be heading in the wrong direction. [3] One of my XOs said, “One generation of Admirals will make their names breaking the Navy, and the next will make their names putting it back together.”

I have immense respect for the FAA and NTSB. They are trying to infuse this safety ethos into the fledgling drone industry. No easy task.

1. https://en.wikipedia.org/wiki/Swiss_cheese_model
2. https://www.airwarriors.com/community/threads/my-fnaeb-exper...
3. https://www.theatlantic.com/technology/archive/2018/10/the-n...


I just wanted to reiterate that I've found the Swiss cheese model one of the simplest and most effective tools for thinking about, and communicating to laypeople, how events happen that they want to attribute to a single 'cause', factor, or decision. In my experience it can be doubly difficult to get people to think in this way once culture has got them thinking differently (I despise regression analysis now, because it breeds in laypeople both an illusion of understanding and a belief that individual variables really have individual coefficients that can simply be manipulated).

My first introduction to it was watching a documentary on air crash investigation as a child, and I've never forgotten it since.

I don't even work in 'safety' per se (currently I do analytical and data work for banks and financial regulation, so one could argue that is a kind of safety), but it just keeps coming up again and again because in the real world, most big human disasters are multi-causal chains, because evolution (in a social sense) will breed out those disasters that are both big in their detriments and simple in their causes.


The JAGMAN investigation is the legal/administrative investigation.

The Safety Investigation Board is separate, and privileged.

FNAEB investigates the crew, and usually follows a Class A mishap ($2m damage or hull loss, or death/permanent disabling injury), but it can also be triggered by a near-miss, pilot flathatting, etc.

A friend who is a retired Naval aviator said the #1 reason aviators get permanently grounded from the fleet is refusal to accept responsibility for mistakes that they were clearly responsible for. Things like forgetting to set the altimeter correctly before takeoff, causing a serious near-miss or pilot deviation. Honestly and fully admitting their mistakes is more likely to result in a FNAEB returning them to the fleet. The Asoh Defense comes to mind.[0]

[0] https://en.wikipedia.org/wiki/Japan_Airlines_Flight_2#The_%2...


That's a great example of a safety practice that nets out unsafe: better to ask why the altimeter requires manual setting, and why the plane can even be started without it!


Thanks for the reminder: three investigations. It's been a while, and FNAEB was always a terrifying word that imprinted on my brain. Accepting responsibility was definitely a key part of Naval aviation.


I do wonder about balance in drone safety, though...

...for instance, Zipline is doing amazing things in Rwanda improving the medical (okay, blood) logistics system drastically. They're legitimately saving lives regularly. But what they're doing is practically illegal in the US due to the way safety regulations are put together (although they have indeed achieved a very high degree of safety and work closely with ATC, etc... it's not the wild west). Some of that is logical... Rwanda is a developing country with a great need that overcomes a lot of safety concerns. But doubtless lives could be saved if similar drones were allowed in the US.

Second: FAA regulations have already slowed the development of electric aircraft in the US, which has significant climate consequences and thus can indirectly cost lives...

...that all said, I think the NTSB and FAA do a good job. Particularly the NTSB.


Zipline is doing good work. I think we will get there; it's just going to take time. Acting FAA administrator Dan Elwell gave a great speech at InterDrone last year about the safety aspect.[1] Personally, I think in some ways parts of the industry are slowing down progress. State and local governments will be key players, and the FAA knows this, but industry seems to believe they shouldn't be involved and that below 400' should be like Class E & G airspace with no rights for property owners. If the Uniform Law Commission settles on 200', that could be a good thing.[2] Then, assuming the FAA authorizes a system as safe, private property owners could conduct BVLOS flights over their property (e.g. ranches, mines). Or, assuming state and local governments are involved, they could authorize flights over public routes. The airspace below 400' will more likely resemble Class A airspace than anything else and will have to involve state and local governments in planning. No magical UTM solution will solve it alone.[3] It will be a combination of technology and operations.

[1] https://www.interdrone.com/news/dan-elwell-speaks-to-audienc...

[2] https://unmanned-aerial.com/drone-industry-responds-to-draft...

[3] https://www.utm.arc.nasa.gov/upp-industry-workshop/UTM%20PP%...


The US already has a good blood distribution system so I doubt many lives would be saved by adding drones. Our major problem is convincing enough eligible people to donate.


The US, like the rest of the world, has about a 7 percent spoilage rate. Rwanda, due to the just-in-time drone delivery network covering the ~entire country, has virtually zero. And rural hospitals in the US have more of a logistics challenge than you might think. Quality and access to care suffers a lot, and maternal mortality has actually gone up in recent years. Something like Zipline could significantly help.


>Though I worry a bit with the constant wars and punishing deployments this culture may be heading in the wrong direction.

Maybe 'constant wars' is already the wrong direction.


If anyone is interested in the book, its web page has moved.

The new MIT Press web page is: https://mitpress.mit.edu/books/engineering-safer-world

Since the book is open access, the direct link to the book (again from the MIT Press web page) is https://www.dropbox.com/s/dwl3782mc6fcjih/8179.pdf?dl=1


Engineering a Safer World was a truly career-changing book for me, and I am so thankful to bts and mstone for introducing it to me.


Would you say it is highly relevant for those interested in AI safety?


What's your threat model when you say "AI safety"? Which scenarios are you attempting to prevent?


Imagine if we did that with software generally.

Every time there's a data breach, an agency would investigate with the same thoroughness they do in aviation.

Then, National Geographic could have a spin-off show called Data Breach Investigation (their aviation show is called Air Crash Investigation in the UK; I think it's called different things in different countries).

A man can dream...


We would be limited to 20 pieces of software that were very reliable.

The problem with software, which makes it so unlike hardware, is that we can create so much of it; the diversity is huge.


> ... software, ... unlike hardware, ...

Everything is software now. So much firmware (which is software), and so much hardware that cannot run without software. Your refrigerator, your dishwasher, your car, airliners -- the trend is for everything to be software now.


We can build reliable software when we want to. It just has limited scope and/or undergoes more expensive development procedures.

It is easy to build a finite state machine for a dishwasher that is fairly reliable. Likewise, consumer jet code that hits safety considerations is heavily tested and audited...and the FAA cares about it whenever it causes a plane to crash, leading to a lot of regulations on how the code is vetted.
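
(For a sense of what that looks like, a toy sketch of such a dishwasher state machine; the states and events are made up for illustration:)

    # A toy dishwasher controller as an explicit finite state machine.
    # Any (state, event) pair not listed is simply ignored, which is a big
    # part of what makes this style easy to test exhaustively.
    TRANSITIONS = {
        ("idle",    "start_pressed"): "filling",
        ("filling", "tank_full"):     "washing",
        ("washing", "wash_done"):     "rinsing",
        ("rinsing", "rinse_done"):    "drying",
        ("drying",  "dry_done"):      "idle",
    }

    def step(state: str, event: str) -> str:
        if event == "door_opened":               # safety interlock from any state
            return "idle"
        return TRANSITIONS.get((state, event), state)

    assert step("idle", "start_pressed") == "filling"
    assert step("washing", "door_opened") == "idle"
    assert step("washing", "tank_full") == "washing"   # irrelevant event, no change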


Sure, software is everywhere, but the physical properties of hardware are fundamentally different when it comes to QA, risk analysis, general testing.


But when all your hardware has firmware, you've lost that advantage.


Depends how complex the firmware is.


The trick is not to create reliable software. The trick is to create systems that can tolerate complete failure of the software. This is the secret to airplane reliability.

(Although they still do their best to make the software reliable, they do not bet lives on its reliability.)

Safe Systems from Unreliable Parts https://www.digitalmars.com/articles/b39.html

Designing Safe Software Systems part 2 https://www.digitalmars.com/articles/b40.html
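
(A crude sketch of that philosophy; the envelope-checking monitor here is my own toy example, not something from the linked articles or any real flight system:)

    # The primary controller may be arbitrarily buggy. An independent, much
    # simpler monitor enforces a safe envelope and substitutes a safe value
    # if the primary crashes or misbehaves, so safety does not depend on the
    # primary being correct.
    def primary_controller(sensor: float) -> float:
        return sensor * 1.8 - 32.0       # pretend this is complex and possibly wrong

    SAFE_MIN, SAFE_MAX, SAFE_DEFAULT = -10.0, 10.0, 0.0

    def safe_command(sensor: float) -> float:
        try:
            cmd = primary_controller(sensor)
        except Exception:
            return SAFE_DEFAULT          # tolerate the software failing outright
        if not (SAFE_MIN <= cmd <= SAFE_MAX):
            return SAFE_DEFAULT          # tolerate it giving a wrong answer, too
        return cmd

    assert safe_command(20.0) == 4.0     # 20 * 1.8 - 32 = 4.0, inside the envelope
    assert safe_command(100.0) == 0.0    # result is outside, so the monitor overrides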


Your comment reminds me of this article about the Space Shuttle code that I come back to every few years: https://www.fastcompany.com/28121/they-write-right-stuff


This popped up recently: https://news.ycombinator.com/item?id=19180280

It's about a guy doing similar things, with model rocketry.

It seems, though, that in the latter the resiliency is all due to field tests and spectacular failures. (That doesn't detract from the project at all.)


I'm curious whether they applied the same level of scrutiny to the libraries, compilers, and operating system their code used. And firmware.


All of that probably has to be done in order for the AP-101 to be flight certified. As far as I can recall, they wrote their own operating system and didn't rely on any external libraries.

Even when it was new, the AP-101 was part of a family of flight computers that NASA had already been using, and the same is true for the HAL/S compiler used for most of the code.

If you want to explore how it all worked, this SE thread contains some great information and also serves as a good launching point for further research:

https://space.stackexchange.com/questions/19006/what-operati...


People named a computer HAL, and sent it to space??


One of the tools that helped people and other computers into spaces was named, in part, HAL


I like the sentiment, but there are big differences.

In software, you can poke/prod away at the running application to get it to do something it shouldn't. To me, this is a bit like yanking wires that you know are important to see what happens to the plane in mid-flight. Another anecdote I use is to imagine pulling the steering wheel off a car while someone's driving it.

I think modern software testing and security analysis is probably far and away better than what the NTSB provides, because we already have systems in place to cope with crashes.


Imagine if we did that with car crashes, which kill 40,000 people every single year in the USA.


NTSB investigates hundreds of "major" highway accidents per year. Typically accidents with multiple fatalities and at least one of the vehicles being operated for hire. For example NTSB investigated the Schoharie limousine crash where 20 people died.

Graduated drivers license laws for young drivers, age-21 drinking laws, smart airbag technology, rear high-mounted brake lights, commercial drivers licenses, and improved school bus construction standards all came about at least partially due to NTSB recommendations.


This is a good discussion point: we could do this with car crashes, but I would guess that most causal factors can be summed up as "driver inattention".

How would you solve that? The FAA can issue all sorts of regulations requiring rest breaks, fatigue detection, etc. -- be sure you're ready for that level of intervention before you ask for this level of investigation.


Car crashes are actually the perfect example, because car fatalities aren't just a function of engineering a safe and sturdy car, and human inattention can't be eliminated but also isn't independent of design.

For instance, how many people ride in the car? What is the average car trip length? Why and when are people driving? Speed profiles, line of sight in both car and environment. Does traffic all flow one way? Can a car ever land in oncoming traffic if it loses control? Are traffic barriers designed to bring an out-of-control car safely to rest with good line of sight to the accident? Are features keeping drivers alert or distracting them?

The difference between lots of fender benders with a bit of panel beating and constant fatalities can come down to multiple causes, many outside of engineering a safe car.


"be sure you're ready for that level of intervention before you ask for this level of investigation."

There's no good reason not to, for example, have every car use a breathalyzer to start the engine -- those driving drunk on occasion would protect themselves from DUIs, lives would be saved, etc. Or to implement other, similar deep regulations. But the ideology around cars ("freedom"), vs. planes ("safety"), is too different to allow for such levels of intervention.


The good reason not to install a breathalyzer in every car is that I don't drive drunk and would vote against any politician foolish enough to suggest something so intrusive.


You don't drive drunk, but other people do. Breathalysers are only put in AFTER an offense. To eliminate the "first generation" of drunk driving accidents would require preemptive installation of breathalysers on every car.

The way they are implemented now would be similar to only installing seatbelts in cars where the driver has already been in an accident. A similar sentiment was around when seatbelt laws were put in place ("I'm a good driver"), but the seatbelt protects you when you yourself crash as well as when someone else crashes into you!

In the same way, breathalysers (ignition interlocks) for everyone wouldn't be "for you" as much as for everyone to prevent them from drinking and crashing into you.

I hadn't thought about widespread ignition interlocks, and obviously there would be a lot of public pushback, and it would be quite intrusive, but I could see it helping reduce traffic fatalities dramatically.


I don't care how much it would reduce traffic fatalities. I am absolutely opposed to this level of nanny state overreach. And I trust that enough other Americans agree with me to prevent this kind of nonsense from making it into law.

What I would support is a further reduction in the allowed BAC, and stricter enforcement of traffic laws in general.


If you don't drive drunk, what do you lose from starting the engine via breathalyzer, rather than via key turn? Why is one more intrusive than the other? Most governments already require you to carry insurance, have a license/training to drive, wear seat belts -- in what way is this different?


Cars could never be invented in today's legal environment.


About 80% to 90% of reports would state it was operator error on the part of one or more of the drivers involved. Cars are pretty safe when people aren't drooling, lead-footed maniacs.


Huh? 80-90%? More like 99.x%. Modern cars are extremely reliable, and even if they do fail (usually due to poor maintenance), that rarely causes a crash, since a disabled car can easily pull over to the side of the road, unlike aircraft.

Almost all crashes are ultimately caused by driver error. Doing FAA/NTSB-style investigation on them would be pointless, because that's exactly what they'd find: people were driving recklessly, inattentively, etc. And there's nothing that can be done about that, because we as a society absolutely refuse to have serious driver training and any standards for driver conduct. We can't even agree on whether the left lane is for passing or not! Try driving in Germany sometime and you'll find that what we accept from drivers here in the US is really quite awful.


To be fair, probably 80-90% of NTSB aviation investigations' final reports contain the words "pilot error".


The term "pilot error" is outdated, and suggests blame. "Human performance factors" is usual way to describe such mishaps.


That seems like a pretty good idea. Federally mandated breach investigation when more than X people are compromised. Guidelines are put in place that can be audited.


Among the many things required by the GDPR, CCPA, and US state-level data privacy regulations are:

1. Data breach disclosures
2. Reporting/public statements on whether the vulnerability or path that caused the breach has been fixed

Note, this might seem a little weird (like "of course we've fixed the breach!") but often data breaches aren't detected for _months_ after they occur. There are plenty of companies that didn't know they were breached until they were on HaveIBeenPwned.

It's not NTSB level thoroughness, but I think it's a healthy start in that direction.


No agency needed. Companies do this internally already... right?


Whether and how extensively they do is culturally driven within an organization. Regulations, regardless of who imposes them, are almost always necessary. The unfortunate thing is that so many regulations are formed without support from the consumers of those regulations, which is why they are often backwards or onerous.

Gun regulations are a great example of good intentions, bad implementation due in part to ignorance, and industry resistance/interference. Another is PCI, a self-regulation that's often too vague or too specific to be useful.


Honestly, that sounds like hell. I sure as hell didn't become a developer so I could get thrown in jail for buggy code. Building software is something no one really knows how to do, and we are still in the infancy of our field. I don't think it would be fair to impose the same standards that real engineers have on such a nascent and chaotic field.


The software industry has been a thing for over 60 years now, with the first commercial software company being founded in 1955 (https://en.wikipedia.org/wiki/Software_industry#History).

The first commercial aviation company was formed in 1909 (https://en.wikipedia.org/wiki/Airline#History), the FAA was founded in 1958 and the NTSB in 1967....

At some point the "we're in our infancy" stage has to end.


The guidelines for safe aviation have way lower Kolmogorov complexity than the guidelines for safe “everything that software can do”-iation. It’s actually close to a subset when you think about it.


Well, ideally it would end when we know what the fuck we are doing. The number of bugs I find in all sorts of software (with Apple being one of the worst offenders) is frankly embarrassing and a pox mark on the entire industry.

If planes were as reliable as software they would be the leading cause of death in the world.


I would recommend reading "Developing Safety-Critical Software: A Practical Guide for Aviation Software and DO-178C Compliance".


The software planes use is reliable.


We know how to build reliable software. When it really matters, because lives are on the line or because failures cost real money, it is done.

What we don’t know is how to write reliable software while also delivering an MVP for antsy VCs and shipping feature updates every week.

And that’s fine. Not all software needs extreme reliability. The problem is that a lot of software gets categorized as “doesn’t need it” when it really shouldn’t be, like all those systems holding sensitive personal information.


One could argue that it’s precisely because there are no standards that the field is so chaotic.

I’m not sure what the right balance is - the more standards and compliance there are, the more innovation suffers. On the other hand, without standards or legal responsibility, it makes sense for companies to play fast and loose with security, privacy, and stability in favor of features - a consumer is going to pick a better product based on UX, features, and cost, not based on security they have no way of judging.


>I sure as hell didn’t become a developer so I could get thrown in jail for buggy code...

You'd better stay away from any company that requires FDA approval. Believe it or not, every change is signed. By you. By the FDA compliance person. Etc etc. I'll give you two guesses as to how they decide who goes to prison if God forbid something in the software is shown to have caused a fatality? Or, worse, a series of fatalities?

That sort of regulation happens even in the software industry. It just depends on the purpose of the software.


I don't think developers get thrown in jail, even for a bug that kills multiple people. The one time I've heard of people getting thrown in jail it was upper management, for trying to game the FDA audit and approval process. They got led out the door in handcuffs by US Marshals.

Source: Worked in FDA-regulated software for six years, at two different places.


>I'll give you two guesses as to how they decide who goes to prison if God forbid something in the software is shown to have caused a fatality?

Would you mind just telling us instead? Thanks.


See IEC 62304:2006 (https://www.iso.org/standard/38421.html) and the related standards which govern the formal processes which medical device software must meet for FDA and CE approval (as well as other national standards).

The whole process from requirements, specifications, high and low level design, implementation, validation and verification and the rest of the lifecycle requires stringent oversight, including documentation and signoffs. The signatures on those documents have legal meaning and accountability for the engineers who did the analysis and review at each stage.


Look at this from an economics perspective: do people in these sorts of positions get paid more than comparable positions in other fields? I'd never heard of legal liability for software engineers. Has anyone ever been convicted or otherwise held accountable when there was a failure?


That’s really funny actually, I used to work as a developer for the FDA for their researchers. But since we were just publishing meaningless papers, I was probably safe.


Nobody is ever thrown in jail for an airplane failure analysis. Every human failure is considered a process failure that gets fixed by improving the processes or the machinery.


> Nobody is ever thrown in jail for an airplane failure analysis

...in the United States. I definitely recall watching multiple episodes of Air Disasters [1] where pilots who survived crashes that killed passengers in other countries went to jail for things that in the US would have at most resulted in a suspended license and demotion or termination from their airline.

For example, Air France Flight 296 in 1988. Another example is Gol Transportes Aéreos Flight 1907 in 2006 in Brazil.

[1] AKA Air Emergency, Mayday, or Air Crash Investigation, depending on what country you watch in and what channel it is on.


...and the accident investigation data and reports are expressly restricted from being used in legal proceedings (Chicago Convention, Annex 13) with the intention that the process is for preventing accidents, not for prosecuting scapegoats.


If the investigations of any particular mishap were to indicate that some particular person or persons had been criminally negligent, it would be no surprise to see those people prosecuted eventually. If another team of investigators has to write a different report, they will.


Not sure if I agree that no one really knows how to do it, but yeah, software is not as disciplined as something like civil engineering. And for most things, it doesn’t need to be. For others, their teams operate much more slowly and with more testing and QA - think of a basic web app developer versus someone writing software for Boeing aircraft. The jobs are completely different. For most web app developers, screwing up doesn’t endanger anyone’s life.


I mean, we can kind of make software, but it always has bugs, and even the biggest companies with (supposedly) the best developers consistently make products that don't work very well at all. Software quality looks really bad compared to, say, bridges or cars or airplanes.


Respectfully, I call bullshit on this.

We know some very good ways to build reliable software. It's just that most industries can't / don't want to pay the costs associated with doing so.

The don't-know / don't-want distinction is functionally irrelevant, but I have a minor tic about it because asserting the former is often used to avoid admitting to the latter.

Legacy IBM and ATT, for all their cluster%_$^!s, were amazing at engineering reliable systems (software and hardware). Maybe that's not practical nowadays, outside of heavily regulated industries like military / aerospace, because markets have greater numbers of competitors. But we do know how.

An accurate and modern truth would probably be "We don't know how to quickly build reliable software at low cost."


I mean, clearly, if we really wanted to, we could build all of our software with Coq and have it be formally verified. But we don't. The Curry-Howard isomorphism proves that code and math are equivalent, and since we can prove a bunch of stuff with math, we should also be able to prove that our software systems are bug free.

And so you're right, maybe my statement is a little bit of a simplification, but not by much. Proving that the Linux kernel has no bugs according to some specification would require an absolutely prohibitive number of man-hours. Formal verification is so expensive that only the most trivial systems could ever be proved.

This is pretty similar to saying that we don’t know how to build bug-free code. We need a paradigm change for things to get better, and I’m not sure it will happen or if it’s even possible.
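
(For a taste of what "proving" means here, two machine-checked toys, written in Lean rather than Coq; obviously nothing like the scale of a kernel:)

    -- A one-line, machine-checked proof that addition of naturals commutes.
    theorem my_add_comm (a b : Nat) : a + b = b + a := Nat.add_comm a b

    -- Proving a function meets its (tiny) specification: `double n` really is `n + n`.
    def double (n : Nat) : Nat := n + n

    theorem double_spec (n : Nat) : double n = n + n := rfl

The catch described above is exactly right: these proofs are one line each, and writing down what "correct" even means for something the size of Linux is the part nobody knows how to make cheap.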


I think of it in slightly different terms. Yes, we could build software that reliable. (Perhaps not totally bug free, but, say, with 1/100th of the bugs it currently has.) But that takes a lot more time and effort (say, 100 times as much, as a rough number). That means, because it would take 100 times as much effort to produce each piece of software, we'd only have 1/100th as much software.

Would that be a better world?

Sure, I hate it when my word processor crashes. On the other hand, I really like having a word processor at all. It has value for me, even in a somewhat buggy state, more value than a typewriter does. Would I give up having a word processor in order to have, say, a much less buggy OS? I don't think I would.

I hate the bugs. But the perfect may be the enemy of the "good enough to get real work done".


I look at that as the efficiency distinction.

It goes without saying that if we had more time, there would be fewer bugs.

But there are also things (tooling, automation, code development processes) that can decrease the amount of bugs without ballooning development time.

Things that decrease bugs_created per unit_of_development_time.

Automated test suites with good coverage, or KASAN [1, mentioned on here recently]. Code scanners, languages that prohibit unsafe development operations or guard them more closely, and automated fuzzing too.

[1] https://www.kernel.org/doc/html/v4.14/dev-tools/kasan.html
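
(To make the fuzzing point concrete, a minimal fuzz loop; the parser being poked at is a hypothetical stand-in, not anything from the thread:)

    import random

    def parse_age(s: str) -> int:
        # hypothetical function under test
        n = int(s.strip())
        if not 0 <= n <= 150:
            raise ValueError("age out of range")
        return n

    # Throw random junk at it and insist it either returns a valid result or
    # fails only in the documented way. Anything else (a crash, an out-of-range
    # age) is a bug that hand-written test cases would likely never have covered.
    random.seed(0)
    for _ in range(10_000):
        junk = bytes(random.randrange(256) for _ in range(random.randrange(6))).decode("latin-1")
        try:
            assert 0 <= parse_age(junk) <= 150
        except ValueError:
            pass   # documented failure mode: fine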


It is not about tools. It is about culture and process. Look at something like SQLite for a real example of living up to the requirements of aviation software: 100% branch coverage, extensive testing, multiple independent platforms in the test suite, etc.

You don’t need anything fancy – but you do need to understand your requirements and thoroughly simulate and verify the intended behavior.


Do you know that first hand, or is this a case that you can see the cracks in the software industry because you are part of it?

I worked as a mechanic years ago, and I feel the same way about cars as you (and I) do about software. Everything is duct taped together and it's a miracle any of it works as well as it does.

Safety regulations and tests are gamed to all hell, reliability almost always seems to come from an evolutionary method of "just keep using what works and change what doesn't" with hastily tacked on explanations as to why it works, and marketing is so divorced from reality that people assume significantly more is happening in the car than really is.

Now I don't know anything about civil engineering or aircraft, but I have a hunch it's not as well put together as it may seem at first glance.

I still think that software is behind those other fields, and I'm honestly still on the fence if regulations would significantly help or would just slow things down, but it's absolutely the "wild west" in software.


I'm not a "real" engineer, but many of my cousins are (civil engineers) and everything about a bridge, say, is modeled and proven ahead of time.

I work as a developer and every day I feel like this isn't the right way to do it. I'm not sure what the right way is, but being a software developer I probably have less faith in it than the general population. Something just feels wrong about it, how we do things.


Imagine building a bridge without a specification. It would be criminally irresponsible. It has been known for decades now that proper specification is absolutely necessary to produce correct software. This is practically a tautology, because there isn't a formally meaningful way to define "correct" besides "satisfies some specified behaviors." Good unit testing does this in an ad hoc way, because the tests serve as a (usually incomplete) specification of the desired behavior. That said, we must remember that testing can never prove the absence of bugs, only their presence. The only practicable ways to prove the absence of bugs in nontrivial programs are automatic formal verification or constructive proof using formally defined semantics.
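
(A tiny illustration of tests-as-incomplete-specification; `my_sort` is just a placeholder for whatever implementation is under test:)

    from collections import Counter

    my_sort = sorted   # placeholder implementation under test

    def check_sort_spec(xs):
        ys = my_sort(xs)
        # Two properties that partially *define* "correctly sorted":
        assert all(a <= b for a, b in zip(ys, ys[1:]))   # output is ordered
        assert Counter(ys) == Counter(xs)                # output is a permutation of the input

    for case in ([], [3, 1, 2], [5, 5, 1], list(range(10, 0, -1))):
        check_sort_spec(case)

Even this is incomplete in the sense described above: it says nothing about stability or performance, and it only ever checks the inputs we thought to generate.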

Of course for most commercial software correctness is so irrelevant that it's left undefined. All that matters is pleasantness, that is to say does it generally please the user, perhaps by attracting customers or investors.


Airplanes have software. It works very well without bugs. Because that is what is demanded in the aviation industry.


Look at the Equifax hack. The vulnerability was known, but no one bothered to apply the patch. In engineering, you choose two among better, faster, or cheaper.

It's clear in IT we always choose faster and cheaper. No one went to jail for the Equifax hack except that one low level manager for insider trading.


This industry has been around for 60 years, and has had the benefit for that entire time of the experience from the previous several hundred years of people developing what is modern engineering. It isn't really young or nascent anymore.

There are large sectors in this industry where rigorous engineering effort isn't really applicable. It seems to me, though, that this has effectively been used as an excuse to avoid any genuine rigor. It isn't helpful that so many programmers have such huge egos.


Then you shouldn’t be a developer?


Do you have a better way of doing things? Because you formally prove that the code you write has no bugs? Give me a break. You are as bad as the rest of us. Maybe you haven't realized how bad you are, but that just means you have some learning to do.


If you enjoy reading about risk and root cause analysis in general, try the venerable Usenet group comp.risks or read the digest web version at http://catless.ncl.ac.uk/risks/. It's one of those internet sinkholes that can eat hours of your time.


Check out The Design of Everyday Things by Don Norman too. He talks very highly of the NTSB and their accident investigation reports. The root cause isn’t usually one single error, but a series of mistakes that collectively caused the accident.


I love this newsletter, but they switched to sending out content wrapped in pre tags with baked-in newlines a few years ago, and fighting through that wasn't worth it.


Why does this matter? Screen readers or something else?


The newlines are baked into the rss feed too, which breaks my newsreader. I've talked to the maintainer but they didn't want to change the stack they use to produce it.


The safety cultures of agencies like the FAA, the NTSB, and international counterparts such as the EU's EASA are amazing, but not without their downsides.

Certification costs have been going up as a portion of overall development costs[1]; this leads to more expensive planes, less competition, and slower technological development.

There's some amount of safety that's counterproductive, since people will just opt for e.g. driving, which is much less safe, but the way these agencies are set up means they can't ever address that question. They're never going to declare a system unreasonably safe and expensive.

So we really should be careful about wishing that aviation-style safety be applied to other industries.

1. https://aviationweek.com/commercial-aviation/what-certificat...


The space industry already has a long-established tradition of doing similar failure investigations.


Yeah, I really don't get his comment there. Whenever there is a crash, the launch provider usually does not launch again until the RUD is fully investigated. The analysis is just as exhaustive as (if not much more so than) in the aero industry.


The space industry already does this. In fact sometimes the same people and agencies are involved. It can take years of investigation to return to flight after a serious accident of unknown cause.


I'd be interested in hearing about it applied to totally different industries, particularly software.


John Allspaw applied concepts from The Field Guide to Understanding Human Error to software post mortems. When I was at Etsy, he taught a class explaining this whole concept. We read the book and discussed concepts like the Fundamental Attribution Error.

I've found it very beneficial, and the concepts we learned have helped me in almost every aspect of understanding the complicated world we live in. I've taken these concepts to two other companies now, to great effect.

https://www.amazon.com/Field-Guide-Understanding-Human-Error...

https://codeascraft.com/2012/05/22/blameless-postmortems/

https://codeascraft.com/2016/11/17/debriefing-facilitation-g...

https://www.oreilly.com/library/view/velocity-conference-201...


So much fantastic reading I hadn't seen before in this discussion. Mucho thanks to all for sharing!


Incidentally, Amazon does this. Particularly, they've adopted into the culture the concept of "good intentions don't work, mechanisms do." A decent summary can be found at this link:

http://nickfoy.com/blog/2018/4/7/good-intentions-dont-work

For example, if an admin bungles a copy/paste shell command and it causes an outage, instead of punishing the admin for not wanting hard enough to do it correctly, they'll change the process so it doesn't rely on admins copy/pasting shell commands.


That's awesome. I'm guessing it's limited to software Amazon themselves develops and doesn't extend to software deployed to AWS?


How could it? AWS will happily run whatever code you want, using whatever good or bad development practices you care to choose.


Fair. More just wondering if they have any sort of process for investigating data breaches involving software hosted on their servers.


Look at the “shared responsibility model” - if someone got in through the host then I guess they’d investigate. If you leak your SSH private key it’s on you.


It’s not their software to investigate.


If you're into process safety, the CSB has a number of YouTube videos on disasters and their causes: https://www.youtube.com/user/USCSB


As someone working in the aircraft industry, I hope to see this approach with motor vehicles. I think every car should have a 'black box' as well as mandatory dash cams.


Strangely enough, dash cams are actually illegal in many countries (namely the EU).


Note that neither Fukushima nor Deepwater Horizon followed basic aircraft engineering principles: No single point of failure will cause the loss of the airplane.


Um, I thought Fukushima was not caused by any single point of failure either: it was caused by an earthquake and a tsunami happening together (admittedly, the earthquake caused the tsunami). The earthquake caused the reactor to automatically shut down, which would have been fine, except then they got hit with a tsunami that topped their seawall and flooded the basement, shutting down the emergency generator, which disabled the pumps which cooled the reactor.


The earthquake didn't cause any damage. The single point of failure was the seawall being breached. The rest was a zipper of each failure causing the next. The tragedy is this zipper could have been halted at many places.

For example, the hydrogen overpressure in the reactor (caused by previous failures) was vented through an overpressure valve into a pipe. The pipe exited into the enclosed reactor building, which led to an explosion.

The pipe should have been vented to the exterior. The cost of that would have been insignificant, like numerous other zipper-stops.

Deepwater Horizon was a big zipper, too, none of which would have been costly to prevent.


Also, the emergency generators should have been put on platforms so they wouldn't flood. Critical machinery should have been located far enough away from the reactor so they could be worked on without fear of radiation.

Etc.


I wish we did it in auto investigations.


Except for TWA 800



