Hacker News

It is possible that any number of things people on this thread have called out are, in fact, the right move for the system Cloudflare built (it's hard to know without knowing more about the system, and my intuition for their system is also faulty because I irrationally hate periodic batch systems like these).

Most of what I'm saying is:

(1) Looking at individual point failures and saying "if you'd just fixed that you wouldn't have had an incident" is counterproductive; like Mr. Oogie-Boogie, every big distributed system is made of bugs. In fact, that's true of literally every complex system, which is part of the subtext behind Cook[1].

(2) I think people are much too quick to key in on the word "config" and just assume that it's morally indistinguishable from source code, which is rarely true in large systems like this (might it have been here? I don't know.) So my eyes twitch like Louise Belcher's when people say "config? you should have had a staged rollout process!" Depends on what you're calling "config"!

[1] https://howcomplexsystems.fail/



I just want to point out a few things you may have overlooked. First, the bot config gets updated every 5 minutes, not every few seconds. Second, they already have config checks in other places ("Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input"). They could probably even align everything in CI/CD by running the config verifier where the configs are generated. This is all hindsight and blind guessing, of course, but you make it sound a bit arcane, as if nothing could be done.
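A minimal sketch of the suggestion above, assuming the generated config is a simple list of feature names: one verifier function is shared by the generating side (which refuses to publish a bad file) and the ingesting side (which treats each new file like untrusted input and keeps the last known good one). The file format, the `MAX_FEATURES` limit, and every function and class name here are hypothetical; the thread doesn't describe Cloudflare's actual format or limits.

```python
from dataclasses import dataclass, field

# Hypothetical limit for illustration; not taken from Cloudflare's system.
MAX_FEATURES = 200


class ConfigError(ValueError):
    """Raised when a generated config fails validation."""


def validate_feature_config(features: list[str], max_features: int = MAX_FEATURES) -> None:
    """Shared verifier: run it where configs are generated AND where they are ingested."""
    if not features:
        raise ConfigError("empty feature list")
    if len(features) > max_features:
        raise ConfigError(f"{len(features)} features exceeds limit of {max_features}")
    if len(set(features)) != len(features):
        raise ConfigError("duplicate feature names")


def publish(features: list[str], store: list[list[str]]) -> bool:
    """Generation side: refuse to publish a config that fails the verifier."""
    try:
        validate_feature_config(features)
    except ConfigError:
        return False
    store.append(list(features))
    return True


@dataclass
class Consumer:
    """Ingestion side: validate each new config and keep the last known good one."""
    active: list[str] = field(default_factory=list)
    rejected: int = 0

    def ingest(self, features: list[str]) -> None:
        try:
            validate_feature_config(features)
        except ConfigError:
            self.rejected += 1  # keep serving with the previous config
            return
        self.active = list(features)


# Usage: a hypothetical generator bug that emits every row twice.
store: list[list[str]] = []
good = [f"feature_{i}" for i in range(60)]
bad = good + good
print(publish(good, store), publish(bad, store))  # True False

consumer = Consumer()
consumer.ingest(good)
consumer.ingest(bad)  # rejected; consumer.active still holds the good list
```

In this sketch, sharing one function means the generator and the consumer can't disagree about what "valid" means, and the consumer's fallback is what turns a bad publish into a rejected update rather than an outage.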