
Yep, a decent canary mechanism should have caught this. There's a trade-off between canarying and rollout speed, though. If this was a system for fighting bots, I'd expect it to be optimized for rollout speed.


I'm shocked that an automatic canary rollout wasn't an action item. Pushing anything out globally in one step all but guarantees another failure in the future.

Even if you want this data to be very fresh you can probably afford to do something like:

1. Push out data to a single location or some subset of servers.

2. Confirm that the data is loaded.

3. Wait to observe any issues. (Even a minute is probably enough to catch the most severe issues.)

4. Roll out globally.
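
Roughly, the four steps above could be sketched like this. It's a minimal sketch, not anyone's actual deployment system: `staged_rollout`, `push`, `is_loaded` and `is_healthy` are hypothetical names, and the callbacks stand in for whatever push and monitoring hooks really exist.

```python
import time
from typing import Callable, Sequence


def staged_rollout(
    locations: Sequence[str],
    push: Callable[[str], None],        # hypothetical: ships the data to one location
    is_loaded: Callable[[str], bool],   # hypothetical: did the location load the data?
    is_healthy: Callable[[str], bool],  # hypothetical: is the location serving normally?
    canary_count: int = 1,
    soak_seconds: float = 60.0,
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if every location received the data, False if the canary failed."""
    canaries = list(locations[:canary_count])
    rest = list(locations[canary_count:])

    # 1. Push out data to a single location or some subset of servers.
    for loc in canaries:
        push(loc)

    # 2. Confirm that the data is loaded.
    if not all(is_loaded(loc) for loc in canaries):
        return False

    # 3. Wait to observe any issues, polling health during the soak window.
    waited = 0.0
    while waited < soak_seconds:
        if not all(is_healthy(loc) for loc in canaries):
            return False
        sleep(poll_seconds)
        waited += poll_seconds
    if not all(is_healthy(loc) for loc in canaries):
        return False

    # 4. Roll out globally.
    for loc in rest:
        push(loc)
    return True
```

With the defaults this adds about a minute before the global push. A failed canary returns False before any other location is touched; rolling the canary itself back is left out of the sketch.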


Presumably optimal rollout speed entails something as close as you can get to "push it everywhere all at once and activate immediately". That's fine if you would rather risk short downtime than delay the rollout. What I don't understand is why the nodes don't have any independent verification and rollback mechanism. I might be underestimating the complexity, but it really doesn't sound much more involved than a process launching another process, concluding that it crashed, and restarting it with different parameters.
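
To make the "process launching another process" idea concrete, here is a minimal sketch of a per-node supervisor using Python's standard `subprocess` module. The function name `run_with_fallback`, the grace period, and the rule that any exit inside the grace period counts as a crash are all assumptions for illustration; a real node would need more than this (crash loops, partial failures, reporting back).

```python
import subprocess
from typing import Sequence, Tuple


def run_with_fallback(
    new_cmd: Sequence[str],         # command line that uses the freshly pushed parameters
    known_good_cmd: Sequence[str],  # command line that uses the last known-good parameters
    grace_seconds: float = 5.0,
) -> Tuple[subprocess.Popen, bool]:
    """Start new_cmd; if it exits within grace_seconds, start known_good_cmd instead.

    Returns (process, used_fallback).
    """
    proc = subprocess.Popen(new_cmd)
    try:
        # A long-running server should still be alive when the grace period ends.
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        return proc, False  # survived the grace period, keep the new parameters

    # The child exited early: treat it as a crash and roll back locally.
    fallback = subprocess.Popen(known_good_cmd)
    return fallback, True
```

The point is only that the rollback decision is made on the node itself, without waiting for a central system to notice the failure.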


I think they need to seriously evaluate whether they need this level of rollout speed. Even spending a few minutes with an automated canary gives you a ton of safety.

Even if the servers weren't crashing, it is possible that a bad set of parameters results in far too many false positives, which may as well be complete failure.
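
A canary check for that case could compare the canary's block rate against the rest of the fleet. This is only a sketch under assumptions: the function name, the thresholds, and the use of block rate as a stand-in for false positives (which can't be measured directly in real time) are all mine, not anything from the postmortem.

```python
def block_rate_regressed(
    baseline_blocked: int,
    baseline_total: int,
    canary_blocked: int,
    canary_total: int,
    max_ratio: float = 2.0,    # assumed: canary may block at most 2x the baseline rate
    min_rate: float = 0.001,   # assumed: block rates below this are never flagged
    min_samples: int = 100,    # assumed: minimum requests needed on each side
) -> bool:
    """Return True if the canary blocks suspiciously more traffic than the baseline."""
    if baseline_total < min_samples or canary_total < min_samples:
        raise ValueError("not enough traffic to compare block rates")
    baseline_rate = baseline_blocked / baseline_total
    canary_rate = canary_blocked / canary_total
    # Flag only when the canary exceeds both the relative and the absolute limit.
    return canary_rate > max(max_ratio * baseline_rate, min_rate)
```

A check like this would plug into the "wait to observe any issues" step, so a parameter set that blocks half of all traffic stops at the canary even though nothing crashed.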



