
In addition, it looks like this system wasn't on any kind of 1%/10%/50%/100% rollout gating. Such a rollout would trivially have shown the poison input killing tasks.
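Sketching what that kind of percentage gate can look like (Python; the host ids and the hashing scheme are hypothetical, nothing Cloudflare-specific):

    import hashlib

    def in_rollout(host_id: str, stage_percent: int) -> bool:
        # Stable hash of the host id into a bucket in [0, 100).
        bucket = int(hashlib.sha256(host_id.encode()).hexdigest(), 16) % 100
        return bucket < stage_percent

    # 1% / 10% / 50% / 100% stages: only hosts inside the current stage
    # would consume the new config, so a poison input surfaces on a small
    # slice of the fleet before it goes everywhere.
    for stage in (1, 10, 50, 100):
        eligible = sum(in_rollout(f"host-{i}", stage) for i in range(10_000))
        print(f"stage {stage}%: {eligible} of 10000 hosts eligible")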


To me it reads like there was a gradual rollout of the faulty software responsible for generating the config files, but those files are generated on approximately one machine, then propagated across the whole network every 5 minutes.

> Bad data was only generated if the query ran on a part of the cluster which had been updated. As a result, every five minutes there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network.
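A toy model of that coin flip, with the updated fraction and the per-node behaviour assumed purely for illustration:

    import random

    UPDATED_FRACTION = 0.5  # assumed share of ClickHouse nodes already carrying the permissions change

    def generate_feature_file() -> str:
        # The generating query lands on some node; only nodes with the new
        # permissions return the extra rows that blow the size limit downstream.
        hit_updated_node = random.random() < UPDATED_FRACTION
        return "bad (over size limit)" if hit_updated_node else "good"

    # One generation + propagation cycle every five minutes: the output
    # flip-flops between good and bad, so the symptoms come and go.
    for cycle in range(6):
        print(f"t+{cycle * 5:>2} min: {generate_feature_file()} file propagated")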


Not a DBA, how do you do DB permission rollout gating?


It looks like the permissions change triggered creation of a new feature file, and ingestion of that file blew a size limit, which crashed the systems.
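Roughly what blowing a size limit at ingestion looks like, plus the validate-before-swap alternative; the cap of 200 and the function names are assumptions, not anything from the report:

    MAX_FEATURES = 200  # assumed hard cap baked into the consumer

    class FeatureFileTooLarge(Exception):
        pass

    def ingest(features: list[str]) -> list[str]:
        # A hard limit with no fallback turns one oversized file into a crash.
        if len(features) > MAX_FEATURES:
            raise FeatureFileTooLarge(f"{len(features)} features > {MAX_FEATURES}")
        return features

    def ingest_with_fallback(features: list[str], last_good: list[str]) -> list[str]:
        # Validate before swapping in: keep serving the last known-good file
        # instead of taking the whole process down.
        try:
            return ingest(features)
        except FeatureFileTooLarge:
            return last_good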

The file should be versioned and rollout of new versions should be staged.
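Something like this for the versioned/staged part, with all names hypothetical:

    from dataclasses import dataclass

    @dataclass
    class ConfigVersion:
        version: int
        payload: bytes

    def pick_version(versions: list[ConfigVersion],
                     canary_healthy: set[int],
                     consumer_is_canary: bool) -> ConfigVersion:
        # Canaries take the newest version; everyone else takes the newest
        # version the canary cohort has already reported healthy.
        if consumer_is_canary:
            return max(versions, key=lambda v: v.version)
        vetted = [v for v in versions if v.version in canary_healthy]
        if vetted:
            return max(vetted, key=lambda v: v.version)
        # Nothing vetted yet: stay on the longest-serving version.
        return min(versions, key=lambda v: v.version)

    # Example: v7 is new and unvetted, so non-canary consumers keep serving v6.
    versions = [ConfigVersion(6, b"v6"), ConfigVersion(7, b"v7")]
    print(pick_version(versions, canary_healthy={6}, consumer_is_canary=False).version)  # -> 6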

(There is definitely a trade-off; oftentimes in the security-critical path you want to go as fast as possible, because changes may be blocking a malicious actor. But if you move too fast, you break things. Here, they had a potential poison input in the pathway for synchronizing this state, and Murphy's Law suggests it was going to break eventually, so the question becomes "How much damage can we tolerate when it does?")


> It looks like the permissions change triggered creation of a new feature file, and ingestion of that file blew a size limit, which crashed the systems.

That feature file is generated every 5 minutes at all times; the permissions change was rolled out gradually across the ClickHouse cluster, so whether a given run produced a bad file depended on whether the query hit a part of the cluster that had already picked up the new permissions.



