
In addition, it looks like this system wasn't on any kind of 1%/10%/50%/100% rollout gating. Such a rollout would trivially have shown the poison input killing tasks.
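Sketching what that kind of percentage gate can look like (Python; the host ids and the hashing scheme are hypothetical, nothing Cloudflare-specific):

    import hashlib

    def in_rollout(host_id: str, stage_percent: int) -> bool:
        # Stable hash of the host id into a bucket in [0, 100).
        bucket = int(hashlib.sha256(host_id.encode()).hexdigest(), 16) % 100
        return bucket < stage_percent

    # 1% / 10% / 50% / 100% stages: only hosts inside the current stage
    # would consume the new config, so a poison input surfaces on a small
    # slice of the fleet before it goes everywhere.
    for stage in (1, 10, 50, 100):
        eligible = sum(in_rollout(f"host-{i}", stage) for i in range(10_000))
        print(f"stage {stage}%: {eligible} of 10000 hosts eligible")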


To me it reads like there was a gradual rollout of the faulty software responsible for generating the config files, but those files are generated on approximately one machine, then propagated across the whole network every 5 minutes.

> Bad data was only generated if the query ran on a part of the cluster which had been updated. As a result, every five minutes there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network.
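A toy model of that coin flip, with the updated fraction and the per-node behaviour assumed purely for illustration:

    import random

    UPDATED_FRACTION = 0.5  # assumed share of ClickHouse nodes already carrying the permissions change

    def generate_feature_file() -> str:
        # The generating query lands on some node; only nodes with the new
        # permissions return the extra rows that blow the size limit downstream.
        hit_updated_node = random.random() < UPDATED_FRACTION
        return "bad (over size limit)" if hit_updated_node else "good"

    # One generation + propagation cycle every five minutes: the output
    # flip-flops between good and bad, so the symptoms come and go.
    for cycle in range(6):
        print(f"t+{cycle * 5:>2} min: {generate_feature_file()} file propagated")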


Not a DBA, how do you do DB permission rollout gating?


It looks like the permissions change triggered creation of a new feature file, and ingestion of that file blew a size limit, which crashed the systems.
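Roughly what blowing a size limit at ingestion looks like, plus the validate-before-swap alternative; the cap of 200 and the function names are assumptions, not anything from the report:

    MAX_FEATURES = 200  # assumed hard cap baked into the consumer

    class FeatureFileTooLarge(Exception):
        pass

    def ingest(features: list[str]) -> list[str]:
        # A hard limit with no fallback turns one oversized file into a crash.
        if len(features) > MAX_FEATURES:
            raise FeatureFileTooLarge(f"{len(features)} features > {MAX_FEATURES}")
        return features

    def ingest_with_fallback(features: list[str], last_good: list[str]) -> list[str]:
        # Validate before swapping in: keep serving the last known-good file
        # instead of taking the whole process down.
        try:
            return ingest(features)
        except FeatureFileTooLarge:
            return last_good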

The file should be versioned and rollout of new versions should be staged.
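Something like this for the versioned/staged part, with all names hypothetical:

    from dataclasses import dataclass

    @dataclass
    class ConfigVersion:
        version: int
        payload: bytes

    def pick_version(versions: list[ConfigVersion],
                     canary_healthy: set[int],
                     consumer_is_canary: bool) -> ConfigVersion:
        # Canaries take the newest version; everyone else takes the newest
        # version the canary cohort has already reported healthy.
        if consumer_is_canary:
            return max(versions, key=lambda v: v.version)
        vetted = [v for v in versions if v.version in canary_healthy]
        if vetted:
            return max(vetted, key=lambda v: v.version)
        # Nothing vetted yet: stay on the longest-serving version.
        return min(versions, key=lambda v: v.version)

    # Example: v7 is new and unvetted, so non-canary consumers keep serving v6.
    versions = [ConfigVersion(6, b"v6"), ConfigVersion(7, b"v7")]
    print(pick_version(versions, canary_healthy={6}, consumer_is_canary=False).version)  # -> 6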

(There is definitely a trade-off; oftentimes in the security-critical path you want to go as fast as possible, because changes may be blocking a malicious actor. But if you move too fast, you break things. Here, they had a potential poison input in the pathway for synchronizing this state, and Murphy's Law suggests it was going to break eventually, so the question becomes "How much damage can we tolerate when it does?")


> It looks like the permissions change triggered creation of a new feature file, and ingestion of that file blew a size limit, which crashed the systems.

That feature file is generated every 5 minutes at all times; the permissions change was rolled out gradually across the ClickHouse cluster, so whether a given run produced a bad file depended on whether the query hit a part of the cluster that had already picked up the new permissions.



