
That’s not accurate. As with any incident response, there were a number of theories about the cause that we were working in parallel. The feature file failure was identified as a potential cause within the first 30 minutes. However, the theory that initially seemed the most plausible, based on what we were seeing (intermittent failures, initially concentrated in the UK, a spike in errors for certain API endpoints) as well as what else we’d been dealing with (a botnet that had escalated DDoS attacks from 3Tbps to 30Tbps against us and others like Microsoft over the last 3 months), was that this was another attack. After an hour we ruled out the DDoS theory. We had other theories still running in parallel, but at that point the dominant theory was that the feature file was somehow corrupt. One thing that made us initially question that theory was that nothing in our changelogs seemed like it would have caused the feature file to grow in size. It was only after the incident that we realized the database permissions change had caused it, but that was far from obvious.

Even after we identified the problem with the feature file, we did not have an automated process to roll the feature file back to a known-safe previous version. So we had to shut down the reissuance and manually insert a file into the queue. Figuring out how to do that took time and meant waking people up, as there are lots of security safeguards in place to prevent an individual from easily doing that. We also needed to double-check that we wouldn’t make things worse. Propagation then takes some time, especially because there are tiers of caching of the file that we had to clear. Finally, we chose to restart the FL2 processes on all the machines that make up our fleet to ensure they all loaded the corrected file as quickly as possible. That’s a lot of processes on a lot of machines.

So I think the best description is that it took us an hour for the team to coalesce on the feature file being the cause, and then another two to get the fix rolled out.
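(For readers wondering what an automated "roll back to last-known-good" step could look like in general, here is a minimal sketch. The file names, the MAX_FEATURES limit, and the validate/publish functions are assumptions for illustration only, not Cloudflare’s actual pipeline.)

    # Hypothetical pre-publish sanity check with automatic fallback to a
    # last-known-good copy. Names and limits are illustrative only.
    import json
    import shutil
    from pathlib import Path

    MAX_FEATURES = 200          # assumed hard limit the consumers can load
    CANDIDATE = Path("feature_file.json")
    LAST_GOOD = Path("feature_file.last_good.json")

    def validate(path: Path) -> bool:
        """Reject files that are malformed or larger than consumers expect."""
        try:
            features = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            return False
        return isinstance(features, list) and len(features) <= MAX_FEATURES

    def publish() -> Path:
        """Publish the candidate if it passes checks, else keep serving last-known-good."""
        if validate(CANDIDATE):
            shutil.copy(CANDIDATE, LAST_GOOD)   # promote to new last-known-good
            return CANDIDATE
        return LAST_GOOD                        # automatic rollback path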


Thank you for the clarification and insight; with that context it makes more sense to me. Is there anything you think could be done to identify issues like this more quickly in the future?


Any "limits" on system should be alerted... like at 70% or 80% threshold.. it might be worth it for a SRE to revisit the system limits and ensuring threshold based alerting around it..



