It's the most plausible, fact-based guess, and it beats the competing theories.
Understaffing and absences would clearly lead to delayed incident response, but such obvious negligence and breach of contract would have been avoided by a responsible cloud provider, which would presumably keep adequate people on duty.
An exceptionally challenging problem is unlikely to account for so much fumbling: regardless of the complex mistakes behind it, a DNS misunderstanding doesn't have a particularly large "surface area" to diagnose, and it should be expeditiously resolvable by standard means (ordering clients to switch to a good DNS server and immediately use it to obtain good addresses) that AWS should have in place.
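For concreteness, here is a minimal sketch of what that "standard means" looks like from a client's side, assuming Python with dnspython; the resolver address and hostname are purely illustrative: point at a known-good nameserver and query it directly instead of relying on the system default.

```python
# Minimal sketch, assuming dnspython is installed (pip install dnspython).
# The resolver address and hostname below are illustrative assumptions.
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)  # ignore the system resolver config
resolver.nameservers = ["8.8.8.8"]                 # a hypothetical "known good" DNS server

# Query the alternate resolver directly and use whatever addresses it returns.
answer = resolver.resolve("dynamodb.us-east-1.amazonaws.com", "A")
for record in answer:
    print(record.address)
```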
AWS engineers being formerly competent but currently stupid, absent any organizational issues, might be explained by brain damage: "RTO" might have caused collective chronic poisoning, e.g. lead in the drinking water, but I doubt Amazon is that cheap.
> An exceptionally challenging problem is unlikely to account for so much fumbling: regardless of the complex mistakes behind it, a DNS misunderstanding doesn't have a particularly large "surface area" to diagnose, and it should be expeditiously resolvable by standard means (ordering clients to switch to a good DNS server and immediately use it to obtain good addresses) that AWS should have in place
You seem to be misunderstanding the nature of the issue.
The DNS records for DynamoDB's API disappeared. They resolve to a dynamic set of IPs that changes constantly.
A ton of AWS services that use DynamoDB could no longer do so. Hardcoding IPs wasn't an option. Nor could clients do anything on their side.
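To make that concrete, a quick illustration (Python with dnspython; nothing here is AWS-internal, the endpoint name is just the public hostname) of why hardcoding is a non-starter: repeated lookups return a rotating set of A records with short TTLs, so any baked-in list goes stale almost immediately.

```python
# Illustrative only: repeated lookups of a large service endpoint typically
# return a small, rotating subset of addresses with short TTLs.
import time
import dns.resolver  # pip install dnspython

HOST = "dynamodb.us-east-1.amazonaws.com"

for _ in range(3):
    answer = dns.resolver.resolve(HOST, "A")
    print(f"TTL={answer.rrset.ttl}s  addresses={sorted(r.address for r in answer)}")
    time.sleep(answer.rrset.ttl + 1)  # wait out the TTL before asking again
```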
> a DNS misunderstanding doesn't have a particularly large "surface area" to diagnose, and it should be expeditiously resolvable by standard means (ordering clients to switch to a good DNS server and immediately use it to obtain good addresses)
Did you consider that DNS might’ve been a symptom? If the DynamoDB DNS records are driven by health checks, switching DNS servers will not resolve the issue and might make it worse by directing an unusually high volume of traffic at static IPs with no autoscaling or fault recovery behind them.
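A rough conceptual sketch of that failure mode, in Python (this is not AWS's actual system, just the general shape of health-checked DNS): an automated publisher only advertises endpoints that pass a probe, so when the probes or the publisher misbehave, every resolver on the planet serves the same broken answer.

```python
# Conceptual sketch only -- not AWS's implementation. With health-checked DNS,
# the authoritative record set is derived from probe results, so a broken
# health-check pipeline empties the zone for every resolver at once.
from dataclasses import dataclass

@dataclass
class Endpoint:
    ip: str
    healthy: bool  # result of the last health probe

def records_to_publish(endpoints: list[Endpoint]) -> list[str]:
    """Only endpoints that currently pass their probe get an A record."""
    return [e.ip for e in endpoints if e.healthy]

# If the probes (or whatever feeds them) go wrong fleet-wide, the published
# answer collapses to nothing, and no client-side resolver switch can help.
fleet = [Endpoint("192.0.2.10", False), Endpoint("192.0.2.11", False)]
print(records_to_publish(fleet))  # -> [] : the records "disappear"
```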
The article describes evidence for a concrete, straightforward organizational decay pattern that can explain a large part of this miserable failure. What's "self-serving" about such a theory?
My personal "guess" is that failing to retain knowledge and talent is only one of many components of a well-rounded crisis of bad management and bad company culture that has been eroding Amazon on more fronts than just AWS reliability.
What's your theory? Conspiracy within Amazon? Formidable hostile hackers? Epic bad luck? Something even more movie-plot-like? Do you care about making sense of events in general?
We watched someone repeatedly shoot themselves in the foot a few months ago. It is indeed a guess that this is the cause of their current foot pain, but it is a rather safe one.