
If CloudFlare is down, a significant portion of the Internet is down. Not that it's an excuse, but this isn't Microsoft or Apple. I'm sure budgets have to be weighed against the likelihood of any given thing going down. But by all means write a blog post and tell them what they're doing wrong and how you'd fix it. Maybe they'll hire you...


And you don't have to have the resources of Microsoft or Apple to plan and build for the eventuality that a provider becomes intermittent or unavailable. There are fundamental aspects of running an internet-facing service, and they failed at one of the most basic ones.


LOL ok, they "failed". They haven't had an outage like this in decades and this one only affected a small number of their clients. But sure, let's spend money on providing a backup for CF. Armchair QBs are the worst.


The issue isn't that they needed a backup to Cloudflare. The problem was that they only had a single internet provider at their datacenter, so they couldn't communicate with Cloudflare.

I've honestly never had a service with a single outbound path. Most datacenters where you rent colo have two or three providers as part of their network. In the cases where I've had to manage my own networking inside of a datacenter I always pick two providers in case one fails.

> Work is now underway to select a provider for a second transit connection directly into our servers — either via Megaport, or from a service with their own physical presence in 365’s New Jersey datacenter. Once we have this, we will be able to directly control our outbound traffic flow and route around any network with issues.

Having multiple transit options is High Availability 101 level stuff.
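
For anyone wondering what "route around any network with issues" looks like once you actually have two uplinks: the decision usually hangs off some kind of per-transit health check. Here's a rough sketch, assuming policy routing is already in place so that binding to a given source address forces traffic out a specific provider; the provider names, addresses, and probe targets are all made up for illustration.

    # Rough sketch of a dual-transit health check. Assumes policy routing is
    # already set up so binding to a given source IP sends traffic out a
    # specific provider. Provider names and addresses are invented examples.
    import socket

    TRANSITS = {
        "provider_a": "203.0.113.10",   # our address on transit A (example)
        "provider_b": "198.51.100.10",  # our address on transit B (example)
    }
    # A few well-known destinations to probe from each uplink.
    TARGETS = [("1.1.1.1", 443), ("8.8.8.8", 443), ("9.9.9.9", 443)]

    def reachable(src_ip, dst, timeout=2.0):
        """TCP connect to dst using a specific local source address."""
        try:
            with socket.create_connection(dst, timeout=timeout,
                                          source_address=(src_ip, 0)):
                return True
        except OSError:
            return False

    def check_uplinks():
        """Return the fraction of probe targets reachable over each transit."""
        scores = {}
        for name, src in TRANSITS.items():
            ok = sum(reachable(src, t) for t in TARGETS)
            scores[name] = ok / len(TARGETS)
        return scores

    if __name__ == "__main__":
        for name, score in check_uplinks().items():
            status = "healthy" if score == 1.0 else "degraded"
            print(f"{name}: {score:.0%} of targets reachable ({status})")

In a real deployment you'd feed that score into whatever drives your routing (adjusting BGP announcements, flipping a default route, paging someone) rather than just printing it.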


> The issue isn't that they needed a backup to cloudflare. The problem was they only have a single internet provider at their datacenter, so they couldn't communicate with Cloudflare.

That's not the issue. With Cloudflare Magic Transit, packets come in from Cloudflare and egress normally. They were able to get packets from Cloudflare, but egress wasn't working to all destinations. I couldn't reach them from my CenturyLink DSL in Seattle, but when I forced a new IP that happened to land in a different /24 (because I was seeing some other issues too), the Fastmail problems resolved, though the timing may be coincidental. Connecting via Verizon and T-Mobile, or from a rented server in Seattle, also worked.

It's kind of a shame they don't provide service over IPv6. If 5% of IPv4 paths failed and 5% of IPv6 paths failed, chances are good that the overall impact to users would be less than 5%, possibly much less, depending on exactly what the underlying issue was (which isn't disclosed): a physical link failure affects v4 and v6 traffic routed over it alike, but BGP announcement problems are often separate for the two.
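
To make the dual-stack point concrete: a client that tries both address families (roughly what Happy Eyeballs already does in browsers) only fails when both paths are broken at once. A quick probe sketch, with a placeholder hostname, nothing Fastmail-specific:

    # Quick dual-stack reachability probe. The hostname is a placeholder, not
    # a claim about anyone's actual DNS records.
    import socket

    def probe(host, port=443, timeout=3.0):
        """Try a TCP connect to every A and AAAA record for host."""
        results = []
        for fam, _, _, _, sockaddr in socket.getaddrinfo(host, port,
                                                         type=socket.SOCK_STREAM):
            s = socket.socket(fam, socket.SOCK_STREAM)
            s.settimeout(timeout)
            try:
                s.connect(sockaddr)
                results.append((fam, sockaddr[0], True))
            except OSError:
                results.append((fam, sockaddr[0], False))
            finally:
                s.close()
        return results

    if __name__ == "__main__":
        for fam, addr, ok in probe("www.example.com"):
            label = "IPv6" if fam == socket.AF_INET6 else "IPv4"
            print(f"{label} {addr}: {'reachable' if ok else 'unreachable'}")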


You’d be surprised at how many things break when different routes are chosen. Like etcd, MySQL, and so much more.


Those are generally on internal networks and rarely need to communicate with the internet. They shouldn't be affected by this.


Twould be nice…


Yet they still had the outage. I take exception to being called an 'armchair QB' when most of my career has been spent being called in to repair failures like this, provide postmortem advice to weather future ones, and fix the technical and cultural issues that give rise to exactly this type of thinking: oh, it won't happen to us because it has never happened to us.


In your experience, what kind of cost multiple is involved in remediation of the kinds of failure you deal with?

Is it x2 or x100 or somewhere in between?


Since you need two (or more) of everything: two switches, two physical links, and hopefully two physical racks or cabinets, it's a minimum of x2, but nowhere near x100. The cost of additional physical transit links is generally pretty reasonable, depending on the provider, and the more links and committed bandwidth you buy, the better rates you can negotiate.

There are a lot of aspects to it, but the cost of doing all of the above is a lot less than the cost of not having it when the failure finally hits and losing money that way. Each business needs to weigh that risk against how much they want to invest and how much downtime they think they can tolerate.
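
To put rough numbers on that tradeoff, it's just an expected-value comparison. Everything below is a placeholder figure, not anyone's real costs:

    # Back-of-envelope: expected cost of downtime vs. cost of redundancy.
    # Every number here is an invented placeholder; plug in your own.
    redundancy_cost_per_year    = 24_000.0   # second transit link, extra gear, etc.
    outage_probability_per_year = 0.5        # chance of at least one serious outage
    expected_outage_hours       = 6.0        # how long it lasts if it happens
    loss_per_hour               = 20_000.0   # lost revenue, SLA credits, churn

    expected_downtime_cost = (outage_probability_per_year
                              * expected_outage_hours
                              * loss_per_hour)

    print(f"Expected annual downtime cost: ${expected_downtime_cost:,.0f}")
    print(f"Annual cost of redundancy:     ${redundancy_cost_per_year:,.0f}")
    if expected_downtime_cost > redundancy_cost_per_year:
        print("Redundancy pays for itself on expectation.")
    else:
        print("Cheaper, on expectation, to accept the risk.")

The point isn't the specific numbers; it's that a few hours of outage at the wrong moment usually dwarfs the annual cost of a second uplink.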


Seems logical, thanks for engaging with the question.


You're essentially saying "They haven't had an outage yet, so they don't need redundancy". I hope you realize how bad of an idea that is.

Also calling them an armchair QB? Very mature. Their comment is more correct than yours.



