Finally that time of year again! I've been looking forward to this for a long time. I usually drop off about halfway anyways (finished day 13, 14 and 13 the previous 3 years), as that's when December gets too busy for me to enjoy it properly, so I personally don't mind the reduction in problems at all, really. I'm just happy we still have great puzzles to look forward to.
Exactly. The only way this could happen in the first place was _because_ they failed at so many levels. And as a result, more layers of Swiss cheese will be added, and holes in existing ones will be patched. This process is the reason flying is so safe, and the reason why Cloudflare will be a little bit more resilient tomorrow than it was yesterday.
A colleague of mine just came bursting through my office door in a panic, thinking he brought our site down since this happened just as he made some changes to our Cloudflare config. He was pretty relieved to see this post.
You joke, and I think it's funny, but as a junior engineer I would be quite proud if some small change I made was able to take down the mighty Cloudflare.
If I were Cloudflare, it would mean an immediate job offer well above market. That junior engineer is either a genius, so lucky that they must have been bred by Pierson's Puppeteers, or such a perfect manifestation of a human fuzzer that their skills must be utilized.
This reminds me of a friend I had in college. We were assigned to the same group coding an advanced calculator in C. This guy didn't know anything about programming (he was mostly focused on his side biz of selling collector sneakers), so we assigned him to do all the testing: his job was to come up with weird equations and weird but valid ways to present them to the calculator. And this dude somehow managed to crash almost all of our iterations except the last few. Really put the joke about a programmer, a tester, and a customer walking into a bar into perspective.
I love that he ended up making a very valuable contribution despite not knowing how to program -- other groups would have just been mad at him, had him do nothing, or had him do programming and gotten mad when it was crap or not finished.
I think the rate limits for Claude Code on the Web include VM time in general, not just LLM tokens. I have a desktop app with a full end-to-end testing suite which the agent would run every session, and that probably burned up quite a bit.
> If I were Cloudflare it would mean an immediate job offer well above market.
And not a lawsuit? Because I've read more about that kind of reaction than about job offers. Though I guess lawsuits are more likely to be controversial and talked about.
I kind of did that back in the day when they released Workers KV: I tried to bulk upload a lot of data and it brought the whole service down. Can confirm, I was proud :D
It's also not exactly the least common way that this sort of huge multi-tenant service goes down. It's only as rare as it is because more or less all of them have had such outages in the past and built generic defenses (e.g. automated testing of customer changes, gradual rollout, automatic rollback, there are others but those are the ones that don't require any further explanation).
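To make those defenses concrete, here's a minimal sketch of the gradual-rollout-with-automatic-rollback pattern, in Python. The stage sizes, error budget, and function names are all invented for illustration; this is not any provider's actual pipeline:

```python
# Illustrative only: stage sizes, thresholds, and callbacks are invented.
from dataclasses import dataclass
from typing import Callable

STAGES = [0.01, 0.05, 0.25, 1.0]  # fraction of the fleet on the new config
ERROR_BUDGET = 0.02               # abort if the error rate exceeds 2%


@dataclass
class RolloutResult:
    succeeded: bool
    reached_fraction: float


def gradual_rollout(apply_config: Callable[[float], None],
                    error_rate: Callable[[], float],
                    rollback: Callable[[], None]) -> RolloutResult:
    """Expand a change stage by stage; revert on the first bad signal."""
    deployed = 0.0
    for fraction in STAGES:
        apply_config(fraction)           # push the change to this slice
        deployed = fraction
        if error_rate() > ERROR_BUDGET:  # health check after each expansion
            rollback()                   # automatic rollback, no human in the loop
            return RolloutResult(False, deployed)
    return RolloutResult(True, deployed)


if __name__ == "__main__":
    # Toy demo: a config that starts failing once 25% of the fleet has it.
    state = {"fraction": 0.0}
    result = gradual_rollout(
        apply_config=lambda f: state.update(fraction=f),
        error_rate=lambda: 0.10 if state["fraction"] >= 0.25 else 0.0,
        rollback=lambda: state.update(fraction=0.0),
    )
    print(result)  # RolloutResult(succeeded=False, reached_fraction=0.25)
```

The point is structural: a bad customer config can only ever reach a small slice of the fleet before a health check catches it and reverts it.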
Well, it's easy to cause damage by messing up the `rm` command, especially with the `-fr` options. So don't take it as a proxy for some great skill being required to cause damage.
You could easily cause great damage to your Cloudflare setup, but CF has measures to prevent random customers deleting stuff from taking down the entire service globally. Unless you have admin access to the entire CF system, you can't really cause much damage with rm.
> You joke, and I think it's funny, but as a junior engineer I would be quite proud if some small change I made was able to take down the mighty Cloudflare.
I mean, with Cloudflare's recent (lack of) uptime, I would argue there's a degree of crashflation happening, such that there's less prestige in doing so. Nowadays, if a lawnmower drives by Cloudflare and backfires, that's enough to collapse the whole damn thing.
Are you actually so mind-numbingly ignorant that you think Rebecca Heineman had a brother named Bill, that you would rudely and incorrectly try to correct people who knew her story well, during a memorial discussion of her life and death?
Or were you purposefully going out of your way to perpetrate performative ignorance and transphobic bullying, just to let everyone know that you're a bigoted transphobic asshole?
I don't buy that it was an innocent mistake, given the context of the rest of the discussion, and your pretending to know her family better than the poster you were replying to and everyone else in the discussion, falsely denying her credit for her own work. Do you really think dang made the Hacker News header black because he and everyone else was confused and you were right?
Do you like to show up at funerals of people you don't know, just to interrupt the eulogy with insults, stuff pennies up your ass (as you claim to do), then shit and piss all over the coffin in front of their family and friends?
How long did you have to wait until she died before you had the courage to deadname, misgender, and punch down at her in a memorial, out of hate and cowardice and a perverse desire to show everyone what kind of a person you really are?
Next time, can you at least wait until after the funeral before committing your public abuse?
Posting abusive bigoted bullshit in a memorial thread is cuckoo crazy behavior. Calling it out and describing it isn't. You're confusing describing the abuse with committing the abuse. Direct your scorn at the person I'm criticizing, unless you agree with what they did, in which case my criticism also applies directly and personally to you, so no wonder you created a throwaway sock-puppet account just to attempt to defend your own bigotry and abuse.
It's also what caused the Azure Front Door global outage two weeks ago - https://aka.ms/air/YKYN-BWZ
"A specific sequence of customer configuration changes, performed across two different control plane build versions, resulted in incompatible customer configuration metadata being generated. These customer configuration changes themselves were valid and non-malicious – however they produced metadata that, when deployed to edge site servers, exposed a latent bug in the data plane. This incompatibility triggered a crash during asynchronous processing within the data plane service. This defect escaped detection due to a gap in our pre-production validation, since not all features are validated across different control plane build versions."
> May 12, we began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances.
I'd love to know more about what those specific circumstances were!
I'm pretty sure I crashed Gmail using something weird in its filters. It was a few years ago. Every time I did something specific (I don't remember what), it would freeze and then display a 502 error for a while.
What do you imagine would be the result if you brought down Cloudflare with a legitimate config update (i.e. not one specifically crafted to trigger known bugs) while not even working for them? If I were the customer "responsible" for this outage, I'd just be annoyed that their software is apparently so fragile.
I would be fine if it was my "fault", but I'm sure people in business would find a way to make me suffer.
But on a personal level, this is like ordering something at a restaurant and the cook burning down the kitchen because they forgot to take your pizza out of the oven or something.
I would be telling it to everyone over beers (but not my boss).
What's funny is that as I get older, this feeling of relief turns into more of a feeling of dread. The nice thing about problems that you cause is that you have considerable autonomy to fix them. When Cloudflare goes down, you're sitting and waiting for a third party to fix something.
Can’t speak for GP but ultimately I’d rather it be my fault or my company’s fault so I have something I can directly do for my customers who can’t use our software. The sense of dread isn’t about failure but feeling empathy for others who might not make payroll on time or whatever because my service that they rely on is down. And the second order effects, like some employee of a customer being unable to make rent or be forced to take out a short term loan or whatever. The fallout from something like this can have an unexpected human cost at times. Thankfully it’s Tuesday, not a critical payroll day for most employees.
But why does this case specifically matter? What if their system was down due to their WiFi or other layers beyond your software? Would you feel the same as well?
What about all the other systems and people suffering elsewhere in the world?
I don't understand what point you're trying to make. Are you suggesting that if I can't feel empathy for everybody at once, or in every one of their circumstances, that I should not feel anything at all for anyone? That's not how anything works. Life (or, as I believe, God) brings us into contact with all kinds of people experiencing different levels of joy and pain. It's natural to empathize with the people you're around, whatever they're feeling. Don't over-complicate it.
So you would rather be incompetent than powerless? Choice of third party vendor on client facing services is still on you, so maybe you prefer your incompetence be more direct and tangible?
Even so, you should have policies in place to mitigate such eventualities; that way you can focus the incompetence into systemic issues instead. The larger the company, the less acceptable these failures become. "Lessons learned" is a better excuse for a shake-and-break startup than for an established player that can pay to be secure.
At some point, the finger has to be pointed. Personally, I don't dread it pointing elsewhere. Just means I've done my due D and C.
If customers expected third-party downtime not to affect your thing, then you either shouldn't have picked a third-party provider or should have spent extra resources on not having a single point of failure. If they were happy choosing the third party, knowing they'd depend on it, then it was an accepted risk.
The problem is, I still get the wrong end of the stick when AWS or CF go down! Management doesn't care, understandably. They just want the money to keep coming in. It's hard to convince them that this is a pretty big problem. The only thing that will calm them down a bit is to tell them Twitter is also down. If that doesn't get them, I say ChatGPT is also down. Now NOBODY will get any work done! lol.
This is why you ALWAYS have a proposal ready. I literally had my ass saved by having tickets with reliability/redundancy work clearly laid out, complete with comments from out-of-touch product/people managers deprioritizing the work after attempts to pull it off the backlog (in one infamous case, in favor of a notoriously poorly conceived and expensive failure of a project that haunted us again with lost opportunity cost).
The hilarious part of the whole story is that the same PMs and product managers were (and I cannot overemphasize this enough) absolutely militant orthodox agile practitioners with Jira.
Every time a major cloud goes down, management asks why we don't have a backup service we can switch to. Then I tell them that a bunch of services worth a lot more than us are also down. Do you really want to spend the insane amount of resources it would take to make sure our service stays up when the global internet is down?
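For a sense of scale: even the cheapest form of such a fallback, client-side failover between two independent providers, is only the easy part. A minimal sketch in Python (the endpoints are invented):

```python
# Sketch only: the endpoints are invented, and real failover would also need
# health checks, state replication, and a warm standby.
import urllib.error
import urllib.request

PROVIDERS = [
    "https://primary.example.com",    # e.g. fronted by Cloudflare
    "https://secondary.example.org",  # independent provider
]


def fetch_with_failover(path: str) -> bytes:
    last_error = None
    for base in PROVIDERS:
        try:
            with urllib.request.urlopen(base + path, timeout=3) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc  # this provider is down; try the next one
    raise RuntimeError(f"all providers failed: {last_error!r}")


if __name__ == "__main__":
    print(fetch_with_failover("/health"))
```

What this glosses over (replicating state, keeping the secondary warm, DNS TTLs, paying for capacity you almost never use) is where the insane amount of resources actually goes.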
Who decided to go with AWS or CF? If it was a management decision, tell them you need the resources to build a fallback if they want their system to be more reliable than AWS or CF.
Haha yeah I just got off the phone and I said, look, either this gets fixed soon or there's going to be news headlines with photographs of giant queues of people milling around in airports.
When I'm debugging something, I'm not usually looking for the solution to the problem; I'm looking for sufficient evidence that I didn't cause the problem. Once I have that, the velocity at which I work slows down.
Maybe this isn’t great, but I get a hint of that feeling when I’m on an airplane and hear a baby crying. For a number of years, if I heard a baby crying, it was probably my baby and I had to deal with it. But now my kids are past that phase, so when I hear the crying, after that initial jolt of panic I realize that it isn’t my problem, and that does give me the warm fuzzies. Even though I do feel bad for the baby and their parents.
Related situation: you're at a family gathering and everyone has young kids running around. You hear a thump, and then some kid starts screaming. Conversation stops and every parent keenly listens to the screams to try and figure out whose kid just got hurt, then some other parent jumps up - it's not your kid! #phewphoria
Maybe "Erleichterung" (relief)? But as a German "Schadenserleichterung" (also: notice the "s" between both compound word parts) rather sounds like a reduction of damage (since "Erleichterung" also means mitigation or alleviation).
Right, I thought of that at first and discarded it for that reason. The real problem is that even in the usual "bit of German language" story of how Schadenfreude works, the component that it's other people's damage sparking the joy is missing from the word itself; that interpretation has to be known by the word's user. If you had just created the word and nobody in the world had heard it before, it would be pretty reasonable for people to think you'd coined a new word for masochism.
Not quite, that's more like taking pleasure in the misfortune of someone else. It's close, but the specific relief bit, that it is not _your_ misfortune, is not captured.
I woke up getting bombarded by messages from multiple clients about sites not working. I shat my pants because I'd changed the config just yesterday. When I saw the status message "Cloudflare down" I was so relieved.
Good that he worked it out so quickly. I recently spent a day debugging email problems on Railway PaaS, because they had silently closed an SMTP port without telling anyone.
You missed a great opportunity to dead-pan him with something like "No, Bob, not just our site, you brought down the entire Internet, look at this post!"
Not inherently, but I think LLM services (and maybe other AI based stuff) are corruptible in a much more dangerous way than the things our socioeconomic system has corrupted so far.
Having companies pay to end up on the top of the search engine pile is one thing, but being able to weave commerciality into what are effectively conversations between vulnerable users and an entity they trust is a whole other level of terrible.
I'm not so sure. We've commercialized medicine, housing, music, you name it. It is that process of corruption that is at issue here. If we had ended it 10 or 20 years ago we wouldn't have AI in the sense we have it now (and that would be a good thing).