keypusher · 2025-11-19T03:37:34 1763523454

The most surprising thing to me here is that it took 3 hours to root cause, which points to a glaring hole in the platform's observability. Even taking into account the fact that the service was failing intermittently at first, it still took 1.5 hours after it started failing consistently to root cause. But the service was crashing on startup. If a core service is throwing a panic at startup like that, it should be raising alerts or at least be easily findable via log aggregation. It seems like maybe there was some significant time lost in assuming it was an attack, but it also seems strange to me that nobody was asking "what just changed?", which is usually the first question I ask during an incident.

eastdakota · 2025-11-19T12:58:50 1763557130

That’s not accurate. As with any incident response, there were a number of theories of the cause that we were working in parallel. The feature file failure was identified as a potential cause in the first 30 minutes. However, the theory that seemed most plausible was a DDoS attack, based on what we were seeing (intermittent, initially concentrated in the UK, spike in errors for certain API endpoints) as well as what else we’d been dealing with (a botnet that had escalated DDoS attacks from 3Tbps to 30Tbps against us and others like Microsoft over the last 3 months). After an hour we ruled out the DDoS theory. We had other theories also running in parallel, but at that point the dominant theory was that the feature file was somehow corrupt. One thing that made us initially question the theory was that nothing in our changelogs seemed like it would have caused the feature file to grow in size. It was only after the incident that we realized the database permissions change had caused it, but that was far from obvious.

Even after we identified the problem with the feature file, we did not have an automated process to roll the feature file back to a known-safe previous version. So we had to shut down the reissuance and manually insert a file into the queue. Figuring out how to do that took time and meant waking people up, as there are lots of security safeguards in place to prevent an individual from easily doing that. We also needed to double check we wouldn’t make things worse. The propagation then takes some time, especially because there are tiers of caching of the file that we had to clear. Finally, we chose to restart the FL2 processes on all the machines that make up our fleet to ensure they all loaded the corrected file as quickly as possible. That’s a lot of processes on a lot of machines.

So I think the best description is that it took an hour for the team to coalesce on the feature file being the cause and then another two to get the fix rolled out.

keypusher · 2025-11-19T15:03:11 1763564591

Thank you for the clarification and insight; with that context it does make more sense to me. Is there anything you think can be done to improve the ability to identify issues like this more quickly in the future?

QREguy · 2025-11-19T16:02:31 1763568151

Any "limits" on a system should have alerts, e.g. at a 70% or 80% threshold. It might be worth having an SRE revisit the system limits and ensure there is threshold-based alerting around them.
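A minimal sketch of what threshold-based alerting on a hard limit could look like. The `LimitCheck` type, the `alert_level` function, and the metric name are hypothetical and chosen only for illustration; the 70% and 80% thresholds are the ones suggested above, not values from any real system.

```python
from dataclasses import dataclass


@dataclass
class LimitCheck:
    name: str     # hypothetical metric name, e.g. "feature_count"
    current: int  # current usage
    limit: int    # hard limit at which the system would fail


def alert_level(check: LimitCheck, warn: float = 0.7, critical: float = 0.8) -> str:
    """Return 'ok', 'warn', or 'critical' from usage as a fraction of the limit."""
    if check.limit <= 0:
        raise ValueError("limit must be positive")
    usage = check.current / check.limit
    if usage >= critical:
        return "critical"
    if usage >= warn:
        return "warn"
    return "ok"
```

The point is that the alert fires while there is still headroom, instead of the first signal being a crash when the limit is actually hit.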

Fiadliel · 2025-11-19T10:22:26 1763547746

If one actually looks at the current pingora API, it has limited ability to initialize async components at startup - the current pattern seems to be to lazily initialize on first call. An obvious downside of this is that a service can start up in a broken state, e.g. https://github.com/cloudflare/pingora/issues/169

I can imagine that this could easily lead to less visibility into issues.
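To illustrate the difference in general terms, here is a rough sketch in Python (not pingora's API; `FeatureStore`, `start_service`, and the loader are all hypothetical names). With eager initialization a bad config fails at startup, where it is easy to see; with lazy initialization the service reports itself as up and the failure only appears on the first request.

```python
class FeatureStore:
    """Hypothetical component that must load a config before it can serve."""

    def __init__(self, loader):
        self._loader = loader
        self._data = None

    def init(self):
        # Eager path: called at startup, so a bad config fails before serving.
        self._data = self._loader()

    def get(self):
        # Lazy path: the first caller triggers the load, so a bad config
        # only surfaces once traffic arrives.
        if self._data is None:
            self._data = self._loader()
        return self._data


def start_service(store, eager):
    """Return a status string describing how startup went."""
    if eager:
        try:
            store.init()
        except Exception as exc:
            return f"startup failed: {exc}"
    return "serving"
```

In the lazy case `start_service` returns "serving" even when the loader is broken, which is the "started in a broken state" situation described above.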

keypusher · 2025-09-30T10:31:56 1759228316

You might want to check out Old World. It was created by Soren Johnson, lead designer on Civ4, and shares many similarities to that era of Civ while bringing in some new ideas as well.

ACS_Solver · 2025-09-30T11:03:58 1759230238

Currently 90% off on Steam (disclaimer, I'm on the OW team).

keypusher · on Dec 15, 2024

> If I'm clear on the requirements & everyone else is clear on what I'm delivering.

That's a big 'if', and usually isn't possible without prototyping in my experience. What you're describing seems like something that would be written after a prototype is already done. Presenting a prototype (or iterating on multiple prototypes) is a better way to tease out unknowns than any document.

bccdee · on Dec 16, 2024

It's usually not possible without a technical analysis, is my point. A throw-away prototype takes much longer to build and is much harder to review and discuss with stakeholders than a document. Like, how are you going to build a prototype demonstrating design trade-offs—are you just going to build every possible version? Are you really going to make a product manager read a 2000-line PR instead of three pages of bullet points?

Presenting a prototype only really makes sense for UI-centric things anyway, and even then, there's a million ways to make a UI mock-up that don't involve functioning code.

keypusher · on Nov 10, 2024

not exactly an “education giant”. a company that used to sell schoolbook cheating assistance, now being replaced by free chatgpt

nfw2 · on Nov 10, 2024

I agree the "education" label is questionable, but it had a market cap of nearly 15 billion at its peak.

readyplayernull · on Nov 10, 2024

Dot-com bubble vibes

keypusher · on May 13, 2024

May want to look at Just. It is heavily inspired by Make and shares much of the same syntax, but removes a lot of the workarounds necessary to use Make as a task runner and adds a few other features.

https://github.com/casey/just

nrclark · on May 13, 2024

If Just grew to have a large enough feature-set that it could replace GNU Make, it would collect an equal amount of warts and sharp-edges along the way.

It's a little more user-friendly, but not enough to justify how much power you lose.

I spent the last 20 minutes reading Just's documentation, and it seems like the only real wins are nicer argument parsing, && dependencies, and the --list argument. And in exchange for that, you lose file-based dependencies and Make's powerful templating engine. Did I miss some other features that make the trade worth it?

zokier · on May 13, 2024

For lots of uses not having file-based dependency mechanism is a benefit, not a con. Sometimes less is more

nrclark · on May 13, 2024

For sure. But you also don't have to use file-based deps if you don't want. Make doesn't require it in any way. You can mix-and-match.

rgoulter · on May 13, 2024

File dependencies are useful as part of a build system.

But often tasks are a step removed from that. Targets like "all" and "clean" are good examples of useful tasks to have, even if they're unrelated to files.

A "task runner" isn't intended to replace all of a Makefile's functionality. It takes a common use case of Makefiles, and improves the user experience for this.

I think it's fair to say that it's not a big step up; but it's a definite improvement for what it does aim to do.

nrclark · on May 13, 2024

Targets don't need to produce files in Make (see: your examples of 'all' and 'clean'). And non-file targets can depend on other non-file targets. Make does this out of the box, by default.

But if you ever get to a point where you _want_ to express some source-code dependencies (say: pip-install depends on requirements.txt), you can also do that in Make. Just doesn't provide any mechanism for it.
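For readers unfamiliar with what a file-based dependency buys you, here is a rough Python sketch of the basic staleness rule Make applies: a target is rebuilt if it is missing or older than any of its prerequisites. The function name `needs_rebuild` and the stamp-file naming in the usage note are hypothetical; real Make also handles missing prerequisites by trying to build them, which this sketch does not do.

```python
import os


def needs_rebuild(target: str, dependencies: list[str]) -> bool:
    """True if target is missing or older than any dependency.

    Assumes every dependency exists; os.path.getmtime raises
    FileNotFoundError otherwise.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
```

For the pip-install example, the target could be a stamp file touched after a successful install, with `requirements.txt` as its only dependency: `needs_rebuild(".pip-install.stamp", ["requirements.txt"])` would then be true only when the requirements changed since the last install.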

bigstrat2003 · on May 13, 2024

I don't want file-based dependencies. I don't understand why anyone would, frankly. Just is an upgrade in that respect, not a downgrade.

nrclark · on May 13, 2024

If you don't want them, don't use them. Make doesn't need you to produce actual files called 'clean', 'build', etc. You don't even have to do anything special unless your build folder has actual files called those things (in which case, you could declare them as phonies).

From Make's point of view, they're just target names. Make doesn't care whether or not a target produces any files. Targets can satisfy dependencies for other targets, as can files on the filesystem.

taeric · on May 13, 2024

I'm intrigued. Without the file based dependency stuff, is it all incumbent on you to add the logic checking and stating if something has been done?

kstrauser · on May 13, 2024

There are so many things where that’s not relevant. “Just start” or “just test” or “just install” are idempotent, or at least stateless (assuming “start” calls out to systemctl or such).

rgoulter · on May 13, 2024

> ...idempotent, or at least stateless...

Even for a Makefile, "start", "test", "install" are likely to be phony (and not literally refer to files with those names).

The execution of a "start" task defers the role of "has it already been done" to other places.

One way of describing "task runner" is "nicer UX for .phony targets".

kstrauser · on May 13, 2024

That's all true, but understates the importance of a nicer UX. I've spent lots of time over the years working around Makefiles' idiosyncrasies. Make is so good at its intended purpose that it's tempting to use it for vaguely related tasks. Turns out it's not as good for doing those things it was never designed for. "Just" reimagines what make could look like if it were designed to be a task runner instead. That hits a sweet spot for many of us. It looks like the Makefiles we're used to, minus the pains in the neck we've tolerated from abusing make for our own purposes.

mitjafelicijan · on May 14, 2024

Just is amazing. I use it quite a lot. It is, however, sometimes difficult to convince others to use it, even though it is a no-brainer.

keypusher · on Feb 6, 2024

Oculus has sold 20+ million headsets. But it’s a very different headset, gaming focused.

https://www.roadtovr.com/quest-sales-20-million-retention-st...

sxp · on Feb 6, 2024

That was 20m of their 7th-9th headsets (after DK1, DK2, Rift, Rift S, GearVR, Go). Other than the Quest 2, I don't think any VR headset has sold at the rate the AVP is currently selling at.

keypusher · on Jan 8, 2024

I kept reading expecting to eventually get to some data, but there never was any. Just the same opinion restated over and over.

keypusher · on Dec 20, 2023

My guess is that someone at MS was testing Windows Updates or other changes from a local source. They also had some other DNS updates in their config they were testing. They took all of their config and pushed it out, when they should only have taken the other changes.

fsckboy · on Dec 20, 2023

I hope that's not their workflow, but few people I've ever worked with have known how to create operating procedures that had half a chance of succeeding. Usually it's based on "oh, don't worry, I would never do that".

keypusher · on Nov 14, 2023

That’s not how it works. As mentioned in the article, this limited privilege is auctioned off to the highest bidder.

fuzzfactor · on Nov 14, 2023

Exactly, IOW ticket scalping those rare VIP experiences.

keypusher · on Oct 20, 2023