The core framework (sched_ext) was written for general workloads and can be quite beneficial. It substantially lowers the cost of creating and iterating on schedulers.
Latency and cache coherency are the other things that make this hard. Cache coherency can theoretically be resolved by CXL, so maybe we’ll get there that way.
AI models do not need coherent memory, the access pattern is regular enough that you can make do with explicit barriers.
The bigger problem is that by the time PCIe 7.0 is actually available, 242GB/s per direction will probably not be sufficient for anything interesting.
> AI models do not need coherent memory, the access pattern is regular enough that you can make do with explicit barriers.
response to both this and the sibling: even for training, I remember some speculation that explicit synchronization might not even be needed, especially in the middle stages of training (past the early part, before fine tuning). It's not "correct", but gradient descent will eventually fix it anyway, as long as the error signal doesn't exceed the rate of gradient descent. And "error signal" here isn't just the noise itself but the error in the descent caused by incorrect sync - if the average delta per step is small, so is the error it introduces, right?
actually iirc there was some thought that the noise might help bounce the model out of local optima. Little bit of a Simulated Annealing idea there.
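fwiw this is roughly the Hogwild!-style asynchronous SGD idea: let workers read and write shared weights with no locks and treat the staleness as extra gradient noise. A minimal toy sketch (all names and constants here are mine, not anything from upthread), least-squares with four racing threads:

    import numpy as np
    import threading

    rng = np.random.default_rng(0)
    true_w = rng.normal(size=16)
    X = rng.normal(size=(4096, 16))
    y = X @ true_w

    w = np.zeros(16)   # shared weights: every worker reads/writes this with no lock
    lr = 0.01

    def worker(seed, steps=20000):
        local = np.random.default_rng(seed)
        for _ in range(steps):
            i = local.integers(len(X))
            grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5*(x.w - y)^2; w may be stale
            w[:] = w - lr * grad              # racy, unsynchronized update

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print("distance from true weights:", np.linalg.norm(w - true_w))  # ends up near zero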
> The bigger problem is that by the time PCIe 7.0 is actually available, 242GB/s per direction will probably not be sufficient for anything interesting.
yea it's this - SemiAnalysis/Dylan Patel actually has been doing some great pieces on this.
background:
networks really don't scale past about 8-16 nodes. 8 is a hypercube, that's easy. You can do 16 with ringbus or xy-grid arrangements (although I don't think xy-grid has proven satisfactory for anything except systolic arrays imo). But as you increase the number of nodes past 8, link count blows out tremendously, bisection bandwidth stagnates, worst-case distance blows out tremendously, etc. So you want tiers, and you want them to be composed of nodes that are as large as you can make them, because you can't scale the node count infinitely. https://www.cs.cmu.edu/afs/cs/academic/class/15418-s12/www/l...
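to put rough numbers on that blow-up (toy sketch of the hypercube case only, nothing vendor-specific): as you add dimensions, per-node link count and worst-case hop count keep growing, while bisection links per node stay flat at 1/2.

    # d-dimensional hypercube: 2^d nodes, one link per dimension per node
    print(f"{'nodes':>6} {'links/node':>11} {'total links':>12} {'max hops':>9} {'bisection links':>16}")
    for d in range(3, 8):
        nodes = 2 ** d
        total_links = nodes * d // 2       # each link shared by two nodes
        diameter = d                       # worst case: flip every address bit
        bisection = 2 ** (d - 1)           # links crossing a cut along one dimension (= nodes/2)
        print(f"{nodes:>6} {d:>11} {total_links:>12} {diameter:>9} {bisection:>16}")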
Past about 16 nodes you just go to a switched-fabric thing that's a crossbar or a butterfly tree or whatever - and it's still a big fat chip itself; the 2nd gen nvswitch (pascal era) was about as many transistors as a Xeon 1650v1. They are on gen 3 or 4 now, and they have three of these in a giant mixing network (butterfly or fat tree or something) for just titanic amounts of interconnect between 2 racks. I don't even want to know what the switching itself pulls; it's not 300kW but it's definitely not insignificant either.
any HPC interconnect really needs to be a meaningful fraction of memory bandwidth if you want to treat it like a "single system". Doesn't have to be 100%, but like, it needs to be 1/3 or 1/4 of normal memory BW at least. One of the theories around AMD's MCM patents was that the cache port and the interconnect port should be more or less the same thing - because you need to talk to the interconnect at pretty much the same rate you talk to cache. So a cache chiplet and an interconnect chiplet could be pretty much the same thing in silicon (I guess today we'd say they're both Infinity Link - which is not the same as Infinity Fabric BTW, again the key difference being coherency). But that kinda frames the discussion in terms of the requirements here.
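to make that ratio concrete (spec-sheet numbers I'm pulling in myself, so treat them as approximate): against roughly H100-class HBM bandwidth, NVLink lands near the 1/4 rule of thumb, while PCIe-attached links are an order of magnitude short.

    hbm_gb_s = 3350                                   # ~H100 SXM HBM3, GB/s (approximate)
    links_gb_s = {
        "PCIe 5.0 x16 (per direction)": 64,
        "PCIe 7.0 x16 (per direction)": 242,          # the figure quoted upthread
        "NVLink gen4 (H100, aggregate)": 900,
    }
    for name, bw in links_gb_s.items():
        print(f"{name:32s} {bw:4d} GB/s  ~ 1/{hbm_gb_s / bw:.0f} of local HBM bandwidth")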
anyway, to your point: pcie/cxl will never be as fast as ethernet because the signal integrity requirements are orders of magnitude tighter; pcie is a very short-range link etc. and requires a comparatively larger PHY to drive it than ethernet does for the same bandwidth.
ethernet serdes apparently have 3x the bandwidth of PCIe (and CXL) serdes per mm of beachfront, and GPU networking highly favors bandwidth above almost any other concern (and they utterly don't care about latency). The denser you make the bandwidth, the more links (or fatter links) you can fit on a given chip. And more links basically translates to larger networks, meaning more total capacity, better TCO, etc. Sort of a Gustafson's law thing.
(and it goes without saying that regardless, this all burns a tremendous amount of power. data movement is expensive.)
the shape this is taking is basically computronium. giant chips, massive interconnects between them. It's not that chiplet is a bad idea, but what's better than lots of little chiplets fused into a single processor? lots of big chiplets fused into a single processor.
And in fact that pattern gets repeated fractally. MI300X and B200 both take two big dies and fuse them together into what feels like a single GPU. Then you take a bunch of those GPUs and fuse those together into a local node via NVSwitch.
Standard HPC stuff... other than the density. They are thinking it might actually scale to at least 300 kW per rack... and efficiency actually improves when you do this (just like packaging!) because data movement is hideously expensive. You absolutely want to keep everything "local" (at every level) and talk over the interconnects as little as possible.
MLID interviewed an NVIDIA engineer after RDNA3 came out; iirc they more or less said "they looked at chiplets, they didn't think it was worth it yet, so they didn't do it; they're gonna do chiplets in their own way, and aren't going to be constrained or limited to chasing the approaches AMD uses". And my interpretation of that quote has always been - they see what they are doing as building a giant GPU out of tons of "chiplets", where each chiplet is an H100 board or whatever. NVLink is their Infinity Link, Ethernet is their Infinity Fabric/IFOP.
The idea of a processor as a bunch of disaggregated tiny chiplets is great for yields, but it's terrible for performance and efficiency. "tile" in the sense of having 64 tiles on your server processor is dead dead dead; tiles need to be decent-sized chunks of silicon on their own, because that reduces data movement a ton (and network node count etc). And while of course packaging lets you stack a couple dies... it also blows up power consumption in other areas, because if each chiplet is slower then you are moving more data around. The chip might be more efficient, but the system isn't as efficient as just building computronium out of big chunks.
it's been an obvious lesson from the start even with Ryzen, and RDNA3 should have really driven it home: data movement (cross-CCX/cross-CCD) is both performance-bottlenecking and power-intensive, so making the chiplets too small is a mistake. Navi 32 is barely a viable product etc, and that's without even hitting the prices that most people want to see from it. Driving the 7700XT down to $349 or $329 or whatever is really really tough (it'll get there during clearance but they can't do it during the prime of its life), and idle power/low-load power sucks. You want the chunks to be at least medium sized - and really as big as you can afford to make them. Yields get lower the bigger you get, of course, but does anybody care about yielded price right now? Frankly I am surprised NVIDIA isn't pursuing Cerebras-style wafer-scale right now.
again, lots of words to say: you want the nodes to be as fat as possible, because you probably only get 8 or 16 nodes per tier anyway. So the smaller you make the nodes, the less performance is available at each tier. And that means slower systems with more energy spent moving data. The analyst's claim (not NVIDIA's) is that water-cooled 300 kW racks would be more efficient than current systems.
(e: power consumption was 150-200 kW for the Cray-2, so NVIDIA's got a ways to go (currently 100 kW, rumored 200 kW) to even reach the peak of historical "make it bent so there's less data movement" style hyperdense designs. Tbh that makes me suspect the analyst is probably right: it's both possible and might well improve efficiency, but due to the data-movement factors this time, rather than latency. Ironic.)
> networks really don't scale past about 8-16 nodes. 8 is a hypercube, that's easy. You can do 16 with ringbus or xy-grid arrangements (although I don't think xy-grid has proven satisfactory for anything except systolic arrays imo). But as you increase the number of nodes past 8, link count blows out tremendously, bisection bandwidth stagnates, worst-case distance blows out tremendously, etc.
I don't think a cube/hypercube is optimal; the opposite corner is too many hops away.
For 8 nodes with 3 links each, you're better off crossing some of the links to bring the max distance down to 2. Even better is a Petersen graph, which can do 10 nodes at max distance 2.
If a proper 16-node hypercube is acceptable (4 links per node and most distances 2-3 hops), then a better arrangement lets you fit up to 41 nodes at max distance 3.
If you allow for 5 links or max distance 4 you can roughly double that node count, and if you allow both you can have over 200 nodes.
If you let the max distance climb, then it will seriously bloat your bandwidth requirements. But you can fit quite a lot of nodes at small max distances.
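a quick sanity check of those degree/diameter numbers for the 10-node case (just a toy BFS, obviously not a real fabric): build the Petersen graph and confirm every node reaches every other in at most 2 hops with only 3 links per node.

    from collections import deque

    edges = [(i, (i + 1) % 5) for i in range(5)]             # outer 5-cycle
    edges += [(i, i + 5) for i in range(5)]                  # spokes
    edges += [(i + 5, (i + 2) % 5 + 5) for i in range(5)]    # inner pentagram
    adj = {v: set() for v in range(10)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    def eccentricity(src):                                   # longest shortest path from src
        dist = {src: 0}
        q = deque([src])
        while q:
            v = q.popleft()
            for nb in adj[v]:
                if nb not in dist:
                    dist[nb] = dist[v] + 1
                    q.append(nb)
        return max(dist.values())

    assert all(len(adj[v]) == 3 for v in adj)                # 3 links per node
    print("max hops:", max(eccentricity(v) for v in adj))    # prints 2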
> anyway, to your point: pcie/cxl will never be as fast as ethernet because the signal integrity requirements are orders of magnitude tighter; pcie is a very short-range link etc. and requires a comparatively larger PHY to drive it than ethernet does for the same bandwidth.
Why does the short range link have much tighter requirements, and what would it take to sloppen them up?
As long as the GPU-local memory can hold a couple layers at a time, I don't think the latency to the currently-inactive layers matters very much, only the bandwidth.
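back-of-envelope (the model size and layer count are assumptions of mine; only the 242 GB/s figure comes from upthread): streaming one layer's weights is a few milliseconds of pure bandwidth, so as long as you can double-buffer a couple of layers, link latency is irrelevant.

    params = 70e9                 # hypothetical 70B-parameter model
    bytes_per_param = 2           # fp16 weights
    n_layers = 80                 # assumed layer count
    link_gb_s = 242               # PCIe 7.0 x16 per direction, figure from upthread

    layer_gb = params * bytes_per_param / n_layers / 1e9
    print(f"~{layer_gb:.2f} GB per layer, ~{layer_gb / link_gb_s * 1e3:.1f} ms to stream one layer")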
A professor actually told us that freeing prior to exit was harmful, because you may spend all of that time resurrecting swapped pages for no real benefit.
Counterpoint is that debugging leaks is ~hopeless unless you have the ability to prune “intentional leaks” at exit
> debugging leaks is ~hopeless unless you have the ability to prune “intentional leaks” at exit
Not in general. It depends on your debugger. For example, valgrind distinguishes between harmless "visible leaks" - memory blocks still reachable at exit through pointers in main's locals or in global variables - and "true leaks" that you cannot free anymore. The first kind just gets a simple warning from the leak detector, while the true leaks are actual errors.
I had to debug a program that did just that once, long ago, and the fix was to not free on exit. The program had been taking ~20m, and then one day it ran for hours and we never found out how long it would have taken. Fortunately it was a Tcl program, and the fix was to remove the `unset`s of large hash tables before exiting.
OpenAI's recruiting pitch was 5-10+ million/year in the form of equity. The structure of the grants is super weird by traditional big-company standards, but it was plausible enough that you could squint and call it the same. I'd posit that many of the people jumping to OpenAI are doing it for the cash and not the mission.
Yes, and we need tools to route around stupidity to keep checks and balances on stupidity. Otherwise locked-in stupidity can get even more stupid, and that's how suffering occurs at scale.
If you don't like the law, lobby politicians to change it.
You don't get to decide that speed limits are stupid and go 100. You don't get to decide that taxes are stupid and not pay them. You don't get to decide that rules against murder are stupid and ignore them.
> You don't get to decide what to put into your body and smoke weed. You don't get to make a choice about your pregnancy and have an abortion. You don't get to not be a chattel slave in 1840.
What I can't afford is to deal with regressive tax legislation [], while trying to get US currency on the black market [2] to save and to pay for imported goods, without a line of credit because the interest rate is ~190% and inflation is ~130% [3].
I do pay my taxes, and I also think high-net-worth individuals should be subject to taxation, but your rigid framework leaves out many people who use crypto as their only way to get paid and/or save money, because the traditional banking system is broken in their countries.
"amassing currency to purchase imported goods without a line of credit" is not a human right.
I'm sad that tariffs mean I can't bring in dirt cheap liquor from the Caribbean to sell for massive profits but that's not an excuse to not pay my taxes.
I disagree; it is if you need those goods as supplies for your work.
If I need syringes to work, national syringes cost 3 USD a pack (and they are out of stock), and I can get the same pack for 85 cents in Chile, then I'll exchange pesos crocante for dolar blue, cross the border to Chile, and get them there.
I wouldn't call it a human right, but the idea of amassing currency to buy goods seems pretty essential to modern life.
Although it's sad that tariffs make selling cheap liquor unprofitable in your country, I think we're thinking about different scales of income.
I had an internship at Fog Creek and would add that it was probably the most friendly and harmonious place I worked, which made it very reality-show-incompatible (and very 21-year-old-me incompatible, I wasn't asked back lol). Certainly the representation of you as an asshole was ridiculous IMO.
(Since you're answering arbitrary Fog Creek questions) In retrospect, do you think it was a mistake to make kiln hg-centric at first?
> (Since you're answering arbitrary Fog Creek questions) In retrospect, do you think it was a mistake to make kiln hg-centric at first?
No; I think it was a mistake to not also support Subversion out-of-the-box.
Our customers were overwhelmingly Windows shops, and Git on Windows in 2007 was just unusably bad. It really would not have been a viable option. (I did look at Bazaar and Fossil, which were good players on both Windows and Unix, but neither seemed like a good fit for other reasons.) But Kiln's core value prop at the beginning was actually code review, and I think we could've found a cool way to bring in a Phabricator-like patch workflow that would've meshed just fine with Subversion and given our customers a much easier way to get access to Kiln's goodness. In that world, Mercurial would be a kind of bonus feature you could use, not the only way into Kiln. The resulting product would've been very different, mind, but I think it would've gone way better.
The other three technical mistakes we made, since you didn't ask me, were having FogBugz target .NET instead of Java (only because of the immaturity of Mono at the time; I love .NET); having Wasabi compile to C# instead of IL (especially given the previous note); and having Copilot directly modify VNC and its protocol instead of just jacketing it with a small wrapper app. These three decisions collectively slowed the company down a ton at a time when we shouldn't have let ourselves do that.
I enjoyed working with you, Alex. Glad to see you doing well!
It's a bit harsh, but I always feel like Fog Creek might be the cautionary tale for "what happens if you over-hire for capability vs. your requirements?" I think that a less capable team would never have landed on the "let's maintain our own programming language" approach w.r.t. Wasabi.
As an aside, I do think that targeting Mono was the right thing to do for the universe, as it butterfly-effected tedu into writing weird and wonderful technical blog posts for the next ten years :p
> As an aside, I do think that targeting Mono was the right thing to do for the universe, as it butterfly-effected tedu into writing weird and wonderful technical blog posts for the next ten years :p
I've never figured out whether that work broke him or was simply his muse, but I also do confess to liking the result. So not a complete loss.
I am always happy when I look back on the monobugz days. I think it was a formative experience in evaluating claims like "well, of course it works, so many other people already use and depend on it." O RLY?
I still remember when I came to you with my “attachments shouldn’t live in the mssql database” plan and you said “yeah, probably, but doing it any other way would be a million times harder to maintain.” You were 100% right and I think of it often when I encounter someone who is about to do a similar dumb thing for “the right reasons.”
wait - context? why? i'm sure you're right, as obviously I wasn't there and don't have the clearly important context, but why was it 1000000x harder to maintain if attachments didn't live in mssql?
Not 100% sure of the rationale in this case. I imagine it might have to do with the fact that everyone who runs an instance needs to maintain an additional storage system, along with all the associated costs, which are not just storage alone.
Databases store stuff really well. If you get to the level of needing to configure storage for different tiers of access, they can do that; it just takes a bit of work. Of course, if your blob data is stored in tables that have OLTP data in them, then you have some work to do to separate it out.
This is speaking from recent experience of having to manage random blobs of sensitive data in s3 buckets that engineers have created rather than bothering to put in the main application data store.
The punchline, I suppose, is that we did switch from MSSQL to Mogile and, eventually, S3. But we still had the code to sometimes store attachments in the DB because that’s how we shipped complete backups to customers!
With no context, but much experience with databases...
Storing attachments as a blob in a database has all sorts of disadvantages I'm sure you're aware of, but it has the major advantage that if you can see the reference to the attachment, you can fetch the attachment. With links to a filesystem, you have to deal with issues like the frontends can't access the files because they're on the wrong system, or the network filesystem is down or .... There's a lot of possibilities.
I was at the strategic offsite where we decided to go with .NET. Java wasn’t installed on Windows by default.
The original version of Wasabi, known as Thistle, was written in Java, by the intern in the class before Aardvark’d. It transpiled ASP to PHP.
Every intern class was named after an animal with the next consecutive letter. I don’t remember any of them except Aardvark, and I was a “B????” intern!
From memory and a little grepping of the Weekly Kiwi archives, I found: "Project Null Terminator", Aardvark, B??, Caribou, Dingo, E??, Flying Fox, Giganotosaurus, ??
> Mark Zuckerberg justified the move in a Facebook post: “Open source drives innovation because it enables many more developers to build with new technology. It also improves safety and security because when software is open, more people can scrutinize it to identify and fix potential issues.” [0]
I don't work in the space so I can't comment much, but my team (Server Operating Systems and similar) has been working with Linux, systemd, CentOS, etc. for years for essentially the same reason. Working with the community forces you to build better stuff, because, in a healthy community, no one cares that "a Director from Meta wants it", which is the kind of thing that kills proprietary projects in the cradle all the time.
I don't use any of Meta's products, but I must say Meta and Mark Z's support for open source is commendable. They've released and supported some excellent open source models and frameworks.
To be honest, when they started working on it I don't think any of us expected it to be a source of collaboration with gaming companies :)
(I’m a middle manager at Meta)