kilotaras's comments | Hacker News

IIUC grandparent is in Norway as he quoted the price in NOK.


Oh good catch. Somehow my eyes skipped over NOK and just saw the $2.3k.

... Why does someone in Norway need a split AC unit? I don't think they're very efficient for heating.

EDIT: I researched this and it seems the type designed for heating in sub-zero temperatures is much more expensive. So maybe that explains it?


Similar to Finland. For quite a lot of the year, like right now, the outside temperature is above zero but below, say, 15C, so heat pumps are in a very good efficiency range.


Alibaba Cloud claims to reduce Nvidia GPUs used for serving unpopular models by 82% (emphasis mine)

> 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found

Instead of 1192 GPUs they now use 213 for serving those requests.


I’m slightly confused as to how all this works. Do the GPUs just sit there with the models on them when the models are not in use?

I guess I’d assumed this sort of thing would be allocated dynamically. Of course, there’s a benefit to minimizing the number of times you load a model. But surely if a GPU+model is idle for more than a couple minutes it could be freed?

(I’m not an AI guy, though—actually I’m used to asking SLURM for new nodes with every run I do!)


Loading a model takes at least a few seconds, usually more, depending on model size, disk / network speed and a bunch of other factors.

If you're using an efficient inference engine like VLLM, you're adding compilation into the mix, and not all of that is fully cached yet.

If that kind of latency isn't acceptable to you, you have to keep the models loaded.

This (along with batching) is why large local models are a dumb and wasteful idea if you're not serving them at enterprise scale.


Let's say, then, that it's not so much "dumb and wasteful" as "energy inefficient". In fact, this can be quite wise in a modern world full of surveillance-as-a-business and "us-east-1 disasters".


Can you elaborate on the last statement? I don't quite understand why loading a local LLM into GPU RAM, using it for the job and then "ejecting" it is a "dumb and wasteful" idea.


Layman understanding:

Because as a function of hardware and electricity costs, a “cloud” GPU will be many times more efficient per output token. You aren’t loading/offloading models and don’t have any parts of the GPU waiting for input. Everything is fully saturated always.


I believe GP meant it to still be connected to 'if this kind of latency is unacceptable to you' - i.e. you can't load/use/unload, you have to keep it in RAM all the time.

In that case it massively increases your memory requirement, not just to the peak the model needs, but to that plus whatever the other biggest use might be that'll inherently be concurrent with it.


> This (along with batching) is why large local models are a dumb and wasteful idea if you're not serving them at enterprise scale.

Local models are never a dumb idea. The only time it's dumb to use them in an enterprise is if the infra is a Mac Studio with an M3 Ultra, because prompt processing (pp) time is terrible.


Models take a lot of VRAM which is tightly coupled to the GPU so yeah, it's basically sitting there with the model waiting for use. I'm sure they probably do idle out but a few minutes of idle time is a lot of waste--possibly the full 82% mentioned. In this case they optimized by letting the GPUs load multiple models and sharing the load out by token.


They definitely won't idle out: if they did, it could take up to 60 seconds or so to load the model back into VRAM, depending on the model.

That's an eternity for a request. I highly doubt they will timeout any model they serve.


> That's an eternity for a request. I highly doubt they will timeout any model they serve.

That's what easing functions are for.

Let's say 10 GPUs are in use. You keep another 3 with the model loaded. If demand increases slowly you slowly increase your headroom. If demand increases rapidly, you also increase rapidly.

The correct way to do this is more complicated and you should model based on your usage history, but if you have sufficient headroom then very few should be left idle. Remember that these models do requests in batches.
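
A minimal sketch of that kind of headroom policy (made-up numbers and function names, nothing from the paper):

  def target_replicas(in_use: int, prev_in_use: int,
                      min_headroom: int = 1, floor: int = 1) -> int:
      growth = max(0, in_use - prev_in_use)   # how fast demand is rising
      headroom = min_headroom + growth        # grow headroom faster when demand ramps
      return max(floor, in_use + headroom)

  # Example: 10 GPUs busy, up from 8 last tick -> keep 13 loaded instead of 10.
  print(target_replicas(in_use=10, prev_in_use=8))   # 13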

If they don't timeout models, they're throwing money down the drain. Though that wouldn't be uncommon.


That's only if you're expecting 10 GPUs in use. They're dealing with ~1 GPU in use for a model, just sitting there. Alibaba has a very long tail of old models that barely anyone uses anymore, and yet they still serve.

Here's a quote from the paper above:

> Given a list of M models to be served, our goal is to minimize the number of GPU instances N required to meet the SLOs for all models through auto-scaling, thus maximizing resource usage. The strawman strategy, i.e., no auto-scaling at all, reserves at least one dedicated instance for each model, leading to N = O(M)

For example, Qwen2 72B is rarely used these days. And yet it will take up 2 of their H20 GPUs (with 96GB VRAM) to serve, at the bare minimum, assuming that they don't quantize the BF16 down to FP8 (and I don't think they would, although other providers probably would). And then there are other older models, like the Qwen 2.5, Qwen 2, Qwen 1.5, and Qwen 1 series models. They all take up VRAM if the endpoint is active!
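
Back-of-the-envelope for the weights alone, which is why a single 96 GB H20 isn't enough (real deployments also need KV cache and runtime overhead on top):

  # Rough weight-memory estimate for Qwen2 72B in BF16; ignores KV cache,
  # activations and framework overhead, which only make it worse.
  params = 72e9
  bytes_per_param = 2                      # BF16
  weights_gb = params * bytes_per_param / 1e9
  print(weights_gb)        # 144.0 GB of weights
  print(weights_gb / 96)   # ~1.5 -> at least 2 x 96 GB H20 GPUs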

Alibaba cannot easily just timeout these models from VRAM, even if they only get 1 request per hour.

That's the issue. Their backlog of models takes up a large amount of VRAM, and yet gets ZERO compute most of the time! You can easily use an easing function to scale up from 2 GPUs to 200 GPUs, but you cannot ever timeout the last 2 GPUs that are serving the model.

If you read the paper linked above, it's actually quite interesting how Alibaba goes and solves this problem.

Meanwhile on the other hand, Deepseek solves the issue by just saying "fuck you, we're serving only our latest model and you can deal with it". They're pretty pragmatic about it at least.


The thundering herd breaks this scheme.


If I had to handle this problem, I'd do some kind of "split on existing loaded GPUs" for new sessions, and then when some cap is hit, spool an additional GPU in the background and transfer the new session to that GPU as soon as the model is loaded.

I'd have to play with the configuration and load calcs, but I'm sure there's a low param, neat solution to the request/service problem.
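
Something like this toy sketch (Gpu and spool_up are made-up stand-ins, and loading happens synchronously here instead of in the background):

  from dataclasses import dataclass, field

  CAP = 8  # max concurrent sessions per GPU before we warm up another one

  @dataclass
  class Gpu:
      sessions: list = field(default_factory=list)

  def place_session(session, loaded, spool_up):
      # Prefer the least-loaded GPU that still has room under the cap.
      candidates = [g for g in loaded if len(g.sessions) < CAP]
      if candidates:
          gpu = min(candidates, key=lambda g: len(g.sessions))
      else:
          gpu = spool_up()       # load the model onto a fresh GPU
          loaded.append(gpu)
      gpu.sessions.append(session)
      return gpu

  # Example: pack three sessions, spooling the first GPU on demand.
  pool = []
  for s in ("a", "b", "c"):
      place_session(s, pool, spool_up=Gpu)
  print([len(g.sessions) for g in pool])   # [3] -- all fit under the cap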


Why does it take 60 seconds to load data from RAM to VRAM? Shouldn't the PCIE bandwidth allow it to fully load it in a few seconds?


Because ML infra is bloatware beyond belief.

If it was engineered right, it would take:

- transfer model weights from NVMe drive/RAM to GPU via PCIe

- upload tiny precompiled code to GPU

- run it with tiny CPU host code

But what you get instead is gigabytes of PyTorch + Nvidia docker container bloatware (hi Nvidia NeMo) that takes forever to start.
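
For scale, some ballpark figures (assumed bandwidths, not measurements): the raw PCIe copy really is only a handful of seconds once the weights are in host RAM, so most of a 60-second load is disk reads, deserialization and framework/CUDA startup.

  # Back-of-the-envelope load times for a ~144 GB checkpoint (72B BF16).
  # Bandwidths are rough, assumed figures.
  model_gb = 144
  pcie4_x16_gbps = 25    # practical host-to-GPU copy bandwidth
  nvme_gen4_gbps = 7     # a single fast NVMe drive, sequential read

  print(model_gb / pcie4_x16_gbps)   # ~6 s if weights are already in RAM / page cache
  print(model_gb / nvme_gen4_gbps)   # ~20 s if they come off one NVMe drive first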


That's why DeepSeek only serves two models.


How does this work with anything but trivially small context sizes!?


Tensor parallelism, so you only need to store a fraction of the KV cache per GPU.
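
Rough numbers for a Qwen2-72B-shaped model (80 layers, 8 KV heads via GQA, head_dim 128, BF16 cache; treat these as ballpark assumptions): the cache per GPU shrinks as the KV heads are split across the tensor-parallel group.

  def kv_cache_gb(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per=2, tp=1):
      # 2x for K and V; KV heads are sharded across the TP group.
      per_token_bytes = 2 * layers * (kv_heads / tp) * head_dim * bytes_per
      return tokens * per_token_bytes / 1e9

  print(kv_cache_gb(32_000, tp=1))   # ~10.5 GB for one 32k-token context
  print(kv_cache_gb(32_000, tp=2))   # ~5.2 GB per GPU with TP=2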


> I guess I’d assumed this sort of thing would be allocated dynamically

At the scale of a hyperscaler I think Alibaba is the one that would be doing that. AWS, Azure and I assume Alibaba do lease/rent data centers, but someone has to own the servers / GPU racks. I know there are specialized companies like nscale (and more further down the chain) in the mix, but I always assumed they only lease out fixed capacity.


The paper is about techniques to do that dynamic allocation to maximize utilization without incurring unacceptable latencies. If you let a GPU sit idle for several minutes after serving a single request, you're setting money on fire, so they reuse it for a different model as soon as possible, starting even before the first request is finished. But if you don't have a dedicated GPU for a model, are you going to wait for a multi-gigabyte transfer before each request? So they have a dedicated GPU (or two: one for prefill, one for decode) for a group of models that are processed in an interleaved fashion, scheduled such that they stay within the latency budget.


> Do the GPUs just sit there with the models on them when the models are not in use?

I've assumed that as well. It makes sense to me since loading up a model locally takes a while. I wonder if there is some sort of better way I'm not in the know about. That or too GPU poor to know about.


The models are huge, so no single (latest-gen) one can fit on a single GPU.

It's likely that these are small, unpopular (non-flagship) models, or that they only pack e.g. one layer of each model.


Per the very short article, the solution was to pack multiple models per GPU.


Yes, but that could mean a layer per model.


So 82% of 17.7%?

14.5% is worth a raise at least. But it’s still misleading.


I don't think that's what this is saying; isn't it that 100 - ~82 = 17.7%?


That is a confusing coincidence, but no.

> Reserving full GPU instances for these models leads to allocating 17.7% of our GPUs to serve only 1.35% of requests

> Deployment results show that Aegaeon reduces the number of GPUs required for serving these models from 1,192 to 213, highlighting an 82% GPU resource saving.

82% of their GPUs were serving 98.6% of all traffic. If they reduced the cluster size, they got it to 96.2% of their GPUs serving 98.6% of their traffic. If they reallocated those, which is more likely, then 96.8% of their GPUs are serving 98.6% of all requests, or around 17% more capacity for popular requests on the same hardware.
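
Rough math behind those figures, assuming the 1,192 GPUs for cold models were the whole 17.7% slice of one fleet:

  cold = 1192
  total = cold / 0.177               # ~6,734 GPUs implied
  hot = total - cold                 # ~5,542 GPUs serving the popular ~98.6%

  print(hot / (hot + 213))           # ~0.963 -> shrink the cluster
  print((total - 213) / total)       # ~0.968 -> keep the GPUs, reallocate them
  print((total - 213) / hot)         # ~1.18  -> roughly the "around 17%" extra capacity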


Not really, Figure 1(a) of the paper says that the 17.7% are relative to a total of 30k GPUs (i.e. 5310 GPUs for handling those 1.35% of requests) and the reduction is measured in a smaller beta deployment with only 47 different models (vs. the 733 "cold" models overall). Naïve extrapolation by model count suggests they would need 3321 GPUs to serve all cold models, a 37.5% reduction compared to before. (Or a 6.6% reduction of the full 30k-GPU cluster.)
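
The naive per-model extrapolation, using the paper's figures:

  gpus_beta, models_beta = 213, 47      # beta deployment
  models_cold, gpus_before, fleet = 733, 5310, 30_000

  gpus_cold = gpus_beta * models_cold / models_beta   # ~3,321 GPUs for all cold models
  print(1 - gpus_cold / gpus_before)                  # ~0.375 -> 37.5% reduction
  print((gpus_before - gpus_cold) / fleet)            # ~0.066 -> 6.6% of the 30k-GPU cluster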


Really:

"A paper presented at SOSP 2025 details how token-level scheduling helped one GPU serve multiple LLMs, reducing demand from 1,192 to 213 H20s."

Which, if you scale it, matches the GP's statement.


From the SCMP article you might get the impression that the various figures all refer to the same GPU cluster, but in the paper itself it's very clear that this is not the case, i.e. the 213 GPUs in the smaller cluster are not serving 1.35% of the requests in the larger cluster. Then if you want to scale it, you have a choice of different numbers you could scale, and each would get different results. Since they're constrained by the limited number of different models a single GPU can serve, I think scaling by the number of models is the most realistic option.


Doesn't sound right.


In the past, software and computer engineers would tackle problems head-on, designing algorithms and finding creative solutions.

Thanks to the US restrictions on the (Chinese) semiconductor industry, Chinese engineers are being forced to innovate and find their own ways to overcome challenges, like the old-school engineers did (what Silicon Valley used to be).


If you're one who sees progress as an end goal unto itself, what you describe is a good thing. When one party is attempting novel solutions to outcompete the competition, we will get to whatever the next change is faster.

That said, I'm not sure what the US policies specifically have to do with this. Countries are always in competition with one another, and if one industry or technology is considered a national security threat they will guard it.


If AI is a threat to other nations, why is anyone even supporting this? Are we really trying to annihilate the planet as quickly as possible?


For the same reason we built nuclear bombs, "if I don't my enemies will"


Are we creating enemies so quickly like this?


AI is a bomb that has already been released.


Two of the downed drones were on a pretty reasonable path to fly back into Ukraine.

https://x.com/Tatarigami_UA/status/1965668064865013884


Thanks, that provides useful context.


> I thought that this kind of experience was reserved for Eastern Europeans

Being casually racist, on the other hand, is a time-honored pan-European tradition, proudly upheld by the Swiss.


Team X is responsible for feature Foo; feature Foo is slow; team X introduces Foo-preload, metrics go up, person responsible gets a bonus.

Multiply that by tens (or even hundreds) of teams and your app startup (either on desktop or mobile) is now a bloated mess. Happened to Office, Facebook iOS and countless others.

One solution is to treat startup cycles as a resource similar to e.g. size or backend servers.
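
For example (completely hypothetical budgets and helper, just to illustrate treating startup cycles as a budgeted resource the way binary size often is):

  # Hypothetical CI gate: every team gets an explicit startup-time budget,
  # and a regression blocks the change the same way a size regression would.
  STARTUP_BUDGET_MS = {"foo_preload": 30, "telemetry": 10, "updater": 15}

  def check_startup_budgets(measured_ms):
      violations = []
      for component, budget in STARTUP_BUDGET_MS.items():
          spent = measured_ms.get(component, 0.0)
          if spent > budget:
              violations.append(f"{component}: {spent:.1f} ms > {budget} ms budget")
      return violations

  print(check_startup_budgets({"foo_preload": 45.2, "telemetry": 8.0}))
  # ['foo_preload: 45.2 ms > 30 ms budget']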


> One solution is to treat startup cycles as a resource similar to e.g. size or backend servers.

The only way to achieve performance metrics in a large org IMO.

Google Search is still fast because if you degrade p99 latency an SRE will roll back your change. MacBooks still have good battery life because Apple have an army of QA engineers, and if they see a bump on their ammeters, that macOS release doesn't go ahead.

Everything else (especially talking about "engineers these days just don't know how to write efficient code") is noise. In big tech projects you get the requirements your org encodes in its metrics and processes. You don't get the others. It's as simple as that.

Never worked at MS but it's obvious to me that the reason Windows is shit is that the things that would make it good simply aren't objectives for MS.


As an ex-Microsoft SDET who worked on Windows, we used to test for those things as well. In 2014.

Then Microsoft made the brave decision that testers were simply unnecessary. So they laid off all SDETs, then decided that SDEs should be in charge of the tests themselves.

Which effectively made it so there was no test coverage of Windows at all, as the majority of SDEs had not interacted with the test system prior to that point. Many/most of them did not know how to run even a single test, let alone interpret its results.

This is what Microsoft management wanted, so this is what they got. I would not expect improvement, only slow degradation as Windows becomes Bing Desktop, featuring Office and Copilot (Powered By Azure™).


Makes perfect sense. It recently became clear to me (e.g. [2]) that it's not a cohesive concept, but to me personally this is the meaning of POSIWID [1].

Basically making Windows a good desktop OS is not in any meaningful way the "purpose" of that part of MS. The "purpose" of any team of 20+ SWEs _is_ the set of objectives they measure and act upon. That's the only way you can actually predict the outcomes of its work.

And the corollary is that you can usually quite clearly look at the output of such an org and infer what its "purpose" is, when defined as such.

[1] https://en.m.wikipedia.org/wiki/The_purpose_of_a_system_is_w...

[2] https://www.astralcodexten.com/p/highlights-from-the-comment...


Then the OS team will fight back with options to disable all of these startup things, like the Startup tab in Windows Task Manager with an "impact" column and an easy button to disable annoying startup programs. It's interesting to even see it play out within the same company.


The only impact values I see on my home machine are "Not measured" and "None".


The solution is simple: New OKRs and KPIs in the next cycle reversing some of the current ones, then new bonuses for reaching them. Repeat.


“I could do this all day!”


The Office codebase is probably soon going to be older than most people who work on it.


That could already be the case. The initial release is from 1990, so the codebase is at least 35 years old.

I don't have a good guess for the average age of software developers at Microsoft, but claude.ai guesses the average "around 33-38 years" and the median "around 35-36 years old".


"but claude.ai guesses"

To my ears this is the equivalent of "some guy down the pub said", but maybe I am a luddite.


You're not a luddite; they disclosed it because you're _supposed_ to take it with a grain of salt.


Office was released in 1990, but Excel in 1985 and Word in 1983.


I'm told by MS friends that there are still files with the intact 1987 changelog in Word, as well as workarounds for dot matrix printers that were released 40+ years ago.

Also, the Office codebase is significantly larger than Windows (and has been for a while); that was surprising to me.


make the apps trade with each other using cpu/memory as money lol and they earn money by usage


No. No. No.

Microsoft need to update the spec for all new personal computers to include mandatory pre-load hardware. This would have a secondary CPU, RAM and storage used for pre-loading licensed Office products before your laptop boots. AI would analyse your usage patterns and fire-up Office for you before you even get to work in the morning.

Perhaps this could even allow you to have Office on-hand, ready-to-use on its own hardware module, while you develop Linux applications on your main CPU.

Further down the line, someone sees an opportunity to provide access to compatible modules in the cloud, allowing re-use of older incompatible hardware. But there would be the danger that the service (without the support of MS) may go bust, leaving those users without their mandatory instant access to licensed Office products, forcing upgrades to even newer hardware.


Raymond Chen wrote about this.


I believe a big crux is in the definition of "war ended".

You (and Donald Trump) seem to be using "Ukraine and Russia stop shooting at each other right now", while Ukraine operates more under "Russia stops shooting at us for the foreseeable future, 20 years at least." Russia has previously broken a number of ceasefires and written agreements (including the infamous Budapest memorandum), and so Ukraine is not super trusting of agreements not backed by anything.


What Ukraine will accept is entirely dependent on how much funding they will get from foreign powers to continue their war effort.

I've had a lot of responses to my comment, yet I've seen no alternative ideas presented that will result in a different outcome. What is your plan for getting Russia to lose this war?


> or you can include them in the broad pool, and the people with a full-cinderblock home in a non-flammable state pay $20 more a year so the entire endeavour can work

And you immediately start losing customers to insurers that either did the former or left LA altogether. This changes the $20 surcharge into a $25 surcharge, causing more customers to leave, causing the surcharge to increase, and so on.
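
A toy illustration of that spiral, with made-up numbers:

  # Fixed subsidy spread over a shrinking pool of payers: each price hike
  # pushes out more of the low-risk customers, which raises the price again.
  pool_cost = 20_000_000          # what the cross-subsidy has to cover
  customers = 1_000_000           # initially $20 per customer
  for step in range(5):
      surcharge = pool_cost / customers
      print(f"step {step}: {customers:,} customers, surcharge ${surcharge:.2f}")
      customers = int(customers * 0.80)   # assume 20% of payers leave each step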


> cleaner air

I'm not sure that move will help much. EVs' PM2.5 footprint is about 0.5x that of ICE vehicles [0]. Good, but not game-changing.

[0] https://pubmed.ncbi.nlm.nih.gov/35760182/


> IRI measures how much a car moves vertically as it travels over a given distance, and is typically given in units like “inches per mile” or “millimeters per meter.”

How accurate are phone accelerometers these days? Could Uber/Lyft/etc. start collecting that data from drivers' phones?
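
A crude version of what they could compute on-device (this is not the actual IRI, which is defined via a quarter-car simulation over a measured elevation profile; just an RMS-acceleration proxy, speed-normalized so faster trips don't look rougher):

  import math

  def roughness_proxy(vert_accel_ms2, speed_ms):
      # RMS of gravity-removed vertical acceleration, divided by speed.
      n = len(vert_accel_ms2)
      mean = sum(vert_accel_ms2) / n
      rms = math.sqrt(sum((a - mean) ** 2 for a in vert_accel_ms2) / n)
      return rms / max(speed_ms, 1.0)

  # A few accelerometer samples (m/s^2) from a phone in a car doing 15 m/s.
  print(roughness_proxy([0.1, -0.3, 0.8, -0.6, 0.2, -0.1], 15.0))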


There's a reasonably simple change that would allow the application of Harberger taxes (rough sketch in code after the list).

1. Restrictions have to have a "controlling party": a dedicated party that controls the restriction and can agree to lift it. The classic example would be an HOA, but it can also be a seller who wants to sell a property with additional restrictions.

2. The controlling party sets the price of the restriction.

3. The restricted party can remove the restriction by paying the price set in 2 to the controlling party.

4. The controlling party pays tax as a percentage of the price set in 2.
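
Toy model of the mechanism above (names, rate and numbers are illustrative only):

  from dataclasses import dataclass

  @dataclass
  class Restriction:
      controlling_party: str    # e.g. an HOA, or the original seller      (rule 1)
      declared_price: float     # self-assessed by the controlling party   (rule 2)
      tax_rate: float = 0.02    # annual Harberger-style tax on that price (rule 4)

      def buyout_price(self) -> float:
          # The restricted party can always remove the restriction at this price (rule 3).
          return self.declared_price

      def annual_tax(self) -> float:
          return self.declared_price * self.tax_rate

  r = Restriction("HOA", declared_price=50_000)
  print(r.buyout_price(), r.annual_tax())   # 50000 1000.0 -- overpricing the restriction is costly to hold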


So you still have renters paying property taxes.

Renters are the controlling party. Landlords are the restricted party: they cannot rent to other people or use the house themselves.


In a supply-restricted market, taxes and subsidies are ultimately paid by the supplier, independent of who actually transfers the money to the government.


Sure, but that doesn't preclude crappy government tax policy having an impact on markets.

Consumption and supply are not fixed, and taxes aren't a free lunch.

