
You can get registered DDR4 for ~$1/GB. A trillion-parameter model in FP16 would need ~2TB. Servers that support that much are actually cheap (~$200); the main cost would be the ~$2,000 in memory itself. That is going to be dog slow, but you can certainly do it if you want to, and it doesn't cost $50,000.
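Rough math behind those numbers, assuming 2 bytes per FP16 parameter and the ~$1/GB figure (a back-of-envelope sketch that ignores GB-vs-GiB and any overhead):

  # Back-of-envelope: footprint and RAM cost of a 1T-parameter model in FP16.
  # Assumes 2 bytes per FP16 weight and the ~$1/GB used-RDIMM price above.
  params = 1e12
  bytes_per_param = 2
  model_gb = params * bytes_per_param / 1e9   # ~2000 GB, i.e. ~2 TB
  ram_cost = model_gb * 1.00                  # at ~$1/GB
  print(f"~{model_gb:.0f} GB of weights, ~${ram_cost:.0f} in DDR4")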


Even looking on Amazon, DDR4 still seems to be a decent bit above $2/GB (per-GB math below the list):

2 x 32GB: $142

2 x 64GB: $318

8GB: $16

2 x 16GB: $64

2TB of 128GB DDR4 ECC: $9,600 (https://www.amazon.com/NEMIX-RAM-Registered-Compatible-Mothe...)
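For comparison, the $/GB those listings work out to (treating the 2TB kit as 16 x 128GB):

  # $/GB implied by each of the listings above (price divided by capacity).
  listings = {
      "2 x 32GB":         (142, 64),
      "2 x 64GB":         (318, 128),
      "8GB":              (16, 8),
      "2 x 16GB":         (64, 32),
      "2TB (16 x 128GB)": (9600, 2048),
  }
  for name, (price, gb) in listings.items():
      print(f"{name}: ${price / gb:.2f}/GB")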

> Servers that support that much are actually cheap (~$200)

What does this mean? What motherboards support 2TB of RAM at $200? Most of them are pushing $1,000. With no CPU.

It may not hit $50K, but it's definitely not going to be $2K.


Here's a server that supports 3TB of memory for $130. You get 3TB by filling all 24 memory slots with 128GB LRDIMMs, or 2TB by filling 16:

https://www.ebay.com/itm/176298520843

Here are 128GB LRDIMMs for $98:

https://www.ebay.com/itm/196305803969

For 2TB of memory plus the server you're at $1,698. You can get a drive bracket for a few bucks and a 2TB SSD for $100 and still have almost $200 left over (out of the ~$2,000 estimate) to put faster CPUs in it if you want to.
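Adding up that parts list (prices from the links above, nothing else included):

  # Parts total for the 2TB build described above.
  server = 130            # eBay server supporting 3TB
  lrdimm = 98             # 128GB LRDIMM, eBay
  ssd = 100               # 2TB SSD
  ram_and_server = server + 16 * lrdimm       # 16 x 128GB = 2TB -> $1,698
  total = ram_and_server + ssd                # $1,798
  print(ram_and_server, total, 2000 - total)  # 1698 1798 202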

That's stinking Optane; it would work if you're desperate, but normal 128GB LRDIMMs cost more per GB than other DDR4 DIMMs. You can, however, get DDR4 RDIMMs for ~$1/GB:

https://www.ebay.com/itm/186345903230

With 32GB RDIMMs that machine would max out at 768GB, which could still run a 1T-parameter model at q4 or Grok at FP16. And then it would cost less than $1,000.
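Rough fit check, assuming ~0.5 bytes per weight for q4 and, for Grok, Grok-1's ~314B parameters at FP16; both are approximations and ignore KV cache and other overhead:

  # Rough fit check for the 768GB config.
  capacity_gb = 24 * 32                # 24 slots of 32GB RDIMMs
  q4_1t_gb = 1e12 * 0.5 / 1e9          # 1T params at ~4 bits/weight -> ~500 GB
  grok1_fp16_gb = 314e9 * 2 / 1e9      # assuming Grok-1's ~314B params -> ~628 GB
  print(capacity_gb, q4_1t_gb, grok1_fp16_gb)  # 768 500.0 628.0 -- both fit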

Or find a quad-socket system with 48 memory slots and then use 64GB LRDIMMs ($1.12/GB):

https://www.ebay.com/itm/176299295509

The quad-socket systems aren't $200, but you can find them for $550 or so:

https://www.newegg.com/hp-proliant-rack-mount/p/2NS-0006-3E5...

Maybe less if you shop around (they're not as common).
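Rough totals for that route, using the $1.12/GB and ~$550 figures above:

  # Rough cost to fill a 48-slot quad-socket box with 64GB LRDIMMs.
  total_gb = 48 * 64                   # 3072 GB = 3 TB
  ram_cost = total_gb * 1.12           # ~$1.12/GB from the listing above
  server = 550
  print(total_gb, round(ram_cost), round(ram_cost + server))  # 3072 3441 3991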


How slow? Depending on the task, I fear it could be too slow to be useful.

I believe there is some research on how to distribute large models across multiple GPUs, which could make the cost less lumpy.


You can get a decent approximation for LLM performance in tokens/second by dividing the model size in GB by the system's memory bandwidth. That's assuming it's well-optimized and memory rather than compute bound, but those are often both true or pretty close.

And "depending on the task" is the point. There are systems that would be uselessly slow for real-time interaction but if your concern is to have it process confidential data you don't want to upload to a third party you can just let it run and come back whenever it finishes. And releasing the model allows people to do the latter even if machines necessary to do the former are still prohibitively expensive.

Also, hardware gets cheaper over time, and it's useful to have the model out there so it's well-optimized and stable by the time fast hardware becomes affordable, instead of waiting for the hardware and only then getting to work on the code.


Why would increasing memory bandwidth reduce performance? You said "You can get a decent approximation for LLM performance in tokens/second by dividing the model size in GB by the system's memory bandwidth"


Yeah, the sentence is backwards; you divide the system's memory bandwidth by the size of the model.
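To illustrate with the corrected formula (bandwidth divided by model size), using an assumed ~200 GB/s of sustained bandwidth for an older dual-socket DDR4 box; real numbers depend on the machine and how well the software uses both sockets:

  # tokens/s ~= memory bandwidth / model size, assuming memory-bound decode
  # and a single pass over the weights per token.
  bandwidth_gbps = 200     # illustrative sustained GB/s for a dual-socket DDR4 box
  model_gb = 2000          # 1T params in FP16
  print(f"~{bandwidth_gbps / model_gb:.2f} tokens/s")  # ~0.10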



