With system builds like this I always feel the VRAM is the limiting factor when it comes to what models you can run, and consumer-grade stuff tends to max out at 16 GB or (sometimes) 24 GB on the more expensive cards.

It does make me wonder whether we'll start to see more and more computers with a unified memory architecture (like the Mac) - I know Nvidia has the Digits thing, which has since been renamed to something else.
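As a rough illustration (my own assumptions, not from the thread): the weights alone set a floor on VRAM, roughly parameters times bytes per parameter, plus some headroom for the KV cache and activations. A quick Python sketch:

    # Rough check of whether a model's weights fit in a given amount of VRAM.
    # Illustrative assumptions: 4-bit quantization ~= 0.5 bytes/param, plus a
    # couple of GB of headroom for KV cache and activations.
    def fits_in_vram(params_billions, bytes_per_param, vram_gb, overhead_gb=2.0):
        weights_gb = params_billions * bytes_per_param  # billions of params * bytes ~= GB
        return weights_gb + overhead_gb <= vram_gb

    print(fits_in_vram(13, 0.5, 16))  # True: ~6.5 GB of weights fits easily in 16 GB
    print(fits_in_vram(70, 0.5, 24))  # False: ~35 GB of weights alone exceeds 24 GB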



That’s what I hope for, but everything with unified memory that isn’t bananas expensive has very low memory bandwidth. DGX (Digits), Framework Desktop, and non-Ultra Macs are all around 128 GB/s, and will produce single-digit tokens per second for larger models: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...

So there’s a fundamental tradeoff between cost, inference speed, and hostable model size for the foreseeable future.
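For intuition on why ~128 GB/s translates to single-digit tokens per second: autoregressive decoding of a dense model is mostly memory-bandwidth-bound, since the weights are streamed once per generated token. A back-of-the-envelope sketch, where the 40 GB model size and the comparison bandwidths are my own illustrative numbers:

    # Bandwidth-bound upper bound on decode speed: tokens/s ~= bandwidth / weight bytes.
    # Ignores KV-cache traffic and compute, so real numbers will be lower.
    def est_tokens_per_sec(bandwidth_gb_s, model_size_gb):
        return bandwidth_gb_s / model_size_gb

    model_gb = 40  # e.g. a ~70B model at 4-bit quantization (assumption)
    for name, bw in [("~128 GB/s unified memory", 128),
                     ("M2 Ultra (~800 GB/s)", 800),
                     ("RTX 4090 (~1000 GB/s)", 1000)]:
        print(f"{name}: ~{est_tokens_per_sec(bw, model_gb):.1f} tok/s")
    # -> ~3.2, ~20.0, ~25.0 tok/s respectively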


Go with a server GPU (a Tesla), and 24 GB is not unusual. (They also go for about $300 used on eBay.)


But the compute speed is very low.



