Hacker News | petuman's comments

> assuming I need 1k tokens/second throughput (on each, so 20 x 1k)

3.6B activated parameters at Q8 (roughly 1 byte per parameter) x 1000 t/s = 3.6 TB/s of memory bandwidth just for the activated model weights (reading the context comes on top of that). So pretty much straight to B200 and the like. 1000 t/s per user/agent is way too fast; make it 300 t/s (about 1.1 TB/s) and you could get away with a 5090 or RTX PRO 6000.
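The arithmetic above as a rough Python sketch. It assumes decoding is memory-bandwidth bound, that every generated token reads all activated weights once, and that Q8 is exactly 1 byte per parameter (real Q8 formats carry a little extra for scales). The function name and parameters are made up for illustration, and context reads are ignored.

```python
def weight_bandwidth_tb_s(active_params_billions, bytes_per_param, tokens_per_second):
    """Estimate memory bandwidth (TB/s) needed to stream activated weights.

    Assumption: each generated token reads every activated weight once,
    with no batching across users and no KV cache / context traffic counted.
    """
    bytes_per_token = active_params_billions * 1e9 * bytes_per_param
    return bytes_per_token * tokens_per_second / 1e12


# 3.6B activated parameters at ~1 byte each (Q8, assumed), per user/agent
fast = weight_bandwidth_tb_s(3.6, 1.0, 1000)  # the 1000 t/s target
slow = weight_bandwidth_tb_s(3.6, 1.0, 300)   # the relaxed 300 t/s target

print(f"1000 t/s: ~{fast:.2f} TB/s")
print(f" 300 t/s: ~{slow:.2f} TB/s")
```

Compare the result against the memory bandwidth on the GPU's spec sheet. This is a lower bound, since context reads add to it.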

