
Managed to get 1.8k tokens per second with a batch size of 60 when running vLLM with Mistral 7B on an A100 40GB in bfloat16. Pretty damn fast!

vllm==0.2.0 got released an hour or so ago, so it's pretty fresh. Let me know if you'd like anything else in there.
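A quick sanity check on the numbers above, assuming the 1.8k tokens/sec figure is aggregate throughput across the whole batch rather than per request:

```python
# Back-of-envelope check (assumption: 1.8k tok/s is the aggregate
# throughput across all 60 concurrent requests, not per stream).
aggregate_tps = 1800   # reported tokens per second
batch_size = 60        # concurrent requests

per_request_tps = aggregate_tps / batch_size
print(per_request_tps)  # 30.0 tokens/sec per concurrent request
```

So each individual request would still stream at roughly 30 tokens/sec, which is comfortably faster than reading speed; the win from batching is in aggregate serving capacity.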



I'd take GPT-4's slowness over any other model's speed, because with these models, quality is the most important thing.


i agree with your sentiment, but keep in mind that slowness could be a red herring. i find it plausible that while they degrade GPT-4's quality to (presumably) lower their costs while maintaining or raising the price, they might also add subtle delays to give the impression that the app is doing hard, high-quality work.

kind of like that infamous Android virus scanner app that just ran a timer driving the work-in-progress animation to give the impression of real work being done.




