What is the fastest documented way so far to serve the full R1 or V3 models (Q8,... | Hacker News

Hacker Newsnew | past | comments | ask | show | jobs | submit

		pama 12 months ago \| parent \| context \| favorite \| on: How to Run DeepSeek R1 671B Locally on a $2000 EPY... What is the fastest documented way so far to serve the full R1 or V3 models (Q8, not Q4) if the main purpose is inference with many parallel queries and maximizing the total tokens per sec? Did anyone document and benchmark efficient distributed service setups?

manmal 12 months ago | [–]

The top comment in this thread mentions a 6k setup, which likely could be used with vLLM with more tinkering. AFAIK vLLM‘s batched inference is great.

snovv_crash 12 months ago | [–]

You need enough VRAM to hold the whole thing plus context. So probably a bunch of H100s, or MI300s.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact