Ask HN: Are there any ready-to-use Docker images for running LLMs locally?
1 point by jiggawatts on April 23, 2023 | 6 comments
I've been having a lot of trouble spinning up the various stacks for running open LLMs like Alpaca or Vicuna because they often require specific CUDA versions, specific gcc toolchains, etc...

Has anyone got a dockerfile or published container image that "just works" and can run 4-bit quantized models on CPUs and/or GPUs? Ideally something that will run StableLM.

I've tried to build such a thing myself, but I found that the vague instructions in blog posts aren't sufficient for a reproducible build. Too many instances of "clone this (ever-changing) Git repo" or "just curl & execute this", leading to very rapid bit-rot where even instructions from a month ago can't be reproduced!
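To make it concrete, this is roughly the shape of Dockerfile I'm after, where the base image tag, repo URL and commit are placeholders rather than anything I've actually verified:

    # everything pinned: base image by tag (ideally by digest), repo by commit
    FROM nvidia/cuda:11.8.0-devel-ubuntu22.04

    RUN apt-get update && apt-get install -y --no-install-recommends \
            git python3 python3-pip \
        && rm -rf /var/lib/apt/lists/*

    # placeholder: substitute a known-good commit of whatever runner you use
    ARG REPO_COMMIT=put-a-known-good-sha-here
    RUN git clone https://github.com/example/llm-runner /app \
        && git -C /app checkout "$REPO_COMMIT" \
        && pip3 install --no-cache-dir -r /app/requirements.txt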



4-bit LLaMA is messy because there are essentially three variants:

- 4-bit CUDA

- 4-bit Triton

- 4-bit CPU

https://github.com/oobabooga/text-generation-webui/blob/main...

Models need to be quantized specifically for each variant, and these branches are under heavy (daily) development... you really want to git pull them all the time.
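e.g. keep one checkout per variant and refresh them in one go (the directory names here are just hypothetical):

    # pull the latest commit for each variant's checkout
    for d in gptq-cuda gptq-triton llama.cpp; do
        (cd "$d" && git pull --ff-only)
    done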

What OS are you running? A Linux distro, I presume?


I'm using Windows 11 with WSL 2, so in principle I can run Linux CUDA docker images.

While I have 64 GB of memory and an okay NVIDIA GPU, what I would like is a Docker image I can experiment with locally, but run on a cloud-hosted "spot" instance if I want to do something heavyweight, without having to run through the install process from scratch each time.

Similarly with the quantisation process: for some reason very few pre-quantised files are being published. It apparently takes a lot of CPU and RAM to produce the 4-bit versions. With a container image, I could do this in the cloud, then download the result to run on a more constrained device, etc...
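Something like this is the workflow I'm imagining, where the image name, paths and the actual quantization command are all placeholders:

    # on the cloud instance: run the quantization inside the container,
    # writing the 4-bit output to a mounted host directory
    docker run --rm --gpus all \
        -v "$PWD/models:/models" \
        my-llm-image:latest \
        quantize-model --output /models   # placeholder command, not a real tool

    # then pull the result back down to the local, more constrained machine
    scp cloud-host:models/model-4bit.safetensors ./models/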


Ah. Well, you can build a Dockerfile (with the 4-bit Triton repo) locally from here, then deploy it: https://github.com/oobabooga/text-generation-webui#alternati...

I haven't explored beyond that.
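But the build-and-ship part is just the usual Docker flow; something like this, with the registry and tag as placeholders (the port assumes the web UI's default):

    # locally: build once and push to a registry you control
    docker build -t registry.example.com/me/textgen:2023-04 .
    docker push registry.example.com/me/textgen:2023-04

    # on the spot instance: pull and run, no rebuild needed
    docker pull registry.example.com/me/textgen:2023-04
    docker run --rm --gpus all -p 7860:7860 registry.example.com/me/textgen:2023-04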

Actually, there are tons of quantizations on Hugging Face; just search for "4bit". 128g usually (but not always) means a Triton variant, ggml means a CPU variant. I know that Vicuna 7B/13B and the "uncensored" Vicuna 7B are quantized on HF, but there are already some newer, supposedly better instruction-following mixes.


This line jumps out at me right away:

    RUN git clone https://github.com/oobabooga/GPTQ-for-LLaMa /build

This makes this Dockerfile completely non-reproducible. I know how to work around it, but seriously, is this how Python users in the AI/ML space do things regularly!? Just make something work "once" on "my machine" with the "current commit" and then call it a day?
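The obvious workaround is to pin the clone to a known commit, e.g. (the default value below is a placeholder, not a commit I've checked):

    # same clone, but checked out at a fixed commit so builds are repeatable
    ARG GPTQ_COMMIT=put-a-known-good-sha-here
    RUN git clone https://github.com/oobabooga/GPTQ-for-LLaMa /build \
        && git -C /build checkout "$GPTQ_COMMIT"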

It's a really shocking experience coming from the Rust and C# ecosystems, where everything is strongly versioned, packages have "lock" files, and there are even SHA hashes to really pin things down. Meanwhile in this world it's just: "Get the latest version of this thing. It won't work, of course, because it only worked for a week in 2023 when I did it, but good luck!"
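Python does have a rough equivalent, it's just rarely used in these repos; for example, with pip-tools:

    # lock the top-level deps from requirements.in, recording SHA-256 hashes
    pip-compile --generate-hashes requirements.in
    # pip then refuses to install anything whose hash doesn't match the lock file
    pip install --require-hashes -r requirements.txt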


> Seriously, is this how Python users in the AI/ML space do things regularly!? Just make something work "once" on "my machine" with the "current commit" and then call it a day?

Bingo! Welcome to ML land, you are truly starting to understand.

Honestly, the oobabooga repo is 100x better than average. Most ML repos are one-off research papers or demonstrations that barely work (or don't work at all) even in precise Python environments, and are then completely abandoned. Others live in dependency hell, with CUDA hell in particular being very common.


I'm using Serge[0] as an API for a local Discord bot. You probably won't find anything for StableLM this soon after release, but this will download and run the Ll*ma stuff with a decent web UI.

[0] https://github.com/nsarrazin/serge



