Ask HN: Are there any ready-to-use Docker images for running LLMs locally?
1 point by jiggawatts on April 23, 2023 | 6 comments
I've been having a lot of trouble spinning up the various stacks for running open LLMs like Alpaca or Vicuna because they often require specific CUDA versions, specific gcc toolchains, etc...

Has anyone got a dockerfile or published container image that "just works" and can run 4-bit quantized models on CPUs and/or GPUs? Ideally something that will run StableLM.

I've tried to build such a thing myself, but I found that the vague instructions in blog posts aren't sufficient for a reproducible build. Too many instances of "clone this (ever-changing) Git repo" or "just curl & execute this", leading to very rapid bit-rot where even instructions from a month ago can't be reproduced!
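To make it concrete, this is roughly the shape of Dockerfile I'm after, where the base image tag, repo URL and commit are placeholders rather than anything I've actually verified:

    # everything pinned: base image by tag (ideally by digest), repo by commit
    FROM nvidia/cuda:11.8.0-devel-ubuntu22.04

    RUN apt-get update && apt-get install -y --no-install-recommends \
            git python3 python3-pip \
        && rm -rf /var/lib/apt/lists/*

    # placeholder: substitute a known-good commit of whatever runner you use
    ARG REPO_COMMIT=put-a-known-good-sha-here
    RUN git clone https://github.com/example/llm-runner /app \
        && git -C /app checkout "$REPO_COMMIT" \
        && pip3 install --no-cache-dir -r /app/requirements.txt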



4-bit LLaMA is messy because there are essentially three variants:

- 4-bit CUDA

- 4-bit Triton

- 4-bit CPU

https://github.com/oobabooga/text-generation-webui/blob/main...

Models need to be quantized specifically for each variant, and these branches are under heavy (daily) development... you really want to git pull them all the time.
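e.g. keep one checkout per variant and refresh them in one go (the directory names here are just hypothetical):

    # pull the latest commit for each variant's checkout
    for d in gptq-cuda gptq-triton llama.cpp; do
        (cd "$d" && git pull --ff-only)
    done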

What OS are you running? A Linux distro, I presume?


I'm using Windows 11 with WSL 2, so in principle I can run Linux CUDA docker images.

While I have 64 GB of memory and an okay NVIDIA GPU, what I would like is a Docker image I can experiment with locally, but run on a cloud-hosted "spot" instance if I want to do something heavyweight, without having to run through the install process from scratch each time.

Similarly with the quantisation process: for some reason very few pre-quantised files are being published. It apparently takes a lot of CPU and RAM to produce the 4-bit versions. With a container image, I could do this in the cloud, then download the result to run on a more constrained device, etc...
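Something like this is the workflow I'm imagining, where the image name, paths and the actual quantization command are all placeholders:

    # on the cloud instance: run the quantization inside the container,
    # writing the 4-bit output to a mounted host directory
    docker run --rm --gpus all \
        -v "$PWD/models:/models" \
        my-llm-image:latest \
        quantize-model --output /models   # placeholder command, not a real tool

    # then pull the result back down to the local, more constrained machine
    scp cloud-host:models/model-4bit.safetensors ./models/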


Ah. Well, you can build a Dockerfile (with the 4-bit Triton repo) locally from here, then deploy it: https://github.com/oobabooga/text-generation-webui#alternati...

I haven't explored beyond that.
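But the build-and-ship part is just the usual Docker flow; something like this, with the registry and tag as placeholders (the port assumes the web UI's default):

    # locally: build once and push to a registry you control
    docker build -t registry.example.com/me/textgen:2023-04 .
    docker push registry.example.com/me/textgen:2023-04

    # on the spot instance: pull and run, no rebuild needed
    docker pull registry.example.com/me/textgen:2023-04
    docker run --rm --gpus all -p 7860:7860 registry.example.com/me/textgen:2023-04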

Actually, there are tons of quantizations on Hugging Face; just search for "4bit". 128g usually (but not always) means a Triton variant, ggml means a CPU variant. I know that Vicuna 7B/13B and the "uncensored" Vicuna 7B are quantized on HF, but there are already some newer, supposedly better instruction-following mixes.


This line jumps out at me right away:

    RUN git clone https://github.com/oobabooga/GPTQ-for-LLaMa /build

This makes this Dockerfile completely non-reproducible. I know how to work around it, but seriously, is this how Python users in the AI/ML space do things regularly!? Just make something work "once" on "my machine" with the "current commit" and then call it a day?
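The obvious workaround is to pin the clone to a known commit, e.g. (the default value below is a placeholder, not a commit I've checked):

    # same clone, but checked out at a fixed commit so builds are repeatable
    ARG GPTQ_COMMIT=put-a-known-good-sha-here
    RUN git clone https://github.com/oobabooga/GPTQ-for-LLaMa /build \
        && git -C /build checkout "$GPTQ_COMMIT"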

It's a really shocking experience coming from the Rust and C# ecosystems, where everything is strongly versioned, packages have "lock" files, and there are even SHA hashes to really pin things down. Meanwhile in this world it's just: "Get the latest version of this thing. It won't work, of course, because it only worked for a week in 2023 when I did it, but good luck!"
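Python does have a rough equivalent, it's just rarely used in these repos; for example, with pip-tools:

    # lock the top-level deps from requirements.in, recording SHA-256 hashes
    pip-compile --generate-hashes requirements.in
    # pip then refuses to install anything whose hash doesn't match the lock file
    pip install --require-hashes -r requirements.txt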


> Seriously, is this how Python users in the AI/ML space do things regularly!? Just make something work "once" on "my machine" with the "current commit" and then call it a day?

Bingo! Welcome to ML land, you are truly starting to understand.

Honestly, the oobabooga repo is 100x better than average. Most ML repos are one-off research papers or demonstrations that barely work (or don't work at all) even in precise Python environments, and are then completely abandoned. Others live in dependency hell, with CUDA hell in particular being very common.


I'm using Serge[0] as an API for a local Discord bot. You probably won't find anything for StableLM this soon after release, but this will download and run the Ll*ma stuff with a decent web UI.

[0] https://github.com/nsarrazin/serge



