Hacker News
FlexGen: Running large language models on a single GPU (github.com/fminference)
192 points by behnamoh on March 26, 2023 | 43 comments


What's really amazing about a lot of these recent projects is that they tend to provide benchmarks run on an Nvidia T4. They use these because they're relatively cheap from cloud providers and you can usually actually get them (as opposed to requesting and getting denied for an A100 or whatever).

For those who aren't familiar with it, these are tiny power- and density-optimized GPUs. I have the successor (A2) and its total max TDP is 60 watts. Single slot, slot-only power, and passively cooled.

Depending on the workload I observe it to be roughly 5-10x slower than a 3090, which means most people at home with a spare Nvidia gaming card (or whatever) will see results from these projects at a performance multiple of the benchmarks they provide.

The one caveat is that the T4/A2 have 16GB VRAM, which makes them more capable (albeit slower) than a "low end" desktop card like the 3070, which has only 8GB VRAM. But as HN readers know, there is incredible progress daily in reducing VRAM requirements for these models!


> a "low end" desktop card like the 3070 which has only 8GB VRAM

I'm finding that my 12GB 3060 is not enough to run even LLaMA-7B without CPU offload, thank Arceus for quantization...
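(For anyone hitting the same wall: a minimal sketch of the 8-bit bitsandbytes route via transformers. It assumes a transformers build with LLaMA support, bitsandbytes and accelerate installed, and weights already converted to HF format; the model path below is just a placeholder.)

    # Minimal sketch: load LLaMA-7B in 8-bit so it fits under 12GB of VRAM
    # (fp16 weights alone are ~13-14GB). Assumes transformers with LLaMA
    # support, plus bitsandbytes and accelerate; path is a placeholder.
    from transformers import AutoTokenizer, AutoModelForCausalLM

    model_path = "./llama-7b-hf"  # placeholder path to converted weights

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        load_in_8bit=True,   # bitsandbytes int8 quantization
        device_map="auto",   # let accelerate place layers on the GPU
    )

    prompt = "Building a PC for local LLM inference:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(output[0], skip_special_tokens=True))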


Give it time - there's a lot of incredible work being done here and we'll likely eventually see some kind of black magic to make it possible.


I'm just here waiting for the "LLM retard guide" to come out, as it happened for stable diffusion last August.


Copy-paste the install steps for llama.cpp into ChatGPT and ask it to explain things simply and to include any prerequisite steps you might need. If you get stuck, just ask ChatGPT and include any error messages.

https://github.com/ggerganov/llama.cpp


llama.cpp is easy to set up, but it's sort of slow even when pegging all cores of an overclocked i5-12400F, so I'd like to get my 3060 (with 12GB of VRAM) working... it's just that it's so difficult to get some sort of ChatGPT-style interface running. Haven't gotten anything working yet even though I have a working CUDA PyTorch install. Bleh bleh bleh~


Link to stable diffusion ref you're referring to? I was able to run the model and everything, so pretty familiar, but just wondering if you're referring to a specific document! Haha


Apparently this is the latest iteration: https://rentry.org/voldy


Link is not working


It is working for me right now.


OK, looks like the "LLM retard guide" is "run this installer": https://github.com/oobabooga/text-generation-webui/releases/...

That's the only thing that worked for me, but it was so easy and worked instantly.

It's a set of easily-auditable batch/bash scripts (depending on platform).


Previous discussion is here (266 comments): https://news.ycombinator.com/item?id=34869960


Best way to mess around with FlexGen and LLMs on local hardware in general is https://github.com/oobabooga/text-generation-webui
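If you'd rather drive FlexGen directly, the entry point (as I remember the README) is the flexgen.flex_opt module; the --percent flag takes six numbers splitting weights, attention cache, and activations between GPU and CPU. A hedged sketch, so double-check the flags against the current README:

    # Hedged sketch: run OPT-30B with FlexGen by offloading the weights to CPU
    # RAM while keeping the attention cache and activations on the GPU. The
    # module path, --percent semantics (weights GPU/CPU, cache GPU/CPU,
    # activations GPU/CPU) and --compress-weight flag are from my reading of
    # the FlexGen README; verify before relying on them.
    import subprocess
    import sys

    cmd = [
        sys.executable, "-m", "flexgen.flex_opt",
        "--model", "facebook/opt-30b",
        "--percent", "0", "100", "100", "0", "100", "0",
        "--compress-weight",  # optional weight compression to fit in less memory
    ]
    subprocess.run(cmd, check=True)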


Can't get it to run on an AMD 6700 XT even though there are ROCm installation instructions. Tried to run LLaMA 7B but got hung up because bitsandbytes calls CUDA.


Par for the course, I'm afraid.

Over the years I've made several attempts at AMD/ROCm. I've wasted so much time AMD owes me free GPUs for life ;).

At this point I've just accepted that AMD is irrelevant in ML and they're fine with that - they're not making much of an effort to change that. I really and truly wish that wasn't the case but I've accepted that it is.


> At this point I've just accepted that AMD is irrelevant in ML and they're fine with that - they're not making much of an effort to change that. I really and truly wish that wasn't the case but I've accepted that it is.

Same here. AMD's neglect of non-gaming uses is likely how Intel is going to become number 2 in GPUs.


Is there an equivalent to Nvidia-Docker for ROCm to make it brain dead easy? Where does the complexity come in for AMD GPUs?


It's far more complicated than that. Take a look at the Nvidia Frameworks Support Matrix[0]. The Nvidia PyTorch Docker container is 20GB uncompressed(!!!!) and consists of dozens of layers of Nvidia/CUDA tailored software stacks[1] that all come together to do this magical ML/AI stuff we take for granted on CUDA hardware. BTW, we have a nice, clean Nvidia CUDA docker situation because Nvidia has been working on it for years. It's rock solid and universally supported.

Moving to a lower level, check out the release notes for the latest Nvidia driver and look in amazement at the sheer number of supported GPUs[2]. In short, literally every GPU they've put in a laptop, desktop, workstation, or datacenter over the past decade (plus their embedded Jetson stuff). There are over 2,000 GPUs listed there and I can tell you from experience the "Unified" in CUDA holds up. If the card is supported by the driver, it has a compatible compute arch, and enough VRAM whatever you throw at it will just work. In many cases even on Windows! People complain about the proprietary driver but the fact is if your GPU says Nvidia on it you can install the one driver on any distro and have these projects up and running in minutes.

Compare that to the ROCm "list"[3] of what, maybe a dozen GPUs? I know from experience (unfortunately) that even with "supported" hardware in many cases just getting the driver to work is a nightmare. Then you have basic frameworks randomly crashing, etc. It's a complete mess and like I said - AMD is a decade behind today and Nvidia's lead is only growing.

[0] - https://docs.nvidia.com/deeplearning/frameworks/support-matr...

[1] - https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorc...

[2] - https://download.nvidia.com/XFree86/Linux-x86_64/530.41.03/R...

[3] - https://docs.amd.com/bundle/Hardware_and_Software_Reference_...


When you follow ROCm's happy path it works well for getting things like PyTorch running, but you do end up putting up with enterprise distros like Ubuntu 22.04 LTS and its network stack, which is often broken by default.

Enabling ROCm support for unlisted AMD GPUs & APUs is a feature flag away if you're running one of the janky enterprise distros that AMD has blessed.

Debian is slowly getting ROCm and PyTorch packages into the main Debian archive; once that's complete, using AMD hardware for machine learning should be a breezy apt install away, since Debian's machine learning team chooses sane defaults like enabling APU support. Derivative distros like PopOS, Linux Mint, Zorin, Kali, Ubuntu, and such will inherit this easy support from the upstreamed ROCm as well.


Genuinely curious - when you say “PyTorch runs” what are you doing with it?

When I last tried it a year or so ago it was (more or less) useless. Yes, it “ran”, but there were weird crashes and edge cases all over the place. Throw in the fact that 95% of documentation, examples, tools, benchmarks, etc. are still for CUDA, and after four years or so of embarking on this journey I think I’ve finally given up for good.

I happily and very firmly live in Nvidia/CUDA land now where I can see a story on HN and have it running on my GPU in under 10 minutes. Or 30 seconds if it’s a docker container.

You’re brave for running on unsupported GPUs! My experience was bad enough with the “supported” ones :).


It seems stable with Whisper and a few other models...


How do I turn on this feature flag? I'm quite happy on Xubuntu.


Read their GitHub; there's an issue where they cover the feature flag you need to set to enable ROCm on unsupported platforms.
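(If it saves anyone a search: the flag people usually mean for RDNA2 cards like the 6700 XT is the HSA_OVERRIDE_GFX_VERSION environment variable. The value below is the commonly cited one rather than something I've verified on every card, so treat it as an assumption and check the issue for your GPU.)

    # Hedged sketch: make the ROCm runtime treat an "unsupported" RDNA2 card
    # (e.g. RX 6700 XT, gfx1031) as the officially supported gfx1030 target.
    # Must be set before the HIP/ROCm runtime loads, i.e. before importing torch.
    import os
    os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"  # commonly cited RDNA2 value

    import torch  # ROCm builds of PyTorch expose the GPU through the CUDA API
    print(torch.cuda.is_available())
    print(torch.cuda.get_device_name(0))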


Adding to this: I used AMD's official amdgpu tool to install ROCm and now my card is failing to be recognized on boot by ubuntu ~80% of the time, causing a failure to load lightdm & start up. Tomorrow I'll dig around in journalctl to see if I can fix it.

I'm thiiiiis close to throwing up my hands and getting a 4070 but I've been advised that other than ML/CUDA, AMD>Nvidia for linux. Also the 4070 is 2x the price for just slightly better performance.


I’ve been there. Feel free to keep at it but unless your goal is to be a ROCm developer and contribute to the driver, upstream projects, etc IMO it’s just not worth it.

There is so much interesting, educational, and productive stuff going on with GPUs these days… At some point people need to ask themselves “did I buy a GPU to learn/use ML, or did I buy a GPU to beta test AMDs software stack?”

When it comes to GPU compute Nvidia cards are substantially cheaper if you value your time and sanity.

Note I don’t have any vested interest in Nvidia whatsoever, I’m just mad at myself for spending my time and money thinking AMD actually cared about any of this - they clearly don’t and the lack of progress I’ve seen and experienced over the past five years I’ve tried with them is insulting to me and the rest of their so called user base.


Any word on who has better vulkan support?


I don't know or have any experience because I've never needed/been interested in it.


Thank you so so much, this was the only way to get LLaMA running on my desktop's GPU! Everything else was plagued by everything from compile errors to version mismatches to miscompiled wheels to weird contradictions or whatever. I'm so happy that this works. I can finally use an LLM to my heart's content without relying on OpenAI and their stupid server load and phone number requirement


What's the "best"/most like GPT-4 model to use with this?



Excellent, thanks!


> This project was made possible thanks to a collaboration with ... Yandex Research ...

I'm all for global cooperation and fellowship. Are sanctions going to be a barrier for this and related projects?


It depends on the work and the sanctions regime, but in general:

- The best sanctions impact targeted industries (e.g. anything needed to build tanks, warplanes, etc.).

- The worst sanctions impact communications and collaboration. Change comes from conversations. Media, non-military education, and non-military academic collaboration usually make for bad sanctions.


Well, the sanctions are an attempt at denying China access to more powerful GPUs like the A100, so it seems the sanctions may actually accelerate research into running LLMs on lower performing HW.


What's so weird about that (not surprising because lawmakers/politics) is that A100s are actually terrible at inference. Assuming the model can fit in VRAM, my 3090 handily beats them across the board at everything I've tried. My 4090 blows them away. So by the most commonly accepted definition of "performance" the GPUs banned in China are actually /lower/ performance. The only thing this ban does is slow down from-scratch training of very large models.

There's a lot of effort on finetuning these large models on < 80GB VRAM hardware but significantly more effort on "running" them on < 24GB VRAM hardware (3090/4090 and lower).
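The rough "does it fit" math is just parameter count times bytes per parameter, plus headroom for the KV cache and activations; a quick sketch of the numbers involved:

    # Back-of-the-envelope VRAM needed just for the weights, ignoring the KV
    # cache and activations (budget an extra ~10-30% in practice).
    def weight_gb(params_billion: float, bits_per_param: int) -> float:
        return params_billion * bits_per_param / 8

    for name, params in [("7B", 7), ("13B", 13), ("30B", 30), ("65B", 65)]:
        sizes = {bits: round(weight_gb(params, bits), 1) for bits in (16, 8, 4)}
        print(name, sizes)
    # e.g. 30B is ~60GB in fp16 (A100/H100 territory), ~30GB in int8, and
    # ~15GB in int4 -- which is why quantization and offloading matter so much.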

In the end your point stands - between the efforts of those impacted by sanctions and the hobbyist/low-to-mid-range commercial applications, research into cramming these huge models into "consumer" hardware is accelerating dramatically.


The 3090 is not allowed by Nvidia to be used in datacenters, which China likely doesn't care about, so they have an advantage at the end of the day.


Interestingly that (in)famous language is only included in the driver EULA for the Nvidia Windows driver, not the Linux driver (last I looked).


Yandex still operates a lot of services in the EU under the Yango branding. They're trying to hide their relationship with the Kremlin-controlled Russian Yandex, but they do it quite badly. So I'm not sure why you'd assume sanctions will affect their research wing.

Also, it's likely that a good part of their research staff already works from Israel and Serbia, even though they haven't finished splitting the company into a Kremlin-controlled Russian one and an international one.

PS: As far as I can see, the only contributor in the repository from Yandex is located in Hungary, not Russia.


Bing says:

    I could not find any specific information on how these sanctions affect academic research and cooperation between the US and Russia. Some sources suggest that some academics have canceled conferences, joint projects, and funding with Russian institutions as a form of self-imposed sanctions, while others indicate that Russian students are still able to secure visas to study abroad. Therefore, the impact of the sanctions on academic research and cooperation may vary depending on the field, institution, and individual circumstances.


This is absolutely stunning work. Excited to try it out on my husband's homelab.


What's the currently best-performing LLM that one can run with this?


This is absolutely something I was running into with LLaMA. I'm curious if this potentially extends into that particular use case...


Readme updated 19 hours ago.

Anything new with the codebase?



