
In case anyone is wondering, yes, there is a cost when a model is quantized.

https://oobabooga.github.io/blog/posts/perplexities/

Essentially, you lose some accuracy: you may get some weird answers, and the model is probably more likely to go off the rails and hallucinate. But the quality loss shrinks as the parameter count grows, so for very large models the difference may be negligible. Also, this is the cost of inference only; training is a whole other beast and requires much more power.
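To make the "lose some accuracy" part concrete, here is a minimal NumPy sketch (not the method from the linked post) of naive per-tensor int8 quantization, with a random matrix standing in for real weights. Real schemes such as llama.cpp's K-quants use per-block scales precisely to keep this error down.

    # Naive symmetric int8 quantization: one scale for the whole tensor,
    # round each weight to the nearest representable value.
    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=(4096, 4096)).astype(np.float32)  # stand-in fp32 weights

    scale = np.abs(w).max() / 127.0            # per-tensor scale
    w_q = np.round(w / scale).astype(np.int8)  # 8-bit weights
    w_hat = w_q.astype(np.float32) * scale     # dequantized for inference

    print("mean absolute rounding error:", np.abs(w - w_hat).mean())

Every weight gets nudged by up to half a quantization step, and the fewer bits you use, the bigger the step.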

Still, we are looking at GPT3 level of performance on one server rack. That says something when, less than a year ago, such AI was literally magic and only ran in a massive datacenter. Bandwidth and memory size are probably, to my ignorant mind, easier to increase than raw compute, so maybe we will soon actually have "smart" devices.



I was hoping that link would answer the question that's been bugging me for months: what are the penalties that you pay for using a quantized model?

Sadly it didn't. It talked about "perplexities" and showed some floating point numbers.

I want to see examples like "here's a prompt against a model and the same prompt against a quantized version of that model, see how they differ."
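For what it's worth, here is a rough sketch of that kind of side-by-side comparison, assuming the Hugging Face transformers + bitsandbytes stack (the model id and prompt are just placeholders):

    # Same prompt, greedy decoding, fp16 vs 4-bit quantized weights.
    # Assumes transformers, accelerate and bitsandbytes are installed and a GPU is available.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-2-7b-hf"  # placeholder
    prompt = "Explain what quantization does to a neural network."

    tok = AutoTokenizer.from_pretrained(model_id)
    inputs = tok(prompt, return_tensors="pt").to("cuda")

    for label, kwargs in [
        ("fp16", dict(torch_dtype=torch.float16)),
        ("4-bit", dict(quantization_config=BitsAndBytesConfig(load_in_4bit=True))),
    ]:
        model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", **kwargs)
        out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
        print(f"--- {label} ---")
        print(tok.decode(out[0], skip_special_tokens=True))
        del model
        torch.cuda.empty_cache()

Diffing the two outputs over a handful of prompts gives exactly the "see how they differ" view; it just doesn't aggregate into a single number the way perplexity does.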


I have several sets of quant comparisons posted on my HF spaces, the caveat is my prompts are all "English to code": https://huggingface.co/spaces/mike-ravkine/can-ai-code-compa...

The dropdown at the top selects which comparison: Falcon compares GGML quants, Vicuna compares bitsandbytes. I have some more comparisons planned, feel free to open an issue if you'd like to see something specific: https://github.com/the-crypt-keeper/can-ai-code


  I want to see examples like "here's a prompt against a model and the same prompt against a quantized version of that model, see how they differ."

We suck at evaluating and comparing models imo. There are metrics and evaluation tasks, but it's still very subjective.

The closer we get to assessing human like performance, the tougher it is, because it becomes more subjective and less deterministic by the nature of the task. I don't know the answer, but I know that for the metrics we have it's not so easy to translate them into any idea about the kind of performance on some specific thing you might want to do with the model.


Not mathematically, at the very least. Perplexity is a translation of the best empirical measure we have of how a model is doing over a test dataset (both pre- and post-quantization). It is enough to serve, usably, as the final word on how different quantization methods perform.

Subjective ratings are different, but for compression things are quite well defined.


> some specific thing you might want to do with the model.

I think this right here is the answer to measuring and comparing model performance.

Instead of trying to compare models holistically, we should be comparing them for specific problem sets and use cases... the same as we compare humans against one another.

Using people as an example, a hiring manager doesn't compare 2 people holistically, they compare 2 people based on how well they're expected to perform a certain task or set of tasks.

We should be measuring and comparing models discriminately rather than holistically.


You could have two models answer 100 questions the same way, and differ on the 101st. They’re unpredictable by nature - if we could accurately predict them we’d just use the predictions instead.


(Stupid question) are models still non-deterministic if you set the temperature to zero?

Would setting the temperature to zero degrade the quality of response?


Even at T=0 and run deterministically, the answers still have "randomness" with respect to the exact prompt used. Change wording slightly and you've introduced randomness again even if the meaning doesn't change. It would be the same for a person.

For an LLM, a trivial change in wording could produce a big change in answer, same as running it again with a new random seed. "Prompt engineering" is basically overfitting if not approached methodically. For example, it would be interesting to try deliberate permutations of an input that don't change the meaning and see how the answer changes as part of an evaluation.
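A rough sketch of such a permutation check (the `generate` function is a placeholder for whatever model call is being evaluated, and the paraphrases are made up):

    # Paraphrase-robustness check: several wordings of the same question
    # that should not change the meaning, compared at temperature 0.
    from collections import Counter

    def generate(prompt: str) -> str:
        # Placeholder: swap in a real call to the model under test
        # (llama.cpp server, transformers pipeline, hosted API, ...).
        return "canberra"

    paraphrases = [
        "What is the capital of Australia?",
        "Name the capital city of Australia.",
        "Australia's capital is which city?",
    ]

    answers = [generate(p).strip().lower() for p in paraphrases]
    print(Counter(answers))  # a robust model concentrates on one answer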


But if T=0 and you use the exact same input (not a single word or position changes) do you get the same output? Reading your response it implies that the randomness is related to even slight changes.


As a sibling comment mentioned, threading on a GPU is not automatically deterministic, so you could get randomness from there, although I can't think of anything in the forward pass of a normal LLM that would depend on execution order. So yes, you should get the same output; it's basically just matrix multiplication. There may be some implementation details I don't know about that would add other variability though.

Look at this minimal implementation (Karpathy's) of LLaMA: the only randomness is in the "sample" function, which only comes into play at non-zero temperature; otherwise it's easy to see everything is deterministic: https://github.com/karpathy/llama2.c/blob/master/run.c

Otoh, with MoE like GPT-4 has, it can still vary at zero temperature.
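For intuition, here is a toy Python sketch of why T=0 removes the sampling randomness (mirroring the temperature/sampling step in run.c; the logits are made up):

    # Greedy (T=0) decoding vs temperature sampling over a toy logit vector.
    # At T=0 the choice collapses to argmax, so there is nothing to seed.
    import numpy as np

    rng = np.random.default_rng()
    logits = np.array([2.0, 1.5, 0.3, -1.0])  # made-up next-token logits

    def next_token(logits, temperature):
        if temperature == 0.0:
            return int(np.argmax(logits))            # deterministic
        p = np.exp(logits / temperature)
        p /= p.sum()                                 # softmax with temperature
        return int(rng.choice(len(logits), p=p))     # stochastic

    print([next_token(logits, 0.0) for _ in range(5)])  # always the same
    print([next_token(logits, 1.0) for _ in range(5)])  # varies between runs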


Some GPU operations give different results depending on the order they are done. This happens because floating point numbers are approximations and lose associativity. Requiring a strict order causes a big slowdown.
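A quick way to see the associativity point in plain Python (the same effect shows up when a parallel GPU reduction sums in a different order):

    # Floating point addition is not associative, so summation order matters.
    a, b, c = 1e20, -1e20, 1.0
    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0  (the 1.0 is absorbed before it can survive)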


Well, the same is true for people, and yet hiring managers still evaluate for specific tasks.


It makes the model dumber.

That seems simplistic, but it really is as simple as that. Naive 3-bit quantization will turn llama 7B into blubbering nonsense.

But llama.cpp quantization is good! I recommend checking out the graphs ikawrakow made for their K-quants implementation:

https://github.com/ggerganov/llama.cpp/pull/1684

Basically, the more you quantize with K-quant, the dumber the model gets. 2 bit llama 13B quant, for instance, is about as dumb as 7B F16, but the dropoff is not nearly as severe from 3-6 bits.


FWIW here's why perplexity is useful: it's a measure of uncertainty that can easily be compared between different sources. Perplexity k is like the uncertainty of a roll of a k-sided die. Here I think perplexity is per-token, and measures the likelihood of re-generating the strings in the test set.

e.g. take a look at these two rows:

    llama-65b.ggmlv3.q4_K_M.bin 4.90639 llama.cpp
    llama-65b.ggmlv3.q3_K_M.bin 5.01299 llama.cpp
So for the reduction in size given by (q4 -> q3), you get a 2% increase in the uncertainty. Now, that doesn't tell you which specific capabilities get worsened (or even if that's really considered a huge or tiny change), but it is a succinct description of general performance decreases.

If you want more fine-grained explanations of how generation of certain types of texts get clobbered, you would probably need to prepare datasets comprised of that type of string, and measure the perplexity delta on that subset. i.e.

    dperplexity/dquantization(typed_inputs).
I think it might be more difficult to get a comprehensive sense of the qualitative differences in the other direction, e.g.

    dtype/dquantization(all_outputs).
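For the curious, per-token perplexity is just exp(mean negative log-likelihood), so a table entry of ~4.9 means the model is, on average, about as uncertain as a fair 4.9-sided die at each token. A minimal sketch with made-up numbers:

    # Perplexity = exp(mean negative log-likelihood per token).
    import math

    token_logprobs = [-1.2, -0.3, -2.1, -0.7, -1.6]  # made-up per-token log-probs

    nll = -sum(token_logprobs) / len(token_logprobs)
    print(math.exp(nll))  # the per-token perplexity over this snippet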


The problem is that it's not consistent enough for a good demo. Not even two different models, but even two different fine tunes of the same base model may be wildly differently affected by quantization. It can range from making hardly a difference to complete garbage output.


I have been using nat.dev to compare quantized models and it works great.


Just the other day someone published ARC comparison results for different quants as well as the code for the harness that they used to easily run lm-eval against quants to your heart's content: https://www.reddit.com/r/LocalLLaMA/comments/15rh3op/effects...


It will be different for every use case; the only way to find out is spinning one up.


Would an answer of "there aren't many significant penalties" suffice?


> Still, we are looking at GPT3 level of performance on one server rack. That says something when, less than a year ago, such AI was literally magic and only ran in a massive datacenter.

I'm not sure what you mean by this. You've always been able to run GPT3 on a single server (your typical 8xA100).


8xA100 is technically a single server, but I think OP is talking about affordable and plentiful CPU hosts, or even relatively modest single GPU instances.

DGX boxes do not grow on trees, especially these days


Am I missing something or how do you know this? Also I think the OP was talking about a single card not multiple but that was just my reading.


Because 175B parameters (350GB for the weights in FP16, say a bit over 400GB for actual inference) fit very comfortably on 8xA100 (640GB of VRAM total).

And basically all servers will have 8xA100 (maybe 4xA100). Nobody bothers with a single A100 (of course in a VM you might have access to only one)


> And basically all servers will have 8xA100

for those wondering: no this is not the norm. My lab at CMU doesn't own any A100s (we have A6000s).


The servers the commenter is talking about are DGX machines from NVIDIA.

It doesn’t really make sense to BTO. What you gain economically you lose in the science you can do.

But nobody could have anticipated this.


you could also get HGX from any of the vendors.


Wtf does HGX mean? God enough with the acronyms people.

Please take an extra ten seconds to speak in proper human language!

You could save on the world's carbon footprint by reducing the number of times humans have to search for “what is NVIDIA HGX?” (or is it “what is AMD HGX”?) and then subsequently visit the websites to see if that's right or not.


What does Wtf mean? God enough with the acronyms people. /s


You got me there hahaha

However, there’s a difference between an acronym known to the broader public and some single-shot, context-specific one!


Whose norm? I assure you it's the norm. :)


> And basically all servers will have 8xA100 (maybe 4xA100). Nobody bothers with a single A100 (of course in a VM you might have access to only one)

Wishing or guessing something without actual experience doesn't make it true.


The effect is smaller than you think. 5-bit quantization has negligible performance loss compared to 16-bit: https://github.com/ggerganov/llama.cpp/pull/1684


This paper from last month has a method for acceptable 3-bit quantization and a start at 2-bit.

https://arxiv.org/abs/2307.13304


Yes, there is a logarithmically-bound (or exponential, if you're viewing it from the other direction) falloff in the information lost to quantization. This comes from the non-uniform "value" of different weights. We can try to get around it with different methods, but at the end of the day, some parameters just hurt more to squeeze.

What is insane though is how far we've taken it. I remember when INT8 from NVIDIA seemed like a nigh-pipedream!


Good blog post, shame the site has no RSS feed!


Could this be why people recently say they see more weird results in ChatGPT? Maybe OpenAI is trying out different quantization methods for the GPT4 model(s) to reduce resource usage of ChatGPT.


I'd be more inclined to believe that they're dropping down to gpt-3.5-turbo based on some heuristic, and that's why sometimes it gives you "dumber" responses. If you can serve 5/10 requests with 3.5 by swapping only the "easy" messages out, you've just cut your costs by nearly half (3.5 is like 5% of the cost of 4).
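Back-of-the-envelope, taking "5% of the cost" at face value:

    # Half the traffic at full GPT-4 price, half at 5% of it:
    relative_cost = 0.5 * 1.0 + 0.5 * 0.05
    print(relative_cost)  # 0.525, i.e. roughly a 47% saving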


Serving me ChatGPT 3.5 when I explicitly requested ChatGPT 4 sounds like a very bad move? They're not marketing it as "ChatGPT Basic" and "ChatGPT Pro".


Thank you! Is there a sweet spot with quantization? How much can you quantize for a given model type and size and still be useful?


Tim Dettmers recently (https://www.manifold1.com/episodes/ai-on-your-phone-tim-dett...):

"But what we found with these neural networks is, if you use 32 bits, they're just fine. And then you use 16 bits, and they're just fine. And then with eight bits, you need to use a couple of tricks and then it's just fine.

And now we find if you can go to four bits, and for some networks, that's much easier. For some networks, it's much more difficult, but then you need a couple more tricks. And so it seems they're much more robust."


> And now we find if you can go to four bits

That will be really interesting for FPGAs, because the current ones are basically oceans of 4-bit computers.

Yes, you can gang together a pair of 4LUTs to make a 5LUT, and a pair of 5LUTs to make a 6LUT, but you halve your parallelism each time you do that. OTOH you can't turn a 4LUT into a pair of 3LUTs on any currently-manufactured FPGA. It's simply the "quantum unit" of currently-available hardware -- and it's been that way for at least 15 years (Altera had 3LUTs back in the 2000s). There's no fundamental reason for the number 4 -- but it is a very, very deep local minimum for the current (non-AI) customers of FPGA vendors.


This is not generally true; sometimes quantisation can improve accuracy. I haven't seen that with LLMs yet though.


Interesting, how would that work? Are there any well-known examples?

Is it: the weights all happen to be where float is sparse, so quantization ends up increasing fidelity? Or is it more of a “worse is better” dropout-type situation?


I suspect it works as regularisation of the network. It usually happens when you train with quantisation instead of doing post-training quantisation, and I haven't seen that done with LLMs yet.


For image recognition it can sometimes be like that. My gut feeling is that lowering from fp32 to fp16 can get rid of some kind of overfitting or so.


Any use case for using the 7B model over the 13B, quantized?


Inference speed. Sometimes 7B is good enough for the task at hand, and using 13B just makes you wait longer.


SBC


Wtf does SBC mean? God enough with the acronyms people.


In my experience, it usually means Small Block Chevy, but in certain communities it means Single Board Computer, an older way of referring to devices like the Raspberry Pi.

I would elaborate and say: anywhere that your computer is resource-constrained (RAM, processing power) but you still want to make up articles for your Amazon Affiliate blog.


Single board computer makes sense. I wish folks would type things out.


In this context I'd assume SBC means Single Board Computer, such as a Raspberry Pi or one of the many imitators. The article itself mentions running LLaMa on a Pi 4.

The interesting implication of running an LLM on a single board computer is that it's a proof of concept for an LLM on a smartphone. If you have a model that can produce useful results on a Ras Pi, you have something that could potentially run on hundreds of millions of smartphones. I'm not sure what the use case is for running an LLM on your phone instead of the cloud, but it opens some interesting possibilities. It depends just how useful such a small LLM could be.



