Essentially, you lose some accuracy, there might be some weird answers, and the model is probably more likely to go off the rails and hallucinate. But the quality loss is smaller the more parameters you have, so for very large models the differences might be negligible. Also, this is the cost of inference only. Training is a whole other beast and requires much more power.
Still, we are looking at GPT-3-level performance on one server rack. That says something when, less than a year ago, such AI was literally magic and only ran in a massive datacenter. Bandwidth and memory size are probably, to my ignorant mind, easier to increase than raw compute, so maybe we will soon actually have "smart" devices.
The dropdown at the top selects which comparison: Falcon compares GGML quants, Vicuna compares bitsandbytes. I have some more comparisons planned; feel free to open an issue if you'd like to see something specific: https://github.com/the-crypt-keeper/can-ai-code
I want to see examples like "here's a prompt against a model and the same prompt against a quantized version of that model, see how they differ."
We suck at evaluating and comparing models, imo. There are metrics and evaluation tasks, but it's still very subjective.
The closer we get to assessing human-like performance, the tougher it is, because the task becomes more subjective and less deterministic by nature. I don't know the answer, but I do know that the metrics we have aren't easy to translate into any idea of the kind of performance on some specific thing you might want to do with the model.
Not mathematically, at the very least. Perplexity is the best measure we have for how a model is doing empirically over a test dataset (both pre- and post-quantization). It is usable enough to be, at the very least, the final word on how different quantization methods perform.
Subjective ratings are different, but for compression things are quite well defined.
> some specific thing you might want to do with the model.
I think this right here is the answer to measuring and comparing model performance.
Instead of trying to compare models holistically, we should be comparing them for specific problem sets and use cases... the same as we compare humans against one another.
Using people as an example, a hiring manager doesn't compare 2 people holistically, they compare 2 people based on how well they're expected to perform a certain task or set of tasks.
We should be measuring and comparing models discriminately rather than holistically.
You could have two models answer 100 questions the same way, and differ on the 101st. They’re unpredictable by nature - if we could accurately predict them we’d just use the predictions instead.
Even at T=0 and run deterministically, the answers still have "randomness" with respect to the exact prompt used. Change wording slightly and you've introduced randomness again even if the meaning doesn't change. It would be the same for a person.
For an LLM, a trivial change in wording could produce a big change in the answer, same as running it again with a new random seed. "Prompt engineering" is basically overfitting if not approached methodically. For example, it would be interesting to try deliberate permutations of an input that don't change the meaning and see how the answer changes, as part of an evaluation.
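A minimal sketch of that kind of permutation probe, assuming a hypothetical generate() placeholder standing in for whatever deterministic (temperature 0) model call you use:

    # Sketch: probe how sensitive a model is to meaning-preserving rewordings.
    # `generate` is a hypothetical placeholder for your actual inference call
    # (llama.cpp bindings, an HTTP API, etc.), run at temperature 0.
    def generate(prompt: str) -> str:
        return "42"  # canned answer so the sketch runs; swap in the real model call

    variants = [
        "What is six times seven?",
        "What do you get when you multiply six by seven?",
        "Compute 6 * 7.",
    ]

    answers = {p: generate(p) for p in variants}
    print(f"{len(set(answers.values()))} distinct answers from {len(variants)} rewordings")
    for prompt, answer in answers.items():
        print(f"{prompt!r} -> {answer!r}")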
But if T=0 and you use the exact same input (not a single word or position changes), do you get the same output? Reading your response, it sounds like the randomness comes from even slight changes to the input.
As a sibling comment mentioned, threading on a GPU is not automatically deterministic, so you could get randomness from there, although I can't think of anything in the forward pass of a normal LLM that would depend on execution order. So yes, you should get the same output; it's basically just matrix multiplication. There may be some implementation details I don't know about that would add other variability, though.
Look at this minimal implementation (Karpathy's) of LLaMA: the only randomness is in the "sample" function, which comes in at non-zero temperature; otherwise it's easy to see everything is deterministic: https://github.com/karpathy/llama2.c/blob/master/run.c
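For intuition, a toy numpy sketch (not the run.c code itself) of that final sampling step, showing how it collapses to a deterministic argmax at temperature 0:

    import numpy as np

    def pick_next_token(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
        # Toy version of the last step of an LLM forward pass.
        if temperature == 0.0:
            # Greedy decoding: the largest logit always wins, no randomness at all.
            return int(np.argmax(logits))
        # Softmax with temperature, then sample: this is where randomness enters.
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return int(rng.choice(len(logits), p=probs))

    logits = np.array([1.0, 3.5, 0.2, 3.4])
    rng = np.random.default_rng(0)
    print(pick_next_token(logits, 0.0, rng))  # always token 1
    print(pick_next_token(logits, 1.0, rng))  # depends on the rng state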
Otoh, with MoE like GPT-4 has, it can still vary at zero temperature.
Some GPU operations give different results depending on the order in which they are done. This happens because floating point numbers are approximations, and addition loses associativity. Requiring a strict order causes a big slowdown.
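A quick way to see that non-associativity in plain Python (the same effect applies to float32/float16 accumulations on a GPU):

    a, b, c = 1e20, -1e20, 1.0
    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0: the 1.0 is swallowed by the huge intermediate value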
Basically, the more you quantize with K-quant, the dumber the model gets. A 2-bit LLaMA 13B quant, for instance, is about as dumb as 7B F16, but the dropoff from 3 to 6 bits is not nearly as severe.
FWIW, here's why perplexity is useful: it's a measure of uncertainty that can easily be compared between different sources. Perplexity k is like the uncertainty of a roll of a k-sided die. Here I think perplexity is per-token, and is measuring the likelihood of re-generating the strings in the test set.
So for the reduction in size given by (q4 -> q3), you get a 2% increase in uncertainty. Now, that doesn't tell you which specific capabilities get worse (or even whether that's really considered a huge or a tiny change), but it is a succinct description of the general performance decrease.
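To make that concrete, a minimal sketch of per-token perplexity, assuming you already have the per-token log-probabilities each model assigns to a test set (the numbers below are made up for illustration):

    import math

    def perplexity(token_logprobs: list[float]) -> float:
        # Perplexity is exp of the mean negative log-likelihood per token.
        # A value of k means the model is, on average, as uncertain about the
        # next token as a fair roll of a k-sided die.
        nll = -sum(token_logprobs) / len(token_logprobs)
        return math.exp(nll)

    # Hypothetical log-probs from a full-precision model and a quantized one
    # over the same test tokens; real values come from running each model
    # across the evaluation dataset.
    fp16_logprobs = [-1.61, -0.22, -2.30, -0.51]
    q3_logprobs = [-1.70, -0.25, -2.40, -0.55]

    print(perplexity(fp16_logprobs))  # lower = less "surprised" by the test set
    print(perplexity(q3_logprobs))    # the quantized model is slightly more surprised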
If you want a more fine-grained explanation of how generation of certain types of text gets clobbered, you would probably need to prepare datasets comprised of that type of string and measure the perplexity delta on that subset, i.e.
dperplexity/dquantization(typed_inputs).
I think it might be more difficult to get a comprehensive sense of the qualitative differences in the other direction, e.g.
The problem is that it's not consistent enough for a good demo. Not just between two different models: even two different fine-tunes of the same base model may be affected by quantization in wildly different ways. It can range from hardly making a difference to complete garbage output.
Just the other day someone published ARC comparison results for different quants as well as the code for the harness that they used to easily run lm-eval against quants to your heart's content: https://www.reddit.com/r/LocalLLaMA/comments/15rh3op/effects...
> Still, we are looking at GPT-3-level performance on one server rack. That says something when, less than a year ago, such AI was literally magic and only ran in a massive datacenter.
I'm not sure what you mean by this. You've always been able to run GPT3 on a single server (your typical 8xA100).
8xA100 is technically a single server, but I think OP is talking about affordable and plentiful CPU hosts, or even relatively modest single GPU instances.
DGX boxes do not grow on trees, especially these days
Because 175B parameters (350GB for the weights in FP16, call it a bit over 400GB for actual inference) fit very comfortably on 8xA100 (640GB VRAM total).
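(The arithmetic: 175e9 parameters × 2 bytes per FP16 weight ≈ 350GB; the extra headroom for inference is presumably KV cache and activations.)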
And basically all of these servers will have 8xA100 (maybe 4xA100). Nobody bothers with a single A100 (of course, in a VM you might only have access to one).
Wtf does HGX mean? God, enough with the acronyms, people.
Please take an extra ten seconds to speak in proper human language!
You could save on the world's carbon footprint by reducing the number of times humans have to search for "what is NVIDIA HGX" (or is it "what is AMD HGX"?) and then subsequently visit the websites to see if that's right or not.
Yes, there is a logarithmically-bound (or exponential if you're viewing it from another angle) falloff in the information lost in quantization. This comes from the non-uniform "value" of different weights. We can try to get around them with different methods, but at the end of the day, some parameters just hurt more to squeeze.
What is insane though is how far we've taken it. I remember when INT8 from NVIDIA seemed like a nigh-pipedream!
Could this be why people have recently been saying they see more weird results from ChatGPT? Maybe OpenAI is trying out different quantization methods for the GPT-4 model(s) to reduce the resource usage of ChatGPT.
I'd be more inclined to believe that they're dropping down to gpt-3.5-turbo based on some heuristic, and that's why sometimes it gives you "dumber" responses. If you can serve 5/10 requests with 3.5 by swapping only the "easy" messages out, you've just cut your costs by nearly half (3.5 is like 5% of the cost of 4).
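Rough math, assuming GPT-4 costs 1 unit per request and 3.5-turbo costs 0.05: routing half the traffic to 3.5 gives 0.5 × 1 + 0.5 × 0.05 = 0.525, i.e. roughly a 47% cost reduction.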
Serving me ChatGPT 3.5 when I'm explicitly requesting ChatGPT 4 sounds like a very bad move? They're not marketing it as "ChatGPT Basic" and "ChatGPT Pro".
"But what we found with these neural networks is, if you use 32 bits, they're just fine. And then you use 16 bits, and they're just fine. And then with eight bits, you need to use a couple of tricks and then it's just fine.
And now we find if you can go to four bits, and for some networks, that's much easier. For some networks, it's much more difficult, but then you need a couple more tricks. And so it seems they're much more robust."
That will be really interesting for FPGAs, because the current ones are basically oceans of 4-bit computers.
Yes, you can gang together a pair of 4LUTs to make a 5LUT, and a pair of 5LUTs to make a 6LUT, but you halve your parallelism each time you do that. OTOH you can't turn a 4LUT into a pair of 3LUTs on any currently-manufactured FPGA. It's simply the "quantum unit" of currently-available hardware -- and it's been that way for at least 15 years (Altera had 3LUTs back in the 2000s). There's no fundamental reason for the number 4 -- but it is a very, very deep local minimum for the current (non-AI) customers of FPGA vendors.
Interesting, how would that work? Are there any well-known examples?
Is it: the weights all happen to be where float is sparse, so quantization ends up increasing fidelity? Or is it more of a “worse is better” dropout-type situation?
I suspect it works as regularisation of the network. It usually happens when you train with quantisation instead of doing post-training quantisation, and I haven't seen that done with LLMs yet.
In my experience, it usually means Small Block Chevy, but in certain communities it means Single Board Computer, an older way of referring to devices like the Raspberry Pi.
I would elaborate and say: anywhere that your computer is resource-constrained (RAM, processing power) but you still want to make up articles for your Amazon Affiliate blog.
In this context I'd assume SBC means Single Board Computer, such as a Raspberry Pi or one of the many imitators. The article itself mentions running LLaMa on a Pi 4.
The interesting implication of running an LLM on a single board computer is that it's a proof of concept for an LLM on a smartphone. If you have a model that can produce useful results on a Raspberry Pi, you have something that could potentially run on hundreds of millions of smartphones. I'm not sure what the use case is for running an LLM on your phone instead of in the cloud, but it opens up some interesting possibilities. It depends just how useful such a small LLM could be.
https://oobabooga.github.io/blog/posts/perplexities/