Everyone knows the HF leaderboard is actively being gamed (Goodhart's law strikes again), but the guy who wrote the Medium post actively works with models, and the tutorial is (clearly) pretty good.
The only leaderboard this model does well on is the HuggingFace LLM Leaderboard, which is known to be manipulated and prone to gross overfitting. The LMSYS Arena Leaderboard is a better representation of the best models.
Thank you!!! _And_ it has the proprietary models...insanely more useful.
It's a bug, not a feature, that the stock leaderboard ends up with endless fine-tunes, and as you point out, and as the article demonstrates, it's more about gaming the benchmark than about quality.
Even the Chatbot Arena shows that many freely available, open-source models beat some versions of GPT-3.5 and are within a stone's throw of the latest GPT-3.5.
Note that it only includes gpt-3.5-turbo (the current iteration), not the original gpt-3.5. It's not exactly a secret that "turbo" models are noticeably dumber than the originals, whatever OpenAI says. There's no free lunch - that's why it's so much cheaper and faster...
That said, we do have public 120b models now that genuinely feel better than the original gpt-3.5.
The holy grail remains beating gpt-4 (or even gpt-4-turbo). This seems to be out of reach on consumer hardware at least...
Same for the ChatGPT models post-launch (let's not talk about 11_02 :) )
-- and as long as we're freely asserting anecdotes, I work in the field and had a couple of years in before ChatGPT -- it most certainly is not a well-kept secret, or a secret, or true, or anything else other than standard post-millennial self-peasantization.
"Outright lie" is kinder toward the average reader via being more succinct, but usually causes explosive reactions because people take a while to come to terms with their ad-hoc knowledge via consuming commentary is fundamentally flawed, if ever.
That just goes to show how useless the rankings are in general. If you actually use it, you'll quickly notice that older GPT-4 models are noticeably better at tasks that require reasoning.
gpt-4-turbo also has an extremely annoying quirk where instead of producing a complete response, it tends to respond with "you can do it like this: blah blah ...; fill in the blanks as needed", where the blank is literally the most important part of the response. It can sometimes take 3-4 rounds to get it to actually do what it's supposed to do.
But it does produce useless output much faster indeed.
This isn't true, I'm sorry. That may be your experience with it but it's not about the model. Using it is my day job and I've never ever seen that language.
It's frustrating for both of us, I assume.
I'm tired of people asserting, fact-free, that it got worse because, don't you know, other people saw it got worse? And it did something bad the other day.
You're tired of the thing not doing the thing and you have observed it no longer does the thing. And you certainly shouldn't need to retain past prompts just to prove it.
In the speed/quality trade-off sense, there have /often/ been free lunches in many areas of computer science, where algorithmic improvements let us solve problems orders of magnitude faster. We don't fully understand what further improvements will be available for LLMs.
Free lunch here relates to pricing/speed, I would say, because gpt-4 and gpt-4-turbo are sold side by side. If gpt-4-turbo is cheaper, faster, and has a much larger context window, why would it make sense to also sell gpt-4... unless it's a marketing trick, or perhaps for backwards compatibility, which is also possible.
Using Mistral ft optimized 1218 7b q5km - very useful for basic queries/creative input. Often just as useful as ChatGPT, and it feels far more "real" to have it fully local. I don't want to depend entirely on one proprietary service for something as fundamentally useful as this!
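For reference, this is roughly how I run it - a minimal sketch with llama-cpp-python; the GGUF filename and the context/GPU settings are placeholders you'd adapt to your own download and hardware:

```python
# Minimal local-inference sketch using llama-cpp-python.
# The model path is a placeholder; point it at your own Q5_K_M GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-ft-optimized-1218.Q5_K_M.gguf",  # placeholder path
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "Suggest three angles for a short story about a lighthouse keeper."}],
    max_tokens=256,
    temperature=0.7,
)
print(out["choices"][0]["message"]["content"])
```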
I have spent more time than I would like to admit trying to get 3B-34B models working for "serious" usecases like RAG, code comprehension/generation and text summarization, across dozens of "leading" free LLMs, and none of them (not even Mixtral MoE or its derivatives) come close to gpt-3.5-turbo for consistency and depth, and there is no equal for gpt-4(-turbo).
LLM Leaderboards and benchmarks do not show the complete picture for YOUR specific usecase.
If you can post results, that would be extremely helpful. This is what I've been looking for - real-world performance rather than gamed leaderboards.
That's just it: It is highly dependent on your inputs, expectations and usecase. My observations below.
Mixtral is about average at summarization; not enough detail makes it into the summary. Passable for non-decision-bound workflows. Quantized models inject occasional spelling errors, so fall back to unquantized Mistral. Skip all the Mistral finetunes; only choose Instruct models.
There are no good local code comprehension/generation models (and that includes CodeLlama, DeepSeek, Starcoder, WizardCoder, Phind). You will spend as much time guiding/rerolling/correcting output as you would coding it yourself, assuming your task is nontrivial and exceeds 100 LoC. Code completion via continue.dev using local models did not yield good results, either.
No local models are good at agentic tasks (e.g., via AutoGen). You might have better luck rolling your own with function-calling via NexusRaven2 than using Agent frameworks with local LLMs.
Claude-Instant edges out gpt-3.5-turbo for longform summarization with its 100k context. It is what I use for https://HackYourNews.com because chunked/rolling summarization is noisy (rough sketch of the rolling approach after these notes).
For every non-hobby task, I am switching to gpt-4-*, choosing between base, 32k, and turbo depending on speed/correctness/cost/length tradeoffs. They are not perfect, but there is no competition.
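The rolling approach I'm avoiding looks roughly like this - a sketch only, where `summarize` is a stand-in for whatever model call you use and the character-based chunking is deliberately naive:

```python
# Sketch of chunked/rolling summarization. Each pass folds the running summary
# and the next chunk together, so omissions and errors compound -- hence the noise.
def rolling_summarize(text, summarize, chunk_chars=8000):
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    summary = ""
    for chunk in chunks:
        summary = summarize(
            f"Current summary:\n{summary}\n\n"
            f"New text:\n{chunk}\n\n"
            "Rewrite the summary so it also covers the new text."
        )
    return summary
```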
Yes, it is usecase-specific, but that's also why it's crucial for real-world trial results to be shared, so that there's a better understanding that's qualitatively accurate.
If the end goal is document classification and/or semantic search, the Reexpress Fast I model (3.2 billion parameters) is a good choice. The key is that it produces reliable uncertainty estimates (for classification), so you know if you need a larger (or alternative) model. (In fact, an argument can be made that since the other models don't produce such uncertainty estimates, they are not ideal for serious use cases without adding an additional mechanism, such as ensembling with the Reexpress model.)
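To make the routing idea concrete, here is a minimal sketch - not the Reexpress API; `classify_with_uncertainty`, `fallback_model`, and the threshold are hypothetical stand-ins:

```python
# Sketch: use a small model's uncertainty estimate to decide whether to
# escalate to a larger (or alternative) model. Both callables are hypothetical.
def route(document, classify_with_uncertainty, fallback_model):
    label, confidence = classify_with_uncertainty(document)  # e.g. a calibrated probability
    if confidence >= 0.9:                # threshold chosen arbitrarily for illustration
        return label
    return fallback_model(document)      # escalate the low-confidence cases
```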
It seems that a lot of the leaders are the result of mixing finetunes, which really makes me think that test sets leaked into the training data.
To reply to myself.
I am not saying that this model did, or that even if it did, that it was done intentionally. ML is hard, and there are so many ways for data to leak.
What I AM surprised about is that it is not clear what CultriX did that was better than what a ton of others have done.
It's trained with a LoRA adaptor, so it's either an error or they're also counting the adaptor's parameters. They use an inner LoRA dimension (rank) of 16, however, so it's unlikely that that's the reason (too small).
An important point to keep in mind: at inference, LoRA adaptors are meant to be merged into the base model, so they don't affect inference speed. (You need to do the merge explicitly, though, if you train your own adaptor.)
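If you're using the peft library, the merge is a one-liner once training is done - a sketch where the model name and adaptor path are placeholders:

```python
# Sketch: merge a trained LoRA adaptor into its base model so inference
# runs at plain base-model speed. Names/paths are placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("base-model-name")
model = PeftModel.from_pretrained(base, "path/to/lora-adaptor")
merged = model.merge_and_unload()   # folds adaptor weights into the base weights
merged.save_pretrained("merged-model")
```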
I'm as quick to jump on the Medium roastwagon as anyone else, but I will say Towards Data Science has a surprising number of quality tutorials covering the full spectrum of data science tasks.
That and they have great SEO, you basically can't avoid them.