Best 7B LLM on leaderboards made by an amateur following a medium tutorial (huggingface.co)
181 points by Der_Einzige on Jan 5, 2024 | hide | past | favorite | 49 comments


If only all people and companies were as honest as this guy about how much of their success they owe to luck and others!


If only VCs were as smart...


And I thought I was being unrealistic


Everyone knows the HF leaderboard is actively being gamed (Goodhart's law strikes again), but the guy who wrote the Medium post is actively doing stuff with models, and the tutorial is (clearly) pretty good.


Well, "Amateur following tutorial games HF leaderboard to become the #1 model" seems just as impressive to me.


The only leaderboard this model is good on is the HuggingFace LLM Leaderboard, which is known to be manipulated and a victim of gross overfitting. The LMSYS Arena leaderboard is a better representation of the best models.


Thank you!!! _And_ it has the proprietary models...insanely more useful.

It's a bug, not a feature, that the stock leaderboard ends up with endless fine-tunes, and as you point out (and as the article demonstrates), it's measuring something other than quality.


Even that chatbot arena shows that many freely available, open-source models are better than some versions of GPT-3.5 and within a stone's throw of the latest GPT-3.5.


Note that it only includes gpt-3.5-turbo (the current iteration), not the original gpt-3.5. It's not exactly a secret that "turbo" models are noticeably dumber than the originals, whatever OpenAI says. There's no free lunch - that's why it's so much cheaper and faster...

That said, we do have public 120b models now that genuinely feel better than the original gpt-3.5.

The holy grail remains beating gpt-4 (or even gpt-4-turbo). This seems to be out of reach on consumer hardware at least...


Um, the LMSYS Elo ranking clearly shows that GPT-4 Turbo is better than GPT-4.


Same for the ChatGPT versions post-launch (let's not talk about 11_02 :) )

-- and as long as we're asserting anecdotes freely, I work in the field and had a couple of years in before ChatGPT -- it most certainly is not a well-kept secret, or a secret, or true, or anything else other than standard post-millennial self-peasantization.

"Outright lie" is kinder toward the average reader via being more succinct, but usually causes explosive reactions because people take a while to come to terms with their ad-hoc knowledge via consuming commentary is fundamentally flawed, if ever.


That just goes to show how useless the rankings are in general. If you actually use it, you'll quickly notice that older GPT-4 models are noticeably better at tasks that require reasoning.

gpt-4-turbo also has an extremely annoying quirk where instead of producing a complete response, it tends to respond with "you can do it like this: blah blah ...; fill in the blanks as needed", where the blank is literally the most important part of the response. It can sometimes take 3-4 rounds to get it to actually do what it's supposed to do.

But it does produce useless output much faster indeed.


This isn't true, I'm sorry. That may be your experience with it but it's not about the model. Using it is my day job and I've never ever seen that language.

It's frustrating for both of us, I assume.

I'm tired of people asserting, fact-free, that it got worse, because don't you know, other people saw it got worse? And it did something bad the other day.

You're tired of the thing not doing the thing and you have observed it no longer does the thing. And you certainly shouldn't need to retain past prompts just to prove it.


Note that 'no free lunch' has a specific meaning with no relation whatsoever to model size/quality trade-offs...

https://en.wikipedia.org/wiki/No_free_lunch_theorem

In the speed/quality trade-off sense, there have /often/ been free lunches in many areas of computer science, where algorithmic improvements let us solve problems orders of magnitude faster. We don't fully understand what further improvements will be available for LLMs.


That phrase comes from the more general adage though.

https://en.wikipedia.org/wiki/No_such_thing_as_a_free_lunch


"Free lunch" here relates to pricing/speed, I would say, because gpt-4 and gpt-4-turbo are sold side by side. If gpt-4-turbo is cheaper, faster, and has a much larger context window, why would it make sense to also sell gpt-4... unless it's a marketing trick, or perhaps it's kept for backwards compatibility, which could also be the case.


It's interesting that despite that, phi-2 is still way out in front of the 3B set on HF. I was convinced something would have caught up by now.


Nobody really trusts the leaderboards anymore. They're overfitted, and in some cases there's just outright cheating by training against common test sets.

https://twitter.com/karpathy/status/1737544497016578453


Are we sure the tutorial was medium? It might have been quite good, or at least above average. Ba dum tss.

“medium” should be capitalised in the title, as it refers to the blogging platform.

https://medium.com


Agreed, because I initially read it as "mediocre" rather than the brand.


And here I thought he was helped by a crystal ball...


Very interesting that this was managed with a course that's six months out of date.


On a free Colab instance...


The GPU he used is not free; you need to buy compute units for these "premium GPUs".

https://colab.research.google.com/signup


How much more low hanging fruit is there?


Is it possible that they fine-tuned on the leaderboard test set?


Who is using 7Bs in a serious manner, instead of OpenAI, in a cost-efficient way?


1. Fine-tune to specific tasks
2. Not subject to OpenAI's censorship
3. Can run locally instead of on cloud compute (offline)
4. Experimentation


Not sure what you’re responding to


Using mistral-ft-optimized-1218 7B (Q5_K_M quant); very useful for basic queries and creative input. Often just as useful as ChatGPT, and it feels far more "real" to have it fully local. I don't want to depend entirely on one proprietary service for something as fundamentally useful as this!
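
If you want to try the same setup, a minimal sketch with llama-cpp-python looks roughly like this; the GGUF filename below is just a placeholder for whichever quant you downloaded:

    # Minimal sketch: run a quantized local model with llama-cpp-python.
    # The model_path is a placeholder for your downloaded GGUF file.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./mistral-ft-optimized-1218.Q5_K_M.gguf",  # placeholder path
        n_ctx=4096,       # context window
        n_gpu_layers=-1,  # offload all layers to the GPU if one is available
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Give me three creative prompts for a short story."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])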


I have spent more time than I would like to admit trying to get 3B-34B models working for "serious" use cases like RAG, code comprehension/generation, and text summarization, across dozens of "leading" free LLMs, and none of them (not even Mixtral MoE or its derivatives) come close to gpt-3.5-turbo for consistency and depth, and there is no equal to gpt-4(-turbo).

LLM leaderboards and benchmarks do not show the complete picture for YOUR specific use case.


If you can post results, that would be extremely, extremely helpful. This is what I've been looking for: real-world performance rather than gamed leaderboards.


That's just it: it is highly dependent on your inputs, expectations, and use case. My observations are below.

Mixtral is about average at summarization; not enough detail makes it into the summary. Passable for non-decision-bound workflows. Quantized models inject occasional spelling errors, so I fall back on unquantized Mistral. Skip all the Mistral fine-tunes; only choose Instruct models.

There are no good local code comprehension/generation models (and that includes CodeLlama, DeepSeek, Starcoder, WizardCoder, Phind). You will spend as much time guiding/rerolling/correcting output as coding it yourself, assuming your task is nontrivial and exceeds 100+ LoC. Code completion via continue.dev using local models did not yield good results, either.

No local models are good at agentic tasks (e.g., via AutoGen). You might have better luck rolling your own with function-calling via NexusRaven2 than using Agent frameworks with local LLMs.
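
To illustrate what "rolling your own" can mean here, a generic sketch (this is not NexusRaven's actual prompt format, and get_weather is a made-up tool): show the model your tool signatures, have it emit a single call expression, then parse and dispatch it from a whitelist.

    # Generic function-calling sketch (not NexusRaven-specific; made-up tool):
    # the model is shown a tool signature and asked to emit one call expression,
    # which we parse with ast and dispatch from a whitelist instead of eval().
    import ast

    def get_weather(city: str) -> str:  # made-up example tool
        return f"Sunny in {city}"

    TOOLS = {"get_weather": get_weather}

    def dispatch(call_str: str):
        """Parse something like get_weather(city='Paris') and run it safely."""
        node = ast.parse(call_str.strip(), mode="eval").body
        if not (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in TOOLS):
            raise ValueError("model produced an unexpected call")
        args = [ast.literal_eval(a) for a in node.args]
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        return TOOLS[node.func.id](*args, **kwargs)

    # call_str would come from your local LLM, prompted with the tool signatures
    print(dispatch("get_weather(city='Paris')"))  # -> Sunny in Paris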

Claude-Instant edges out gpt-3.5-turbo for longform summarization with its 100k context. It is what I use for https://HackYourNews.com because chunked/rolling summarization is noisy.
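
(For context, "chunked/rolling summarization" means something like the sketch below; summarize_with_llm is a stand-in for whatever model you actually call. The noise comes from errors in early chunk summaries being carried forward into every later step.)

    # Rolling (chunked) summarization sketch: each chunk is summarized together
    # with the running summary, so early mistakes compound across chunks; a
    # single long-context pass avoids that. summarize_with_llm is a stand-in.
    def summarize_with_llm(prompt: str) -> str:
        raise NotImplementedError  # call whichever LLM/API you actually use

    def rolling_summary(document: str, chunk_size: int = 8000) -> str:
        chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
        summary = ""
        for chunk in chunks:
            summary = summarize_with_llm(
                f"Summary so far:\n{summary}\n\nNew text:\n{chunk}\n\n"
                "Rewrite the summary so it also covers the new text."
            )
        return summary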

For every non-hobby task, I am switching to gpt-4-*, choosing between base, 32k, and turbo depending on speed/correctness/cost/length tradeoffs. They are not perfect, but there is no competition.


Would you mind sharing how you got access to Claude? I applied a long time ago and have never heard back from them.


That's really useful insight.

Yes, it is use-case specific, but that's also why it's crucial for real-world trial results to be shared, so that there's a better understanding that's qualitatively accurate.

P.S: hackyournews.com is great!!


And this is my drawn-out experience as well.


If the end goal is document classification and/or semantic search, the Reexpress Fast I model (3.2 billion parameters) is a good choice. The key is that it produces reliable uncertainty estimates (for classification), so you know if you need a larger (or alternative) model. (In fact, an argument can be made that since the other models don't produce such uncertainty estimates, they are not ideal for serious use cases without adding an additional mechanism, such as ensembling with the Reexpress model.)
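
(For illustration, the fallback pattern being described looks roughly like the sketch below; classify_small and classify_large are hypothetical stand-ins, not the actual Reexpress API.)

    # Generic selective-prediction sketch (hypothetical functions, not the
    # Reexpress API): trust the small model only when its calibrated
    # confidence clears a threshold, otherwise escalate to a larger model.
    def classify_small(text: str) -> tuple[str, float]:
        raise NotImplementedError  # returns (label, calibrated_confidence)

    def classify_large(text: str) -> str:
        raise NotImplementedError  # larger or alternative fallback model

    def classify(text: str, threshold: float = 0.9) -> str:
        label, confidence = classify_small(text)
        return label if confidence >= threshold else classify_large(text)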


It seems that a lot of the leaders are the results of mixing finetunes, which really makes me think that there was a leak of test sets into the training data.
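
(For anyone unfamiliar: "mixing finetunes" usually means weight-space merging, in the style of mergekit. A naive linear merge is sketched below, which is also why anything baked into one parent's weights, leaked test data included, ends up averaged into the child.)

    # Rough sketch of a naive linear weight merge of two fine-tunes of the same
    # base model (tools like mergekit use fancier methods such as SLERP/TIES).
    # Whatever one parent memorized is averaged straight into the merged model.
    import torch
    from transformers import AutoModelForCausalLM

    def linear_merge(model_a_id: str, model_b_id: str, alpha: float = 0.5):
        a = AutoModelForCausalLM.from_pretrained(model_a_id, torch_dtype=torch.float16)
        b = AutoModelForCausalLM.from_pretrained(model_b_id, torch_dtype=torch.float16)
        b_state = b.state_dict()
        merged = {name: alpha * p + (1.0 - alpha) * b_state[name]
                  for name, p in a.state_dict().items()}
        a.load_state_dict(merged)
        return a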


To reply to myself: I am not saying that this model did, or that even if it did, it was done intentionally. ML is hard, and there are so many ways for data to leak.

What I AM surprised about is that it is not clear what CultriX did that was better than what a ton of others have done.

Any clues?


"Leaderboards", meh.

This tweet is still very true: https://twitter.com/karpathy/status/1737544497016578453


Goodhart's law would seem to apply:

https://en.wikipedia.org/wiki/Goodhart%27s_law

Nevertheless, scoring so well on this benchmark is an accomplishment, though I'm not in a position to evaluate how significant it is.


Nothing beats an actual human spending a couple hours with the model when it comes to meaningful evaluation.


That's why the HuggingFace LLM arena exists.


Good thread I saw on Reddit about this a few days ago.

https://www.reddit.com/r/LocalLLaMA/comments/18xbevs/open_ll...

Many top models are overfitting to the top leaderboards rather than being actually useful.


It seems so easy for just one poisoned model (trained on test data) to infect a ton of fine-tune model mixtures... it could happen without intention.

Under this scenario, wouldn't the ones that achieve the top performance be the most closely related to the poisoned model?


Why does huggingface list this as a 9B model?


It's trained with a LoRA adapter, so it's either an error or they also count the adapter. They use an inner LoRA dimension of 16, however, so it's unlikely that that's the reason (it's too small).

An important point to keep in mind is that at inference, LoRA adapters are meant to be merged into the base model, so they don't affect inference speed. (You need to do this explicitly, though, if you train your own adapter.)
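
With PEFT, that merge step looks roughly like this (model and adapter paths are placeholders):

    # Sketch: fold a LoRA adapter into its base model with PEFT so inference
    # runs at plain base-model speed. IDs/paths below are placeholders.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
    model = PeftModel.from_pretrained(base, "path/to/lora-adapter")
    merged = model.merge_and_unload()  # merges adapter weights into base layers
    merged.save_pretrained("merged-model")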


I'm as quick to jump on the Medium roastwagon as anyone else, but I will say Towards Data Science has a surprising number of quality tutorials covering the full spectrum of data science tasks.

That, and they have great SEO; you basically can't avoid them.


I avoid them easily using Kagi :))



