Best 7B LLM on leaderboards made by an amateur following a medium tutorial (huggingface.co)
181 points by Der_Einzige on Jan 5, 2024 | hide | past | favorite | 49 comments


If only all people and companies were as honest as this guy about how much of their success they owe to luck and others!


If only VCs were as smart...


And I thought I was being unrealistic


Everyone knows the HF leaderboard is actively being gamed (Goodhart's law strikes again), but the guy who wrote the Medium post is actively doing stuff with models, and the tutorial is (clearly) pretty good.


Well, "Amateur following tutorial games HF leaderboard to become the #1 model" seems just as impressive to me.


The only leaderboard this model is good on is the HuggingFace LLM Leaderboard, which is known to be manipulated and a victim of gross overfitting. The LMSYS Arena leaderboard is a better representation of the best models.


Thank you!!! _And_ it has the proprietary models...insanely more useful.

It's a bug, not a feature, that the stock leaderboard ends up with endless fine-tunes, and as you point out (and as the article demonstrates), it's measuring something other than quality.


Even that chatbot arena shows that many freely available, open-source models are better than some versions of GPT-3.5 and within a stone's throw of the latest GPT-3.5.


Note that it only includes gpt-3.5-turbo (the current iteration), not the original gpt-3.5. It's not exactly a secret that "turbo" models are noticeably dumber than the originals, whatever OpenAI says. There's no free lunch - that's why it's so much cheaper and faster...

That said, we do have public 120b models now that genuinely feel better than the original gpt-3.5.

The holy grail remains beating gpt-4 (or even gpt-4-turbo). This seems to be out of reach on consumer hardware at least...


Um, the LMSYS Elo ranking clearly shows that GPT-4 Turbo is better than GPT-4.


Same for the ChatGPT versions post-launch (let's not talk about 11_02 :) )

-- and as long as we're asserting anecdotes freely, I work in the field and had a couple of years in before ChatGPT -- it most certainly is not a well-kept secret, or a secret, or true, or anything else other than standard post-millennial self-peasantization.

"Outright lie" is kinder toward the average reader via being more succinct, but usually causes explosive reactions because people take a while to come to terms with their ad-hoc knowledge via consuming commentary is fundamentally flawed, if ever.


That just goes to show how useless the rankings are in general. If you actually use it, you'll quickly notice that older GPT-4 models are noticeably better at tasks that require reasoning.

gpt-4-turbo also has an extremely annoying quirk where instead of producing a complete response, it tends to respond with "you can do it like this: blah blah ...; fill in the blanks as needed", where the blank is literally the most important part of the response. It can sometimes take 3-4 rounds to get it to actually do what it's supposed to do.

But it does produce useless output much faster indeed.


This isn't true, I'm sorry. That may be your experience with it but it's not about the model. Using it is my day job and I've never ever seen that language.

It's frustrating for both of us, I assume.

I'm tired of people asserting, fact-free, that it got worse, because don't you know, other people saw it got worse? And it did something bad the other day.

You're tired of the thing not doing the thing and you have observed it no longer does the thing. And you certainly shouldn't need to retain past prompts just to prove it.


Note that 'no free lunch' has a specific meaning with no relation whatsoever to model size/quality trade-offs...

https://en.wikipedia.org/wiki/No_free_lunch_theorem

In the speed/quality trade-off sense, there have /often/ been free lunches in many areas of computer science, where algorithmic improvements let us solve problems orders of magnitude faster. We don't fully understand what further improvements will be available for LLMs.


That phrase comes from the more general adage though.

https://en.wikipedia.org/wiki/No_such_thing_as_a_free_lunch


"Free lunch" here relates to pricing/speed, I would say, because gpt-4 and gpt-4-turbo are sold side by side. If gpt-4-turbo is cheaper, faster, and has a much larger context window, why would it make sense to also sell gpt-4... unless it's a marketing trick, or perhaps it's kept for backwards compatibility, which could also be the case.


It's interesting that despite that, phi-2 is still way out in front of the 3B set on HF. I was convinced something would have caught up by now.


Nobody really trusts the leaderboards anymore. They're overfitted, and in some cases there's just outright cheating by training against common test sets.

https://twitter.com/karpathy/status/1737544497016578453


Are we sure the tutorial was medium? It might have been quite good, or at least above average. Ba dum tss.

“medium” should be capitalised in the title, as it refers to the blogging platform.

https://medium.com


Agreed, because I initially read it as "mediocre" rather than the brand.


And here I thought he was helped by a crystal ball...


Very interesting that this was managed with a course that's six months out of date.


On a free Colab instance...


The GPU he used is not free; you need to buy compute units for these "premium GPUs".

https://colab.research.google.com/signup


How much more low hanging fruit is there?


Is it possible that they fine-tuned on the leaderboard test set?


Who is using 7Bs in a serious manner, instead of OpenAI, in a cost-efficient way?


1. Fine-tune to specific tasks
2. Not subject to OpenAI's censorship
3. Can run locally instead of on cloud compute (offline)
4. Experimentation


Not sure what you’re responding to


Using mistral-ft-optimized-1218 7B (Q5_K_M quant); very useful for basic queries and creative input. Often just as useful as ChatGPT, and it feels far more "real" to have it fully local. I don't want to depend entirely on one proprietary service for something as fundamentally useful as this!
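
If you want to try the same setup, a minimal sketch with llama-cpp-python looks roughly like this; the GGUF filename below is just a placeholder for whichever quant you downloaded:

    # Minimal sketch: run a quantized local model with llama-cpp-python.
    # The model_path is a placeholder for your downloaded GGUF file.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./mistral-ft-optimized-1218.Q5_K_M.gguf",  # placeholder path
        n_ctx=4096,       # context window
        n_gpu_layers=-1,  # offload all layers to the GPU if one is available
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Give me three creative prompts for a short story."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])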


I have spent more time than I would like to admit trying to get 3B-34B models working for "serious" use cases like RAG, code comprehension/generation, and text summarization, across dozens of "leading" free LLMs, and none of them (not even Mixtral MoE or its derivatives) come close to gpt-3.5-turbo for consistency and depth, and there is no equal to gpt-4(-turbo).

LLM leaderboards and benchmarks do not show the complete picture for YOUR specific use case.


If you can post results, that would be extremely, extremely helpful. This is what I've been looking for: real-world performance rather than gamed leaderboards.


That's just it: it is highly dependent on your inputs, expectations, and use case. My observations are below.

Mixtral is about average at summarization; not enough detail makes it into the summary. Passable for non-decision-bound workflows. Quantized models inject occasional spelling errors, so I fall back on unquantized Mistral. Skip all the Mistral fine-tunes; only choose Instruct models.

There are no good local code comprehension/generation models (and that includes CodeLlama, DeepSeek, Starcoder, WizardCoder, Phind). You will spend as much time guiding/rerolling/correcting output as coding it yourself, assuming your task is nontrivial and exceeds 100+ LoC. Code completion via continue.dev using local models did not yield good results, either.

No local models are good at agentic tasks (e.g., via AutoGen). You might have better luck rolling your own with function-calling via NexusRaven2 than using Agent frameworks with local LLMs.
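
To illustrate what "rolling your own" can mean here, a generic sketch (this is not NexusRaven's actual prompt format, and get_weather is a made-up tool): show the model your tool signatures, have it emit a single call expression, then parse and dispatch it from a whitelist.

    # Generic function-calling sketch (not NexusRaven-specific; made-up tool):
    # the model is shown a tool signature and asked to emit one call expression,
    # which we parse with ast and dispatch from a whitelist instead of eval().
    import ast

    def get_weather(city: str) -> str:  # made-up example tool
        return f"Sunny in {city}"

    TOOLS = {"get_weather": get_weather}

    def dispatch(call_str: str):
        """Parse something like get_weather(city='Paris') and run it safely."""
        node = ast.parse(call_str.strip(), mode="eval").body
        if not (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in TOOLS):
            raise ValueError("model produced an unexpected call")
        args = [ast.literal_eval(a) for a in node.args]
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        return TOOLS[node.func.id](*args, **kwargs)

    # call_str would come from your local LLM, prompted with the tool signatures
    print(dispatch("get_weather(city='Paris')"))  # -> Sunny in Paris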

Claude-Instant edges out gpt-3.5-turbo for longform summarization with its 100k context. It is what I use for https://HackYourNews.com because chunked/rolling summarization is noisy.
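
(For context, "chunked/rolling summarization" means something like the sketch below; summarize_with_llm is a stand-in for whatever model you actually call. The noise comes from errors in early chunk summaries being carried forward into every later step.)

    # Rolling (chunked) summarization sketch: each chunk is summarized together
    # with the running summary, so early mistakes compound across chunks; a
    # single long-context pass avoids that. summarize_with_llm is a stand-in.
    def summarize_with_llm(prompt: str) -> str:
        raise NotImplementedError  # call whichever LLM/API you actually use

    def rolling_summary(document: str, chunk_size: int = 8000) -> str:
        chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
        summary = ""
        for chunk in chunks:
            summary = summarize_with_llm(
                f"Summary so far:\n{summary}\n\nNew text:\n{chunk}\n\n"
                "Rewrite the summary so it also covers the new text."
            )
        return summary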

For every non-hobby task, I am switching to gpt-4-*, choosing between base, 32k, and turbo depending on speed/correctness/cost/length tradeoffs. They are not perfect, but there is no competition.


Would you mind sharing how you got access to Claude? I applied a long time ago and have never heard back from them.


That's really useful insight.

Yes, it is use-case specific, but that's also why it's crucial for real-world trial results to be shared, so that there's a better understanding that's qualitatively accurate.

P.S: hackyournews.com is great!!


And this is my drawn-out experience as well.


If the end goal is document classification and/or semantic search, the Reexpress Fast I model (3.2 billion parameters) is a good choice. The key is that it produces reliable uncertainty estimates (for classification), so you know if you need a larger (or alternative) model. (In fact, an argument can be made that since the other models don't produce such uncertainty estimates, they are not ideal for serious use cases without adding an additional mechanism, such as ensembling with the Reexpress model.)
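
(For illustration, the fallback pattern being described looks roughly like the sketch below; classify_small and classify_large are hypothetical stand-ins, not the actual Reexpress API.)

    # Generic selective-prediction sketch (hypothetical functions, not the
    # Reexpress API): trust the small model only when its calibrated
    # confidence clears a threshold, otherwise escalate to a larger model.
    def classify_small(text: str) -> tuple[str, float]:
        raise NotImplementedError  # returns (label, calibrated_confidence)

    def classify_large(text: str) -> str:
        raise NotImplementedError  # larger or alternative fallback model

    def classify(text: str, threshold: float = 0.9) -> str:
        label, confidence = classify_small(text)
        return label if confidence >= threshold else classify_large(text)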


It seems that a lot of the leaders are the results of mixing finetunes, which really makes me think that there was a leak of test sets into the training data.
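
(For anyone unfamiliar: "mixing finetunes" usually means weight-space merging, in the style of mergekit. A naive linear merge is sketched below, which is also why anything baked into one parent's weights, leaked test data included, ends up averaged into the child.)

    # Rough sketch of a naive linear weight merge of two fine-tunes of the same
    # base model (tools like mergekit use fancier methods such as SLERP/TIES).
    # Whatever one parent memorized is averaged straight into the merged model.
    import torch
    from transformers import AutoModelForCausalLM

    def linear_merge(model_a_id: str, model_b_id: str, alpha: float = 0.5):
        a = AutoModelForCausalLM.from_pretrained(model_a_id, torch_dtype=torch.float16)
        b = AutoModelForCausalLM.from_pretrained(model_b_id, torch_dtype=torch.float16)
        b_state = b.state_dict()
        merged = {name: alpha * p + (1.0 - alpha) * b_state[name]
                  for name, p in a.state_dict().items()}
        a.load_state_dict(merged)
        return a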


To reply to myself: I am not saying that this model did, or that even if it did, it was done intentionally. ML is hard, and there are so many ways for data to leak.

What I AM surprised about is that it is not clear what CultriX did that was better than what a ton of others have done.

Any clues?


"Leaderboards", meh.

This tweet is still very true: https://twitter.com/karpathy/status/1737544497016578453


Goodhart's law would seem to apply:

https://en.wikipedia.org/wiki/Goodhart%27s_law

Nevertheless, scoring so well on this benchmark is an accomplishment, though I'm not in a position to evaluate how significant it is.


Nothing beats an actual human spending a couple hours with the model when it comes to meaningful evaluation.


That's why the HuggingFace LLM arena exists.


Good thread I saw on Reddit about this a few days ago.

https://www.reddit.com/r/LocalLLaMA/comments/18xbevs/open_ll...

Many top models are overfitting to the top leaderboards rather than being actually useful.


It seems so easy for just one poisoned model (trained on test data) to infect a ton of fine-tune model mixtures... it could happen without intention.

Under this scenario, wouldn't the ones that achieve the top performance be the most closely related to the poisoned model?


Why does huggingface list this as a 9B model?


It's trained with a LoRA adapter, so it's either an error or they also count the adapter. They use an inner LoRA dimension of 16, however, so it's unlikely that that's the reason (it's too small).

An important point to keep in mind is that at inference, LoRA adapters are meant to be merged into the base model, so they don't affect inference speed. (You need to do this explicitly, though, if you train your own adapter.)
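
With PEFT, that merge step looks roughly like this (model and adapter paths are placeholders):

    # Sketch: fold a LoRA adapter into its base model with PEFT so inference
    # runs at plain base-model speed. IDs/paths below are placeholders.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
    model = PeftModel.from_pretrained(base, "path/to/lora-adapter")
    merged = model.merge_and_unload()  # merges adapter weights into base layers
    merged.save_pretrained("merged-model")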


I'm as quick to jump on the Medium roastwagon as anyone else, but I will say Towards Data Science has a surprising number of quality tutorials covering the full spectrum of data science tasks.

That, and they have great SEO; you basically can't avoid them.


I avoid them easily using Kagi :))



