The only leaderboard this model does well on is the HuggingFace LLM Leaderboard, which is known to be manipulated and a victim of gross overfitting. The LMSYS Arena Leaderboard is a better representation of the best models.
Thank you!!! _And_ it has the proprietary models...insanely more useful.
It's a bug, not a feature, that the stock leaderboard ends up with endless fine-tunes, and as you point out and as the article demonstrates, it's more about something else than about quality.
Even that chatbot arena shows that many freely available, open-source models are better than some versions of GPT-3.5 and are within a stone's throw of the latest GPT-3.5.
Note that it only includes gpt-3.5-turbo (the current iteration), not the original gpt-3.5. It's not exactly a secret that "turbo" models are noticeably dumber than the originals, whatever OpenAI says. There's no free lunch - that's why it's so much cheaper and faster...
That said, we do have public 120b models now that genuinely feel better than the original gpt-3.5.
The holy grail remains beating gpt-4 (or even gpt-4-turbo). This seems to be out of reach on consumer hardware at least...
Same for the ChatGPTs post-launch (let's not talk about 11_02 :) )
-- and as long as we're asserting anecdotes freely, I work in the field and have a couple years in before ChatGPT -- it most certainly is not a well-kept secret or a secret or true or anything else other than standard post-millennial self-peasantization.
"Outright lie" is kinder toward the average reader by being more succinct, but it usually causes explosive reactions, because people take a while to come to terms with the fact that knowledge acquired ad hoc by consuming commentary is fundamentally flawed, if they ever do.
That just goes to show how useless the rankings are in general. If you actually use it, you'll quickly notice that older GPT-4 models are noticeably better at tasks that require reasoning.
gpt-4-turbo also has an extremely annoying quirk where instead of producing a complete response, it tends to respond with "you can do it like this: blah blah ...; fill in the blanks as needed", where the blank is literally the most important part of the response. It can sometimes take 3-4 rounds to get it to actually do what it's supposed to do.
But it does produce useless output much faster indeed.
This isn't true, I'm sorry. That may be your experience with it but it's not about the model. Using it is my day job and I've never ever seen that language.
It's frustrating for both of us, I assume.
I'm tired of people asserting, fact-free, that it got worse, because don't you know other people saw it got worse? And it did something bad the other day.
You're tired of the thing not doing the thing and you have observed it no longer does the thing. And you certainly shouldn't need to retain past prompts just to prove it.
In the speed/quality trade-off sense, there have /often/ been free lunches in many areas of computer science, where algorithmic improvements let us solve problems orders of magnitude faster. We don't fully understand what further improvements will be available for LLMs.
Free lunch here relates to pricing/speed, I would say, because gpt-4 and gpt-4-turbo are sold side by side. If gpt-4-turbo is cheaper, faster, and has a much larger context window, why would it make sense to also sell gpt-4... unless it's a marketing trick, or perhaps for backwards compatibility, which could also be the case.