The only leaderboard this model does well on is the HuggingFace LLM Leaderboard, which is known to be manipulated and a victim of gross overfitting. The LMSYS Arena Leaderboard is a better representation of the best models.
Thank you!!! _And_ it has the proprietary models...insanely more useful.
It's a bug, not a feature, that the stock leaderboard ends up with endless fine-tunes, and as you point out and as the article demonstrates, it's more about something else than about quality.
Even that chatbot arena shows that many freely available, open-source models are better than some versions of GPT-3.5 and are within a stone's throw of the latest GPT-3.5.
Note that it only includes gpt-3.5-turbo (the current iteration), not the original gpt-3.5. It's not exactly a secret that "turbo" models are noticeably dumber than the originals, whatever OpenAI says. There's no free lunch - that's why it's so much cheaper and faster...
That said, we do have public 120b models now that genuinely feel better than the original gpt-3.5.
The holy grail remains beating gpt-4 (or even gpt-4-turbo). This seems to be out of reach on consumer hardware at least...
Same for the ChatGPTs post-launch (let's not talk about 11_02 :) )
-- and as long as we're asserting anecdotes freely, I work in the field and have a couple years in before ChatGPT -- it most certainly is not a well-kept secret or a secret or true or anything else other than standard post-millennial self-peasantization.
"Outright lie" is kinder toward the average reader by being more succinct, but it usually causes explosive reactions, because people take a while to come to terms with the fact that knowledge acquired ad hoc by consuming commentary is fundamentally flawed, if they ever do.
That just goes to show how useless the rankings are in general. If you actually use it, you'll quickly notice that older GPT-4 models are noticeably better at tasks that require reasoning.
gpt-4-turbo also has an extremely annoying quirk where instead of producing a complete response, it tends to respond with "you can do it like this: blah blah ...; fill in the blanks as needed", where the blank is literally the most important part of the response. It can sometimes take 3-4 rounds to get it to actually do what it's supposed to do.
But it does produce useless output much faster indeed.
This isn't true, I'm sorry. That may be your experience with it but it's not about the model. Using it is my day job and I've never ever seen that language.
It's frustrating for both of us, I assume.
I'm tired of people asserting, fact-free, that it got worse, because don't you know other people saw it got worse? And it did something bad the other day.
You're tired of the thing not doing the thing and you have observed it no longer does the thing. And you certainly shouldn't need to retain past prompts just to prove it.
In the speed/quality trade-off sense, there have /often/ been free lunches in many areas of computer science, where algorithmic improvements let us solve problems orders of magnitude faster. We don't fully understand what further improvements will be available for LLMs.
Free lunch here relates to pricing/speed, I would say, because gpt-4 and gpt-4-turbo are sold side by side. If gpt-4-turbo is cheaper, faster, and has a much larger context window, why would it make sense to also sell gpt-4... unless it's a marketing trick, or perhaps for backwards compatibility, which could also be the case.