
"Leaderboards", meh.

This tweet is still very true: https://twitter.com/karpathy/status/1737544497016578453



Goodhart's law would seem to apply:

https://en.wikipedia.org/wiki/Goodhart%27s_law

Nevertheless, scoring so well on this benchmark is an accomplishment, though I'm not in a position to evaluate how significant it is.


Nothing beats an actual human spending a couple of hours with the model when it comes to meaningful evaluation.


That's why the Hugging Face LLM arena exists.



