Apologies, I was a bit curt because this is a well-worn interaction pattern.
I don't mean anything by the following either, other than that the goalposts have moved:
- This doesn't say anything about generalization, nor does it claim to.
- The occurrences of the prefix general* refer to the question "Can fine-tuning with synthetic logical reasoning tasks improve the general abilities of LLMs?"
- This specific suggestion was accomplished publicly, to some acclaim, in September.
- To wit, the benchmark the article is centered around hasn't been updated since September, because the preview of the large model blew it out of the water: the best score at the time was 33% across all puzzles, and it scored 71%: https://huggingface.co/spaces/allenai/ZebraLogic
- These aren't supposed to be easy; they're constraint satisfaction problems, which the authors point out are the kind used on the LSAT.
- The other major form of this argument is the Apple paper, which shows a 5-point drop, from 87% to 82%, on a home-cooked model.
https://huggingface.co/blog/yuchenlin/zebra-logic
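For context on what a ZebraLogic-style constraint satisfaction problem looks like, here's a minimal sketch: assign attributes to positions so every clue holds. The houses, pets, and clues below are invented for illustration, not taken from the benchmark (the real puzzles scale to far more attributes, which is what makes them hard):

```python
from itertools import permutations

COLORS = ("red", "green", "blue")
PETS = ("cat", "dog", "fish")

def solve():
    # Brute-force search over all assignments of colors and pets to houses 0..2.
    for colors in permutations(COLORS):
        for pets in permutations(PETS):
            red = colors.index("red")
            # Clue 1: the dog lives in the red house.
            if pets[red] != "dog":
                continue
            # Clue 2: the green house is immediately right of the red house.
            if red + 1 >= len(colors) or colors[red + 1] != "green":
                continue
            # Clue 3: the cat lives in the first house.
            if pets[0] != "cat":
                continue
            yield colors, pets

solutions = list(solve())
print(solutions)  # unique solution: houses are blue/red/green with cat/dog/fish
```

The search space here is only 3! x 3! = 36 assignments; the benchmark's larger grids grow factorially per attribute, which is why "just enumerate" stops being a strategy and systematic deduction is required.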