Apologies, I was a bit curt because this is a well-worn interaction pattern.
I don't mean anything by the following either, other than that the goalposts have moved:
- This doesn't say anything about generalization, nor does it claim to.
- The occurrences of the prefix general* refer to the question "Can fine-tuning with synthetic logical reasoning tasks improve the general abilities of LLMs?"
- This specific suggestion was accomplished publicly, to some acclaim, in September.
- To wit, the benchmark the article is centered around hasn't been updated since September, because the preview of the large model blew it out of the water: the best score at the time was 33% across all puzzles, and it scored 71%: https://huggingface.co/spaces/allenai/ZebraLogic
- These aren't supposed to be easy; they're constraint satisfaction problems, which the authors point out are the kind used on the LSAT.
- The other major form of this argument is the Apple paper, which shows a 5-point drop, from 87% to 82%, on a home-cooked model.
https://huggingface.co/blog/yuchenlin/zebra-logic
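For context on what a ZebraLogic-style constraint satisfaction problem looks like, here's a minimal sketch: assign attributes to positions so every clue holds. The houses, pets, and clues below are invented for illustration, not taken from the benchmark (the real puzzles scale to far more attributes, which is what makes them hard):

```python
from itertools import permutations

COLORS = ("red", "green", "blue")
PETS = ("cat", "dog", "fish")

def solve():
    # Brute-force search over all assignments of colors and pets to houses 0..2.
    for colors in permutations(COLORS):
        for pets in permutations(PETS):
            red = colors.index("red")
            # Clue 1: the dog lives in the red house.
            if pets[red] != "dog":
                continue
            # Clue 2: the green house is immediately right of the red house.
            if red + 1 >= len(colors) or colors[red + 1] != "green":
                continue
            # Clue 3: the cat lives in the first house.
            if pets[0] != "cat":
                continue
            yield colors, pets

solutions = list(solve())
print(solutions)  # unique solution: houses are blue/red/green with cat/dog/fish
```

The search space here is only 3! x 3! = 36 assignments; the benchmark's larger grids grow factorially per attribute, which is why "just enumerate" stops being a strategy and systematic deduction is required.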