The 20b, set to high reasoning, solved the wolf, goat, cabbage river-crossing puzzle for me without needing a system prompt that encourages critical thinking. It managed it across several of the recommended settings, from temperatures of 0.6 up to 1.0, and so on.
Other models have generally failed that without a system prompt that encourages rigorous thinking. Each of the reasoning settings may well have thinking guidance baked in that does something similar, though.
I'm not sure it says that much that it can solve this, since it's public and could well be in the training data. It does say something if it can't solve it, though. So, for what it's worth, it solves it reliably for me.
Think this is the smallest model I've seen solve it.
Maybe both? I tried different animals, different scenarios, solvable versions, unsolvable versions, and it gave me the correct answer with high reasoning in LM Studio. It does tell me it's in the training data, but it reasons through things fairly well. It doesn't feel like it's just reciting the solution, and it picks up on nuances in the variations.
If I switch from LM Studio to Ollama and run it from the CLI without changing anything, it will fail, and it's harder to set the reasoning amount there. If I use the Ollama UI, it seems to do a lot less reasoning, and I'm not sure the UI has an option anywhere to adjust the system prompt so I can set the reasoning to high. In LM Studio, even with the Unsloth GGUF, I can set the reasoning to high in the system prompt, even though LM Studio won't show the reasoning-level button for that version.
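If the UI won't give you the option, you can also talk to the local server directly. Below is a minimal sketch against LM Studio's OpenAI-compatible endpoint (default http://localhost:1234/v1); the model identifier and the exact "Reasoning: high" wording are assumptions on my part, so check what your build and GGUF actually expect:

```python
# Minimal sketch: query the 20b through LM Studio's OpenAI-compatible local
# server and request high reasoning via the system prompt. The model name and
# the "Reasoning: high" phrasing are assumptions, not guaranteed to match your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",  # hypothetical identifier; use whatever name LM Studio shows
    messages=[
        # The reasoning level is set in the system prompt, so a plain text line
        # works even when the UI offers no selector for it.
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "A farmer must ferry a wolf, a goat and a "
                                    "cabbage across a river..."},
    ],
    temperature=0.6,
)
print(response.choices[0].message.content)
```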
In the case of the river puzzle there is a huge difference between repeating an answer you read somewhere and figuring it out on your own; one requires reasoning, the other does not. If you swap out the animals involved, you need some reasoning to recognize the identical structure of the two puzzles and map between the two sets of animals, but you are still very far from the amount of reasoning required to solve the puzzle without already knowing the answer.
You can also do it by brute force, which again requires more reasoning than mapping between structurally identical puzzles. And finally you can solve it systematically, which requires the most reasoning of all. In all of those cases there is a crucial difference between blindly repeating the steps of a solution you have seen before and coming up with that solution on your own, even if you cannot tell the two cases apart by looking at the output, which would be identical.
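To make the brute-force case concrete, here is a rough sketch of what solving it by search looks like: breadth-first search over which bank the farmer and each item are on, with the unsafe pairings checked at every step. The function and parameter names are mine, purely for illustration; the point is that the item names are opaque labels, so swapping in fox/hen/seeds changes nothing about the search.

```python
from collections import deque

# Brute-force breadth-first search over river-crossing states.
# A state records which bank ("L" or "R") the farmer and each item are on.
# The forbidden pairs encode "a eats b whenever the farmer is on the other bank".
def solve(items=("wolf", "goat", "cabbage"),
          forbidden=(("wolf", "goat"), ("goat", "cabbage"))):
    start = ("L",) * (len(items) + 1)   # farmer first, then the items, all on the left bank
    goal = ("R",) * (len(items) + 1)

    def safe(state):
        farmer, sides = state[0], dict(zip(items, state[1:]))
        return all(not (sides[a] == sides[b] != farmer) for a, b in forbidden)

    def moves(state):
        farmer, rest = state[0], list(state[1:])
        other = "R" if farmer == "L" else "L"
        yield (other, *rest)                 # the farmer crosses alone
        for i, side in enumerate(rest):
            if side == farmer:               # the farmer takes item i along
                taken = list(rest)
                taken[i] = other
                yield (other, *taken)

    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for nxt in moves(state):
            if nxt not in seen and safe(nxt):
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None  # no safe sequence of crossings exists for this variant

# Renaming the items does not change the search at all:
if __name__ == "__main__":
    for step in solve(items=("fox", "hen", "seeds"),
                      forbidden=(("fox", "hen"), ("hen", "seeds"))):
        print(step)
```

A systematic solution, by contrast, would start from the observation that one item (the goat, or here the hen) appears in both conflicts and so has to be ferried first and brought back mid-plan, which is a different and larger kind of reasoning than enumerating states.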
Taking up mgoetzke's challenge: change the names of the items to something different, but keep the same puzzle. If it fails with "fox, hen, seeds" instead of "wolf, goat, cabbage", then it wasn't reasoning or applying something learned to another case; it was just regurgitating from the training data.
> Can you read this sentence, since it's in Base-64 encoded German? Did you deduce the answer from scratch, or did you just recognize Base 64 and then enter the result into Google Translate? What is "reasoning" anyway, if you don't apply what you've learned from one case to another?
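The Base-64 analogy is easy to make concrete: recognizing the encoding is pattern matching, and undoing it is a mechanical table lookup that requires no understanding of the German underneath. A toy sketch (the example sentence is mine):

```python
import base64

# Encode a German sentence the way the quoted comment imagines it being posted...
encoded = base64.b64encode("Können Sie diesen Satz lesen?".encode("utf-8")).decode("ascii")
print(encoded)                                    # what the reader is shown

# ...and recover it purely by rote: no reasoning about the content is involved.
print(base64.b64decode(encoded).decode("utf-8"))  # "Können Sie diesen Satz lesen?"
```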