Claude 3 Opus: 27.3
Mistral Large: 17.7
Mistral Medium: 15.3
Gemini Pro 1.0: 14.2
Qwen 1.5 72B Chat: 10.7
Claude 3 Sonnet: 7.6
GPT-3.5 Turbo: 4.2
Mixtral 8x7B Instruct: 4.2
Llama 2 70B Chat: 3.5
Nous Hermes 2 Yi 34B: 1.5
The interesting part is the large gap between mid-sized models and the largest ones (for example, Claude 3 Sonnet at 7.6 versus Claude 3 Opus at 27.3). Existing over-optimized benchmarks don't show this.
- The maximum score is 100. There are 267 puzzles, with 3 prompts for each, in uppercase and lowercase
- Partial credit is given if the puzzle is not fully solved
- Only one attempt is allowed per puzzle, 0-shot
- Humans get 4 attempts and a hint when they are one step away from solving a group
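
To make the partial-credit idea concrete, here is a minimal sketch of how such a score could be computed. The exact rule is not spelled out above, so this is an assumption: each puzzle earns the fraction of its groups that the model reproduced exactly, and the final number is the mean over all runs scaled to 100. The function names (`normalize`, `puzzle_score`, `benchmark_score`) and the case-insensitive comparison are hypothetical, not the actual benchmark code.

```python
from typing import Iterable, List, Set


def normalize(groups: Iterable[Iterable[str]]) -> Set[frozenset]:
    # Compare groups as unordered sets of lowercased words.
    # Assumption: answers are matched case-insensitively.
    return {frozenset(word.strip().lower() for word in group) for group in groups}


def puzzle_score(predicted: Iterable[Iterable[str]],
                 solution: Iterable[Iterable[str]]) -> float:
    # Assumed partial-credit rule: fraction of solution groups matched exactly.
    target = normalize(solution)
    solved = normalize(predicted) & target
    return len(solved) / len(target)


def benchmark_score(puzzle_scores: List[float]) -> float:
    # Mean over all puzzle runs (e.g. every puzzle/prompt pair), scaled so the max is 100.
    return 100.0 * sum(puzzle_scores) / len(puzzle_scores)


# Toy example with two-word groups for brevity.
solution = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]
predicted = [["A", "b"], ["d", "c"], ["e", "g"], ["f", "h"]]
half_solved = puzzle_score(predicted, solution)
overall = benchmark_score([1.0, half_solved, 0.0])
```

In the toy example two of the four groups match, so that puzzle earns half credit. A different rule (for instance, credit per correctly placed word) would give different numbers.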
I hoped to get the results of Gemini Advanced, Gemini Pro 1.5, and Grok and do a few-shot version before posting it on GitHub.