
GPT-4 Turbo: 31.0

Claude 3 Opus: 27.3

Mistral Large: 17.7

Mistral Medium: 15.3

Gemini Pro 1.0: 14.2

Qwen 1.5 72B Chat: 10.7

Claude 3 Sonnet: 7.6

GPT-3.5 Turbo: 4.2

Mixtral 8x7B Instruct: 4.2

Llama 2 70B Chat: 3.5

Nous Hermes 2 Yi 34B: 1.5

The interesting part is the large improvement from medium-sized to large models, which existing, over-optimized benchmarks don't show.

- The maximum score is 100. 267 puzzles, 3 prompts for each, in both uppercase and lowercase

- Partial credit is given if the puzzle is not fully solved (see the scoring sketch after this list)

- There is only one attempt allowed per puzzle, 0-shot.

- Humans get 4 attempts and a hint when they are one step away from solving a group
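
The exact partial-credit formula isn't spelled out above, but here is a minimal sketch of how such a scorer could work, assuming each of the four groups in a Connections puzzle is worth a quarter of that puzzle's credit and the final score is the average over all puzzle/prompt runs scaled to 100 (the function names and the example numbers below are hypothetical, not from the actual benchmark):

    # Hypothetical partial-credit scorer for a Connections-style benchmark.
    # Assumption: each puzzle has 4 groups of 4 words; a run earns credit for
    # every group the model solved before its single attempt ended.

    def puzzle_credit(correct_groups: int, total_groups: int = 4) -> float:
        """Fraction of credit for one run: solved groups / total groups."""
        return correct_groups / total_groups

    def benchmark_score(runs: list[int]) -> float:
        """Average partial credit over all puzzle/prompt runs, scaled to 0-100."""
        if not runs:
            return 0.0
        return 100.0 * sum(puzzle_credit(r) for r in runs) / len(runs)

    # Example: 267 puzzles x 3 prompts = 801 runs; a model that fully solves
    # 200 runs, gets 2 of 4 groups right in 300 runs, and 0 elsewhere scores:
    runs = [4] * 200 + [2] * 300 + [0] * 301
    print(round(benchmark_score(runs), 1))  # ~43.7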

I hope to get the results of Gemini Advanced, Gemini Pro 1.5, and Grok, and to do a few-shot version, before posting it on GitHub.



Where is this? I googled a bit and found the game, but using it as a benchmark sounds genius!



