Well, you're not wrong :)
Vercel is not the one to blame here; it's my skill issue. The entire thing was vibecoded by me, a product manager with no production dev experience. Not to promote vibecoding, but I couldn't have built it any other way.
You're right, the results and numbers are mainly for entertainment purposes. This sample size does allow analyzing the main reasoning failure modes and how often they occur.
I noticed the same thing and think you're absolutely right. I thought about adding their current hand / draw, but it was too close to the event to test it properly.
That’s true.
The original goal was to see which model performs statistically better than the others, but I quickly realized that would be neither practical nor particularly entertaining.
A proper benchmark would require things like:
- Tens of thousands of hands played
- Strict heads-up format (only two models compared at a time)
- Each hand played twice with positions swapped
The current setup is mainly useful for observing common reasoning failure modes and how often they occur.
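For what it's worth, here's a rough sketch of what I mean by the position-swapped heads-up setup. It's purely illustrative and not code from the project: `play_hand()`, the model names, and the chip accounting are placeholders, assuming the game engine can replay the same deck from a seed.

```python
import random

def play_hand(button_model: str, big_blind_model: str, deck_seed: int) -> int:
    """Play one hand with a fixed deck derived from deck_seed and return the
    button player's net chips. Stand-in for the real game engine."""
    rng = random.Random(f"{deck_seed}:{button_model}:{big_blind_model}")
    return rng.randint(-100, 100)  # placeholder result

def duplicate_match(model_a: str, model_b: str, n_hands: int = 10_000) -> float:
    """Play each deck twice with seats swapped, so both models see the same
    cards from both positions; return model_a's average net chips per hand."""
    total = 0
    for deck_seed in range(n_hands):
        total += play_hand(model_a, model_b, deck_seed)  # A on the button
        total -= play_hand(model_b, model_a, deck_seed)  # same deck, seats swapped
    return total / (2 * n_hands)

print(duplicate_match("model-a", "model-b", n_hands=1_000))
```

The seat swap is the point: card luck mostly cancels out across the pair of hands, which is why you'd still need tens of thousands of hands, but far fewer than without it, before any difference is statistically meaningful.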
Thank you! I’ll take a look at that. Honestly, building the game was part of the fun, so I didn’t look into open-source options.
The slides are in the repo and the recording will be published on the Python España YouTube channel in a couple of months (in Spanish):
https://www.youtube.com/@PythonES
Great job on this btw. I don’t mean to take away anything from your work. I’ve also toyed with AI H2H quite a bit for my personal needs. It’s actually a challenging task because you have to have a good understanding of the models you’re plugging in.