A couple of days ago, inspired by Simon and those discussions, I had Claude create 30 such tests. I posted a Show HN with the results from six models, but it didn’t get any traction. Here it is again:
Oh man, that’s hilarious. I dunno what qwen is doing most of the time. Gemini seems to be either a perfect win or complete nonsense. Claude seems to lean towards “good enough”.
https://news.ycombinator.com/item?id=45845717
https://gally.net/temp/20251107pelican-alternatives/index.ht...