It's definitely interesting how the comments from right after the models were released were ecstatic about "SOTA performance" and it being "equivalent to o3", while comments like yours, posted hours later after actually testing it, keep pointing out that it's garbage compared to even the current batch of open models, let alone proprietary foundation models.

Yet another data point that benchmarks are utterly useless at this point, thoroughly gamed by all the major AI developers.

These companies are all clearly very aware that the initial wave of hype at release is "sticky" and drives buzz and tech news coverage, while real-world testing takes much longer, so that first impression is only slowly undermined by practical usage and comparison to other models. Benchmarks with wildly overconfident names like "Humanity's Last Exam" aren't exactly helping with objectivity either.
