
>We evaluate six frontier language models across both settings: Claude 3.5-Sonnet, GPT-4o, GPT-4o-Mini, Deepseek-v3, Gemini-2-Flash, and Llama-3.3-70B-Instruct.

While I appreciate the effort and creativity that went into this, there are much simpler dynamic benchmarks that can saturate the planning capabilities of non-reasoning models.

Something as simple as giving a list of flight connections between cities and then asking for an itinerary between two of them confuses all of these models once the shortest path between the two nodes is long enough.
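
If you want to try this yourself, here is a rough sketch of the kind of setup I mean. The city names, graph sizes, and prompt wording are placeholders, not the exact ones I used:

    import random
    from collections import deque

    def make_flight_graph(n_cities=40, extra_edges=30, seed=0):
        """Random connected flight network: a shuffled spanning path plus extra
        shortcut edges. Sizes are made up; tune them to control path lengths."""
        rng = random.Random(seed)
        cities = [f"City{i:02d}" for i in range(n_cities)]  # placeholder names
        order = cities[:]
        rng.shuffle(order)
        edges = {tuple(sorted(p)) for p in zip(order, order[1:])}  # keeps it connected
        while len(edges) < n_cities - 1 + extra_edges:
            edges.add(tuple(sorted(rng.sample(cities, 2))))
        return cities, sorted(edges)

    def shortest_path(edges, src, dst):
        """Plain BFS; returns one shortest route as a list of cities (ground truth)."""
        adj = {}
        for a, b in edges:
            adj.setdefault(a, []).append(b)
            adj.setdefault(b, []).append(a)
        prev, queue = {src: None}, deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                path = []
                while node is not None:
                    path.append(node)
                    node = prev[node]
                return path[::-1]
            for nxt in adj.get(node, []):
                if nxt not in prev:
                    prev[nxt] = node
                    queue.append(nxt)
        return None

    def make_prompt(edges, src, dst):
        flights = "\n".join(f"{a} <-> {b}" for a, b in edges)
        return (f"These are the available direct flights:\n{flights}\n\n"
                f"Give the shortest itinerary from {src} to {dst} "
                f"as a comma-separated list of cities.")

    cities, edges = make_flight_graph()
    src, dst = cities[0], cities[-1]
    print(make_prompt(edges, src, dst))
    print("ground truth:", shortest_path(edges, src, dst))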

Longest shortest path each model could reliably find (at least 8 of 10 tests passed at a given path length) between two cities:

    | Model            | Path Length |
    |------------------+-------------|
    | Claude Sonnet3.5 |          10 |
    | GPT-4o           |           7 |
    | GPT-4o-mini      |           4 |
    | Deepseek-v3      |           6 |
    | Gemini-2-Flash   |  Not tested |
    | Llama3.3-70B-Ins |           4 |
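
Roughly the kind of scoring loop behind the "8 of 10 at a given length" criterion, reusing the helpers from the sketch above. Details are simplified: ask_model stands in for whatever API client you use, and rejection-sampling graphs until the true path length matches is the lazy option, not necessarily what you'd do at scale:

    def is_valid_shortest(answer, edges, src, dst, true_len):
        """Right endpoints, every hop is a listed flight, and hop count matches BFS."""
        edge_set = set(edges)
        if not answer or answer[0] != src or answer[-1] != dst:
            return False
        hops = list(zip(answer, answer[1:]))
        if any(tuple(sorted(h)) not in edge_set for h in hops):
            return False
        return len(hops) == true_len

    def reliable_at_length(target_len, ask_model, n_trials=10, need=8):
        """Pass if the model solves >= need of n_trials instances whose true
        shortest path is exactly target_len hops."""
        passed, trials, seed = 0, 0, 0
        while trials < n_trials:
            seed += 1
            cities, edges = make_flight_graph(seed=seed)
            src, dst = cities[0], cities[-1]
            truth = shortest_path(edges, src, dst)
            if truth is None or len(truth) - 1 != target_len:
                continue  # keep sampling until the true length matches
            trials += 1
            reply = ask_model(make_prompt(edges, src, dst))  # ask_model is a stand-in
            answer = [c.strip() for c in reply.split(",")]
            passed += is_valid_shortest(answer, edges, src, dst, target_len)
        return passed >= need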


This is true - there are simpler benchmarks that can saturate planning for these models. We were motivated to create a broader-spectrum eval that tests multiple capabilities at once and remains viable into the future.


That's fair enough, but you should test other frontier model types to see if the benchmark makes sense for them.

For example, the shortest-path benchmark is largely useless when you look at reasoning models: since they have the equivalent of scratch paper to work through their answers, the limitation becomes their context length rather than any innate ability to reason.



