
>We evaluate six frontier language models across both settings: Claude 3.5-Sonnet, GPT-4o, GPT-4o-Mini, Deepseek-v3, Gemini-2-Flash, and Llama-3.3-70B-Instruct.

While I appreciate the effort and creativity that went into this, there are much simpler dynamic benchmarks that can saturate the planning capabilities of non-reasoning models.

Something as simple as giving a list of flight connections between cities and then asking for an itinerary between two of them confuses all of these models once the shortest path between the two nodes is long enough.
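
If you want to try this yourself, here is a rough sketch of the kind of setup I mean. The city names, graph sizes, and prompt wording are placeholders, not the exact ones I used:

    import random
    from collections import deque

    def make_flight_graph(n_cities=40, extra_edges=30, seed=0):
        """Random connected flight network: a shuffled spanning path plus extra
        shortcut edges. Sizes are made up; tune them to control path lengths."""
        rng = random.Random(seed)
        cities = [f"City{i:02d}" for i in range(n_cities)]  # placeholder names
        order = cities[:]
        rng.shuffle(order)
        edges = {tuple(sorted(p)) for p in zip(order, order[1:])}  # keeps it connected
        while len(edges) < n_cities - 1 + extra_edges:
            edges.add(tuple(sorted(rng.sample(cities, 2))))
        return cities, sorted(edges)

    def shortest_path(edges, src, dst):
        """Plain BFS; returns one shortest route as a list of cities (ground truth)."""
        adj = {}
        for a, b in edges:
            adj.setdefault(a, []).append(b)
            adj.setdefault(b, []).append(a)
        prev, queue = {src: None}, deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                path = []
                while node is not None:
                    path.append(node)
                    node = prev[node]
                return path[::-1]
            for nxt in adj.get(node, []):
                if nxt not in prev:
                    prev[nxt] = node
                    queue.append(nxt)
        return None

    def make_prompt(edges, src, dst):
        flights = "\n".join(f"{a} <-> {b}" for a, b in edges)
        return (f"These are the available direct flights:\n{flights}\n\n"
                f"Give the shortest itinerary from {src} to {dst} "
                f"as a comma-separated list of cities.")

    cities, edges = make_flight_graph()
    src, dst = cities[0], cities[-1]
    print(make_prompt(edges, src, dst))
    print("ground truth:", shortest_path(edges, src, dst))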

Longest shortest path each model could reliably find (at least 8 of 10 tests passed at a given path length) between two cities:

    | Model            | Path Length |
    |------------------+-------------|
    | Claude Sonnet3.5 |          10 |
    | GPT-4o           |           7 |
    | GPT-4o-mini      |           4 |
    | Deepseek-v3      |           6 |
    | Gemini-2-Flash   |  Not tested |
    | Llama3.3-70B-Ins |           4 |
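
Roughly the kind of scoring loop behind the "8 of 10 at a given length" criterion, reusing the helpers from the sketch above. Details are simplified: ask_model stands in for whatever API client you use, and rejection-sampling graphs until the true path length matches is the lazy option, not necessarily what you'd do at scale:

    def is_valid_shortest(answer, edges, src, dst, true_len):
        """Right endpoints, every hop is a listed flight, and hop count matches BFS."""
        edge_set = set(edges)
        if not answer or answer[0] != src or answer[-1] != dst:
            return False
        hops = list(zip(answer, answer[1:]))
        if any(tuple(sorted(h)) not in edge_set for h in hops):
            return False
        return len(hops) == true_len

    def reliable_at_length(target_len, ask_model, n_trials=10, need=8):
        """Pass if the model solves >= need of n_trials instances whose true
        shortest path is exactly target_len hops."""
        passed, trials, seed = 0, 0, 0
        while trials < n_trials:
            seed += 1
            cities, edges = make_flight_graph(seed=seed)
            src, dst = cities[0], cities[-1]
            truth = shortest_path(edges, src, dst)
            if truth is None or len(truth) - 1 != target_len:
                continue  # keep sampling until the true length matches
            trials += 1
            reply = ask_model(make_prompt(edges, src, dst))  # ask_model is a stand-in
            answer = [c.strip() for c in reply.split(",")]
            passed += is_valid_shortest(answer, edges, src, dst, target_len)
        return passed >= need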


This is true - there are simpler benchmarks that can saturate planning for these models. We were motivated to create a broader-spectrum eval that tests multiple capabilities at once and remains viable into the future.


That's fair enough, but you should test other frontier model types to see if the benchmark makes sense for them.

For example, the shortest-path benchmark is largely useless when you look at reasoning models: since they have the equivalent of scratch paper to work through their answers, the limitation becomes their context length rather than any innate ability to reason.



