Noddy, what’s “fair game” for this benchmark? E.g., do you want to give frontier models a text goal and tooling info and leave it at that? Or do you want full agent architectures to compete? It seems to me that goal setting, layout, and implementation are separate tasks that would each benefit from a different agent.
The idea is for us to track all frontier models using the basic agent (goal plus tooling info), and then offer a separate leaderboard for custom agent architectures (with retrieval, etc.).