We do actually track this. Each action an agent takes consumes 'ticks', a measure of how long the same action would take a human player. For example, moving 10 tiles to the right takes roughly 300 ticks (5 seconds).
However, our results only loosely indicate that smarter models use fewer ticks. I have an intuition that we will get more signal for tick-efficiency as models improve.
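To make the tick arithmetic concrete, here is a minimal sketch. The constants are only what the example above implies (300 ticks in about 5 seconds, 10 tiles in about 300 ticks), and the function names are hypothetical, not our actual code; real per-action costs are approximate and may vary.

```python
# Illustrative only: both constants are inferred from the single example
# "10 tiles ~ 300 ticks ~ 5 seconds", not taken from the real game.
TICKS_PER_SECOND = 60  # assumption: 300 ticks / 5 seconds
TICKS_PER_TILE = 30    # assumption: 300 ticks / 10 tiles


def move_cost_ticks(tiles: int) -> int:
    """Approximate tick cost of walking `tiles` tiles in a straight line."""
    return tiles * TICKS_PER_TILE


def ticks_to_seconds(ticks: int) -> float:
    """Convert ticks to human-equivalent seconds."""
    return ticks / TICKS_PER_SECOND


if __name__ == "__main__":
    cost = move_cost_ticks(10)
    print(cost, "ticks, about", ticks_to_seconds(cost), "seconds")
```

Summing costs like this over an agent's whole action log gives the total tick count we compare across models.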
To an LLM those probably aren't even costs.