How would published numbers be useful without knowing what underlying data are being used to test and evaluate them? They are proprietary for a reason
To think that Anthropic is not being intentional and quantitative in their model building, just because they care less about saturated benchmaxxing, is to miss the forest for the trees
I'd recommend watching the video Nathan Lambert dropped yesterday on Olmo 3 Thinking. You'll learn there are a lot of places where even descriptions of proprietary testing regimes would give away some secret sauce
Nathan is at Ai2, which is all about open-sourcing the process, experience, and learnings along the way
Thanks for the reference, I'll check it out. But it doesn't really take away from the point I am making. If one level of description would give away proprietary information, then go one level up to a vaguer description. Deciding how to describe things at the proper level is more of a social problem than a technical one.