"GPT-5 in the API is simpler: it’s available as three models—regular, mini and nano—which can each be run at one of four reasoning levels: minimal (a new level not previously available for other OpenAI reasoning models), low, medium or high."
Is it actually simpler? For those currently using GPT-4.1, we're going from 3 options (4.1, 4.1 mini and 4.1 nano) to at least 8, even if we don't count regular GPT-5: we now have to choose between GPT-5 mini minimal, GPT-5 mini low, GPT-5 mini medium, GPT-5 mini high, GPT-5 nano minimal, GPT-5 nano low, GPT-5 nano medium and GPT-5 nano high.
And, while choosing between all these options, we'll always have to wonder: should I try adjusting the prompt I'm using, or simply change the GPT-5 model size or its reasoning level?
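To make the combinatorics concrete, here's a quick sketch. The model-name strings are just illustrative labels for the options discussed above, not exact API identifiers:

```python
from itertools import product

# GPT-4.1 era: one knob (model size)
gpt41_options = ["gpt-4.1", "gpt-4.1-mini", "gpt-4.1-nano"]

# GPT-5 era: two knobs (model size x reasoning effort),
# ignoring regular gpt-5 as above
sizes = ["gpt-5-mini", "gpt-5-nano"]
efforts = ["minimal", "low", "medium", "high"]
gpt5_options = [f"{s} ({e})" for s, e in product(sizes, efforts)]

print(len(gpt41_options))  # 3
print(len(gpt5_options))   # 8
```

Counting regular GPT-5 too, the grid grows to 3 × 4 = 12 combinations.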
If reasoning is on the table, then you already had to add o3-mini-high, o3-mini-medium, o3-mini-low, o4-mini-high, o4-mini-medium, and o4-mini-low to the 4.1 variants. The GPT-5 way seems simpler to me.
If you need world knowledge, then bigger models. If you need problem-solving, then more reasoning.
But the specific nuance of picking nano/mini/main and minimal/low/medium/high comes down to experimentation and what your cost/latency constraints are.
What I want is an accurate answer (one best correlated with objective truth) on a topic I don't already know the answer to (or why would I ask?). That, to me, is the challenge with the "it depends, tune it" advice that always comes up with these tools: tuning requires a known-good answer to compare against, which means you can only tune on exactly the problems where you didn't need the tool in the first place.
If cost is no concern (as in infrequent one-off tasks) then you can always go with the biggest model with the most reasoning. Maybe compare it with the biggest model with no/less reasoning, since sometimes reasoning can hurt (just as with humans overthinking something).
If you have a task you do frequently, you need some kind of benchmark. That might just mean comparing how well the output of the smaller models holds up against the output of the bigger model, if you don't know the ground truth.
I agree. Public benchmarks aren't very useful for a bunch of reasons. Any company relying on LLMs for a critical function should have its own internal benchmark system. I maintain such a system for my job. If you are able, use the same prompt every time. It's fun to be able to include models like the original Bard on our leader board.
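A minimal version of that idea, scoring a cheap model's answers against a trusted reference model when no ground truth exists, might look like this. Note that `call_small`, `call_big`, and the similarity scorer are stand-ins for whatever API client and metric you actually use:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude lexical overlap score in [0, 1]; swap in an embedding
    or LLM-judge metric for real use."""
    return SequenceMatcher(None, a, b).ratio()

def benchmark(prompts, call_small, call_big, threshold=0.8):
    """Fraction of prompts where the small model's output 'holds up'
    against the big model's output, treating the big one as reference."""
    passes = 0
    for prompt in prompts:
        score = similarity(call_small(prompt), call_big(prompt))
        if score >= threshold:
            passes += 1
    return passes / len(prompts)

# Stub 'models' so the sketch runs without an API key.
small = lambda p: p.upper()
big = lambda p: p.upper()
print(benchmark(["what is 2+2?", "capital of France?"], small, big))  # 1.0
```

The important design choice is holding the prompt fixed across models and runs, so that score changes reflect the model swap and not prompt drift.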
When I read “simpler” I interpreted that to mean they don’t use their Chat-optimized harness to guess which reasoning level and model to use. The subscription chat service (ChatGPT) and the chat-optimized model on their API seem to have a special harness that changes reasoning based on some heuristics, and will switch between the model sizes without user input.
With the API, you pick a model size and a reasoning effort. Yes, more choices, but also a clear mental model and a simple choice that you control.
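In code, that mental model is literally two explicit fields in the request. A hedged sketch: the parameter shape below follows my understanding of OpenAI's Responses API (`reasoning.effort`), so check the current docs before relying on it:

```python
# Two explicit knobs: model size and reasoning effort.
request = {
    "model": "gpt-5-mini",            # or gpt-5, gpt-5-nano
    "reasoning": {"effort": "low"},   # minimal | low | medium | high
    "input": "Summarize the tradeoffs of smaller models.",
}
# client.responses.create(**request)  # needs an API key; not run here
print(sorted(request))
```

Compare that with the chat harness, where both knobs are picked for you by heuristics you can't see.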