
You cannot compare GPT-4o and o*(-mini) because GPT-4o is not a reasoning model.


Sure you can. "Reasoning" is ultimately an implementation detail, and the only thing that matters for capabilities is results, not process.


By "reasoning" I meant that o*(-mini) does chain-of-thought: it prompts itself to "reason" before responding to you, whereas GPT-4o(-mini) responds to your prompt directly. So it is not an apples-to-apples comparison unless you implement chain-of-thought for GPT-4o(-mini) and compare that with o*(-mini). See also: https://docs.anthropic.com/en/docs/build-with-claude/prompt-...
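For what it's worth, a like-for-like comparison along these lines is easy to set up: send the same question once directly and once behind an explicit "think step by step" scaffold, then grade only the final answers. A minimal sketch (the scaffold text here is my own illustration, not the hidden prompt the o* models actually use):

```python
# Sketch: the same model answering directly vs. with an explicit
# chain-of-thought scaffold. The system instruction is illustrative;
# it is NOT OpenAI's internal reasoning prompt.

def direct_messages(question: str) -> list[dict]:
    """Plain prompt: the model answers immediately."""
    return [{"role": "user", "content": question}]

def cot_messages(question: str) -> list[dict]:
    """Scaffolded prompt: ask the model to reason before answering."""
    return [
        {"role": "system",
         "content": "Reason step by step inside <thinking> tags, "
                    "then give your final answer after 'Answer:'."},
        {"role": "user", "content": question},
    ]

# With the openai client you would send each list to the same model, e.g.:
#   client.chat.completions.create(model="gpt-4o-mini",
#                                  messages=cot_messages(q))
# and grade only the text after "Answer:", so both variants are scored alike.
```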


That's like saying you can't compare a sedan to a truck.

Sure you can.

Even though one is more appropriate for certain tasks than the other.


It is a nuanced point, but "which is better, a sedan or a truck?" is roughly the stage of the conversation we are still at, so the comparison doesn't make much sense yet.

I do think it is a good metaphor for how all this shakes out though in time.


Yes, you use the models for the same things, and one is better than the other for a given thing. The reasoning process is an implementation detail that doesn't concern anybody when evaluating the models, especially since "open"AI does not expose it. I just want LLMs to do task X, which is usually "write a function in Y language that does W, taking these Z things into account", and for that I have found no reason to switch away from Sonnet yet.


Why can't you ask both questions (on a variety of topics etc), and grade the answers vs an ideal answer?

Ends before means.

If 4o answered better than o3, would you still use o3 for your task just because you were told it can "reason"?
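Concretely, that kind of ends-before-means eval can be as simple as scoring each model's answer against an ideal answer with token-overlap F1 (my choice of metric here, purely for illustration; a real eval might use exact match or an LLM judge):

```python
from collections import Counter

def f1_overlap(answer: str, reference: str) -> float:
    """Token-overlap F1 between a model answer and an ideal answer.
    A crude proxy for answer quality; process-agnostic by design."""
    a, r = answer.lower().split(), reference.lower().split()
    if not a or not r:
        return 0.0
    common = sum((Counter(a) & Counter(r)).values())
    if common == 0:
        return 0.0
    precision = common / len(a)
    recall = common / len(r)
    return 2 * precision * recall / (precision + recall)

# Grade every model the same way, no matter how it produced the answer:
answers = {"model_a": "The capital of France is Paris",
           "model_b": "Lyon"}
scores = {m: f1_overlap(ans, "Paris is the capital of France")
          for m, ans in answers.items()}
```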


The point is that you cannot make a general statement that “o1 is better than 4o.”


Yes, but that's because you need to say exactly what one is better than the other at, not because o1 spends a bunch of tokens on "reasoning" you cannot even see.


If you would like to see the CoT process visualized, try the “Improve prompt” feature in Anthropic console. Also check out https://github.com/getAsterisk/deepclaude


The o-whatever models are doing the same thing as any LLM; they've merely been tuned to use a chain of thought to break out of their complexity class (from TC0 pattern matching to something closer to a pseudo-UTM). But any foundation model with a bit of instruction tuning can do this.



