
You cannot compare GPT-4o and o*(-mini) because GPT-4o is not a reasoning model.


Sure you can. "Reasoning" is ultimately an implementation detail, and the only thing that matters for capabilities is results, not process.


By "reasoning" I meant that o*(-mini) does chain-of-thought: it prompts itself to "reason" before responding to you, whereas GPT-4o(-mini) responds to your prompt directly. So it is not an apples-to-apples comparison unless you implement chain-of-thought for GPT-4o(-mini) and compare that with o*(-mini). See also: https://docs.anthropic.com/en/docs/build-with-claude/prompt-...
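For what it's worth, a like-for-like comparison along these lines is easy to set up: send the same question once directly and once behind an explicit "think step by step" scaffold, then grade only the final answers. A minimal sketch (the scaffold text here is my own illustration, not the hidden prompt the o* models actually use):

```python
# Sketch: the same model answering directly vs. with an explicit
# chain-of-thought scaffold. The system instruction is illustrative;
# it is NOT OpenAI's internal reasoning prompt.

def direct_messages(question: str) -> list[dict]:
    """Plain prompt: the model answers immediately."""
    return [{"role": "user", "content": question}]

def cot_messages(question: str) -> list[dict]:
    """Scaffolded prompt: ask the model to reason before answering."""
    return [
        {"role": "system",
         "content": "Reason step by step inside <thinking> tags, "
                    "then give your final answer after 'Answer:'."},
        {"role": "user", "content": question},
    ]

# With the openai client you would send each list to the same model, e.g.:
#   client.chat.completions.create(model="gpt-4o-mini",
#                                  messages=cot_messages(q))
# and grade only the text after "Answer:", so both variants are scored alike.
```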


That's like saying you can't compare a sedan to a truck.

Sure you can.

Even though one is more appropriate for certain tasks than the other.


It is a nuanced point, but "which is better, a sedan or a truck?" is roughly the stage of the conversation we are still at, so the comparison doesn't make much sense yet.

I do think it is a good metaphor for how all this shakes out though in time.


Yes, you use the models for the same things, and one is better than the other for a given thing. The reasoning process is an implementation detail that doesn't concern anybody when evaluating the models, especially since "open"AI does not expose it. I just want LLMs to do task X, which is usually "write a function in Y language that does W, taking these Z things into account", and for that I have found no reason to switch away from Sonnet yet.


Why can't you ask both questions (on a variety of topics etc), and grade the answers vs an ideal answer?

Ends before means.

If 4o answered better than o3, would you still use o3 for your task just because you were told it can "reason"?
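Concretely, that kind of ends-before-means eval can be as simple as scoring each model's answer against an ideal answer with token-overlap F1 (my choice of metric here, purely for illustration; a real eval might use exact match or an LLM judge):

```python
from collections import Counter

def f1_overlap(answer: str, reference: str) -> float:
    """Token-overlap F1 between a model answer and an ideal answer.
    A crude proxy for answer quality; process-agnostic by design."""
    a, r = answer.lower().split(), reference.lower().split()
    if not a or not r:
        return 0.0
    common = sum((Counter(a) & Counter(r)).values())
    if common == 0:
        return 0.0
    precision = common / len(a)
    recall = common / len(r)
    return 2 * precision * recall / (precision + recall)

# Grade every model the same way, no matter how it produced the answer:
answers = {"model_a": "The capital of France is Paris",
           "model_b": "Lyon"}
scores = {m: f1_overlap(ans, "Paris is the capital of France")
          for m, ans in answers.items()}
```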


The point is that you cannot make a general statement that “o1 is better than 4o.”


Yes, but that's because you need to say exactly what one is better than the other at, not because o1 spends a bunch of tokens on "reasoning" you cannot even see.


If you would like to see the CoT process visualized, try the “Improve prompt” feature in Anthropic console. Also check out https://github.com/getAsterisk/deepclaude


The o-whatever models are doing the same thing as any LLM; they've merely been tuned to use a chain of thought to break out of their complexity class (from TC0 pattern matching to something closer to a pseudo-UTM). But any foundation model with a bit of instruction tuning can do this.



