I agree. There's a kind of faux confidence to ChatGPT, this emotionally draining, ruthlessly authoritative prose that is just exhausting to engage with intellectually. It started around o3, and I have no idea what OpenAI is doing to make their models sound like this. Claude and Gemini models have a much more human tone to them.
"This is a classic ChstGPT gotcha". This and the gaslighting "Exactly, now you see why A!=B" when it was ME who pointed out his wrong A=B assumption are driving me crazy.
They f*cked it up. I am convinced ChatGPT will be a classic case of an early prodigy that gets surpassed by better second-generation products. History is full of those; I think Tesla is another, more recent one.
Indeed. While it's anecdotal, I find Gemini gives me the right answer on the first attempt, while sometimes I can't get ChatGPT to get it right even after two or three follow-ups.
> A lot of tech companies will find the institutional knowledge they thought would shore up their moat is worth a lot less than they thought.
I totally agree. I think going forward the primary value of SaaS will be the embedded domain expertise in a pre-built product. The comparison of Asana versus Notion comes to mind for project management: Asana forces the abstractions of good project management upon you, whereas Notion lets you build them yourself. I think this principle will scale to all software in the future, where the only real value of software becomes exported maintenance obligations and a predetermined domain abstraction.
But as you mentioned, I think companies will rapidly find that their own specific abstraction is worth a lot less than they believed.
I agree with this completely. I foresee an era of enterprise-level 'template' SaaS products that are expected to be tinkered with and highly customized. I think products like Notion, with an incredibly robust customizability and integration layer, are going to thrive: every single company can use a template engine to build extremely customized applications - and the barrier to building on top of these will essentially become the rate of human speech.
I share your observations. It's strange to see Anthropic losing so much ground so fast - they seemed to be the first to crack long-horizon agentic tasks via what I can only assume is an extremely exotic RL process.
Now, I will concede that for non-coding long-horizon tasks, GPT-5 is marginally worse than Sonnet 4.5 in my own scaffolds. But GPT-5 is cheaper, and Sonnet 4.5 is about 2 months newer. However, for coding in a CLI context, GPT-5-Codex is night-and-day better. I don't know how they did it.
Ever since 4.5, I can't get Claude to do anything that takes a while.
4.0 would chug along for 40 minutes. 4.5 sometimes refuses and straight up says the scope is too big.
My theory is that Anthropic is severely compute-constrained, and even though 4.5 is smarter, the usage limits and its obsession with rushing to finish were put in mainly to save compute on their servers.
I totally agree. I remember the June magic as well - almost overnight my abilities and throughput were profoundly increased, and I spent many weeks of late nights in awe and wonder, trying things that were beyond my ability to implement technically but within the bounds of my conceptual understanding.
Initially, I found Codex CLI with GPT-5 to be a substitute for Claude Code - now GPT-5 Codex materially surpasses it in my line of work, with a huge asterisk. I work in a niche industry, and Codex has generally poor domain understanding of many of the critical attributes and concepts. Claude happens to have better background knowledge for my tasks, so I've found that Sonnet 4.5 with Claude Code generally does a better job at scaffolding any given new feature. Then, I call in Codex to implement actual functionality since Codex does not have the "You're absolutely right" and mocked/placeholder implementation issues of CC, and just generally writes clean, maintainable, well-planned code. It's the first time I've ever really felt the whole "it's as good as a senior engineer" hype - I think, in most cases, GPT5-Codex finally is as good as a senior engineer for my specific use case.
I think Codex is a generally better product with better pricing, typically 40-50% cheaper for about the same level of daily usage for me compared to CC. I agree that it will take a genuinely novel and material advancement to dethrone Codex now. I think the next frontier for coding agents is speed. I would use CC over Codex if it was 2x or 3x as fast, even at the same quality level. Otherwise, Codex will remain my workhorse.
When I was in high school, I would watch the algebra teacher work through expressions and go "ohhh, that makes sense". But when I got back home to do the homework, I couldn't make the pieces fit.
Isn't that the same? Just because something someone else wrote makes you go "ohh, I understand it conceptually" doesn't mean that you can apply that concept yourself a few days or weeks later.
So when the person you responded to says:
>almost overnight *my abilities* and throughput were profoundly increased
I'd argue the throughput was, but his abilities really weren't, because without the tool in question he's just as good as he was before it. To truly claim that his abilities were profoundly increased, he'd have to internalize the pattern, recognize it, and successfully reproduce it across variable contexts.
Another example would be claiming that my painting abilities and throughput were profoundly increased because I used to draw stick figures and now I can draw Yu-Gi-Oh! cards using the tool. My throughput really was increased, but my abilities as a painter really haven't been.
>I think, in most cases, GPT5-Codex finally is as good as a senior engineer for my specific use case.
This is beyond bananas to me, given that I regularly see Codex high and GPT-5 high both fail to create basic React code that sits slightly outside the normal distribution.
That might say something about the understandability of the React framework/paradigm ;)
Quality varies a lot based on what you're doing, how you prompt it, how you orchestrate it, and how you babysit and correct it. I haven't seen anything I'd call senior, but I have seen it, for some classes of tasks, turn this particular engineer into many seniors. I still have to supply all the heavy lifting (here's the concurrency model, how you'll ensure exactly-once delivery, particular functions and classes you definitely want, a few common pitfalls to avoid, etc.), but then it can flesh out the details extremely well.
If you really want to see it fail at something easy, try to have it write something that uses JSX but doesn't use React (Bun, Hono, etc). It seems like no amount of context management and detailed instructions will keep it from reaching for React-isms.
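To make it concrete, here's the kind of React-free JSX I mean - a minimal sketch using Hono's own jsx runtime (the component and route are made up, and it assumes a tsconfig with "jsx": "react-jsx" and "jsxImportSource": "hono/jsx"):

    // server.tsx - no React anywhere; Hono supplies the jsx runtime
    import { Hono } from 'hono'
    import type { FC } from 'hono/jsx'

    const app = new Hono()

    // A plain function component, typed with Hono's FC, not React's
    const Greeting: FC<{ name: string }> = ({ name }) => <h1>Hello, {name}!</h1>

    // c.html() renders the JSX straight to an HTML response
    app.get('/', (c) => c.html(<Greeting name="world" />))

    export default app // Bun picks this up, e.g. `bun run server.tsx`

Files this simple are exactly where I watch it sneak an import of React or a useState call back in.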
Do you mind if I ask what kind of React code you're working on? I've had good success using Codex for my frontend development, especially since all of my projects consistently rely on a pretty widely used and well documented component library. I realize that makes my use case fairly narrow, so I don't think I've discovered the limits you have.
Today I was trying to get it to shim in a value temporarily for development: consume a value from a Redux store by merely putting a default in the reducer, so that the application would present different state depending on that value.
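Roughly what I was after, simplified (all names here are hypothetical):

    // featureShim.tsx - reducer and consumer together for brevity
    import { useSelector } from 'react-redux'

    type FeatureState = { variant: 'beta' | 'classic' }

    // TEMP dev shim: the default in the reducer stands in for the real source
    const initialState: FeatureState = { variant: 'beta' }

    export function featureReducer(
      state: FeatureState = initialState,
      _action: { type: string }
    ): FeatureState {
      return state // no actions yet; the default IS the shim
    }

    // The component just reads the value and branches. It is always
    // present, so no defensive existence checks are needed anywhere.
    export function Page() {
      const variant = useSelector((s: { feature: FeatureState }) => s.feature.variant)
      return <div>{variant === 'beta' ? 'New experience' : 'Classic experience'}</div>
    }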
It failed to accomplish this and added a disgusting amount of defensive nonsense code to my saga, reducer, and component to ensure the value was there. It took me a very short time to correct it, but just watching it completely fail at this task was borderline absurd.
Thanks for the context! I feel the same way. When it fails, it fails hard. This is why I'm extremely skeptical of any of the non-CLI cloud solutions - as you observed, I think the failures compound and cascade if you don't stop them early, which requires a compelling interface and the ability to manually intervene very fast.