Anecdata, but I'm still finding CC to be absolutely outstanding at writing code.
It regularly writes, in hours and with minimal babysitting, systems-level code that would take me months to write by hand - with basically no "specs", just coherent, sane direction: make sure it tests things in several different ways, for several different cases, including performance, comparing directly to similar implementations (and constantly triple-check that it actually did what you asked after it said "done").
For $200/mo, I can still run 2-3 clients almost 24/7 pumping out features. I rarely clear my session. I haven't noticed quality declines.
Though, I will say, one random day - I'm not sure if it was dumb luck or if I was in a test group - CC was literally doing 10x its typical amount of work / speed. I guess strange things are bound to happen if you use it enough?
Related anecdata: IME, there has been a MASSIVE decline in the quality of claude.ai (the chatbot interface). It is so different recently. It feels like a wannabe, crappier version of ChatGPT, instead of what it used to be, which was something that tried to be factual and useful rather than conversational, addictive, and sycophantic.
My anecdata is that it heavily depends on how much of the relevant code and instructions it can fit in the context window.
A small app, or a task that touches one clear smaller subsection of a larger codebase, or a refactor that applies the same pattern independently to many different spots in a large codebase - the coding agents do extremely well, better than the median engineer I think.
Basically "do something really hard on this one section of code, whose contract of how it interacts with other code is clear, documented, and respected" is an ideal case for these tools.
As soon as the codebase is large and there are gotchas, edge cases where one area of the code affects the other, or old requirements - things get treacherous. It will forget something was implemented somewhere else and write a duplicate version, it will hallucinate what the API shapes are, it will assume how a data field is used downstream based on its name and write something incorrect.
IMO you can still work around this and move net-faster, especially with good test coverage, but you certainly have to pay attention. Larger codebases also work better when you started them with CC from the beginning, because its older code is more likely to actually work the way it expects/hallucinates.
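For example (a minimal, hypothetical sketch - the function and field names are made up purely to illustrate the point): a test that pins down a non-obvious field convention is exactly the kind of coverage that catches an agent guessing a field's meaning from its name.

```python
# Hypothetical example: 'discount' is stored as a percentage (0-100),
# not a 0-1 fraction. The name alone doesn't tell you that.
def normalize_discount(order: dict) -> float:
    """Convert the stored percentage into a fraction for pricing math."""
    return order["discount"] / 100.0

def test_discount_is_percentage_not_fraction():
    # An agent that assumes 'discount' is already a fraction would skip
    # the division and fail this test instead of shipping a 100x bug.
    assert normalize_discount({"discount": 25}) == 0.25
```

A suite full of small contract tests like this is what turns "it hallucinated the data shape" from a silent production bug into a red test run.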
> basically no "specs" - just giving it coherent sane direction
This is the one variable I almost always see in this discussion: the stricter the rules you give the LLM, the more likely it is to deeply disappoint you
The earlier in the process you use it (ie: scaffolding) the more mileage you will get out of it
It's about accepting fallibility and working with it, rather than trying to polish it away with care
To me this still feels like it would be a net negative. I can scaffold most any project with a language/stack specific CLI command or even just checking out a repo.
And sure, AI could “scaffold” further into controllers and views and maybe even some models, and they probably work ok. It’s when they don’t, or when I need something tweaked, that the worry becomes “do I really understand what’s going on under the hood? Is the time to understand that worth it? Am I going to run across a small thread that I end up pulling until my 80% done sweater is 95% loose yarn?”
To me the trade-off hasn’t proven worth it yet. Maybe for a personal pet project, and even then I don’t like the idea of letting something else nondeterministically touch my system. “But use a VM!” they say, but that’s more overhead than I care for. Just researching the safest way to bootstrap this feels like more effort than value to me.
Lastly, I think that a big part of why I like programming is that I like the act of writing code, understanding how it works, and building something I _know_.
How can a person reconcile this comment with the one at the root of this thread? One person says Claude struggles to even meet the strict requirements of a spec sheet, another says Claude is doing a great job and doesn’t even need specific specs?
I have my own anecdata but my comment is more about the dissonance here.
One is getting paid by a marketing department program and the other isn't. Remember how much has been spent making LLMs and they have now decided that coding is its money maker. I expect any negative comment on LLM coding to be replied to by at least 2 different puppets or bots.
... or one person has a very strong mental model of what he expects to do, but the LLM has other ideas. FWIW I'm very happy with CC and Opus, but I don't treat it as a subordinate but as a peer; I leave it enough room to express what it thinks is best and guide later as needed. This may not work for all cases.
If you do spot checks, that is woefully inadequate. I have lost count of the number of times when, poring over code a SOTA LLM has produced, I notice a lot of subtle but major issues (and many glaring ones as well), issues a cursory look is unlikely to pick up on. And if you are spending more time going over the code, how is that a massive speed improvement like you make it seem?
And, what do you even mean by 10x the amount of work? I keep saying anybody that starts to spout these sort of anecdotes absolutely does NOT understand real world production level serious software engineering.
Is the model doing 10x the amount of simplification, refactoring, and code pruning an effective senior level software engineer and architect would do? Is it doing 10x the detailed and agonizing architectural (re)work that a strong developer with honed architectural instincts would do?
And if you tell me it's all about accepting the LLM being in the driver's seat and embracing vibe coding, it absolutely does NOT work for anything exceeding a moderate level of complexity. I have tried that several times. Up to now no model is able to write a simple markdown viewer with certain specific features I have wanted for a long time. I really doubt the stories people tell about creating whole compilers with vibe coding.
If all you see is and appreciate that it is pumping out 10x features, 10x more code, you are missing the whole point. In my experience you are actually producing a ton of sh*t, sorry.
Honestly, this is more a question about the scope of the application and the potential threat vectors.
If the GP is creating software that will never leave their machine(s) and is for personal usage only, I'd argue the code quality likely doesn't matter. If it's some enterprise production software that hundreds to millions of users depend on, software that manages sensitive data, etc., then I would argue code quality should asymptotically approach perfection.
However, I have many moons of programming under my belt. I would honestly say that I am not sure what good code even is. Good to who? Good for what? Good how?
I truly believe that most competent developers (however one defines competent) would be utterly appalled at the quality of the human-written code on some of the services they frequently use.
I apply the Herbie Hancock philosophy when defining good code. When once asked what jazz music is, Herbie responded, "I can't describe it in words, but I know it when I hear it."
> I apply the Herbie Hancock philosophy when defining good code. When once asked what jazz music is, Herbie responded, "I can't describe it in words, but I know it when I hear it."
That’s the problem. If we had an objective measure of good code, we could just use that instead of code reviews, style guides, and all the other things we do to maintain code quality.
> I truly believe that most competent developers (however one defines competent) would be utterly appalled at the quality of the human-written code on some of the services they frequently use.
Not if you have more than a few years of experience.
But what your point is missing is the reason that software keeps working in the first place, or stays in a good enough state that development doesn’t grind to a halt.
There are people working on those code bases who are constantly at war with the crappy code. At every place I’ve worked over my career, there have been people quietly and not so quietly chipping away at the horrors. My concern is that with AI those people will be overwhelmed.
They can use AI too, but in my experience, the tactical tornadoes get more of a speed boost than the people who care about maintainability.
Not sure about ChatGPT, but Claude was (still is?) an absolute ripper at cracking some software if one has even a little bit of experience/low-level knowledge. At least, that's what my friend told me... I would personally never ever violate any software ToS.
Have you ever had "banana flavor" candy that doesn't really taste like bananas? The flavoring is isoamyl acetate, and I've heard it suggested that people called it banana flavor because it tasted more like the Gros Michel. After the switch to the Cavendish banana, the flavor name no longer made as much sense. Not sure how true it is though.
Someone in the thread linked to https://www.youtube.com/watch?v=I9ZtvpBoXzI, where Hank Green tells this same story... and tries a Gros Michel banana and says it doesn't taste like "banana flavor"
The bananas I had as a child back in 1960 had the strong flavor of isoamyl acetate along with the natural bouquet of related flavors in lesser amounts.
I hated it.
The space-age banana popsicles were even worse because they were nothing but isoamyl acetate.
You can buy a Gros Michel banana from Miami Fruit, although they are quite expensive (almost $40 for a single banana). There are reviews of the banana on YouTube as well - I highly recommend the Weird Explorer channel if you want video reviews of all sorts of strange fruit.
Most edible bananas are seedless, and most cultivated (human-grown) bananas are genetic mutants with triploid chromosomes (though a few are tetraploid or diploid). Getting them to produce functional reproductive structures at all, let alone viable seeds, is very difficult. There are ongoing efforts to cross-breed them with their wild cousins and to preserve genetic diversity.
> 2. I've learned nothing. So the cognitive load of doing it myself, even assembling a simple docker command, is just too high. Thus, I repeatedly fallback to the "crutch" of using AI.
I'm not trying to be offensive, so with all due respect... this sounds like a "you" problem. (And I've been there, too)
You can ask the LLMs: how do I run this, how do I know this is working, etc etc.
Sure... if you really know nothing, or you put close to zero effort into critically thinking about what they give you, you can be fooled by their answers and mistake complete irrelevance or bullshit for evidence that something works, is suitably tested, etc.
You can ask 2 or 3 other LLMs: check their work, is this conclusive, can you find any bugs, etc etc.
But you don't sound like you know nothing. You sound like you're rushing to get things done, cutting corners, and you're getting rushed results.
What do you expect?
Their work is cheap. They can pump out $50k+ worth of features in a $200/mo subscription with minimal baby-sitting. Be EAGER to reject their work. Send it back to them over and over again to do it right, for architectural reviews, to check for correctness, performance, etc.
They are not expensive people with feelings you need to consider in review, that might quit and be hard to replace. Don't let them cut corners. For whatever reason, they are EAGER to cut corners no matter how much you tell them not to.
Good advice. Personally I'm waiting until it is worthwhile to run these models locally, then I'm going to pin a version and just use that.
I'm only 5 years into this career, and I'm going to work manually and absorb as much knowledge as possible while I'm still able to do it. Yes, that means manually doing shit-kicker work. If AI does get so good that I need to use it, as you say, then I'll be running it locally on a version I can master and build tooling for.
Would you care to elaborate on what you mean with concert fatigue? I've never heard of it and you're talking about it as though it's something that is so common, it is implied to be known.
> They produce a drastically lower amount of tokens to solve a problem, but they don't seem to have put enough effort into refining their reasoning and execution, as they produce broken tool calls and generally struggle with 'agentic' tasks; but for raw problem solving without tools or search they match Opus and GPT while presumably being a fraction of the size.
Agreed, Gemini-cli is terrible compared to CC and even Codex.
But Google is clearly prioritizing to have the best AI to augment and/or replace traditional search. That's their bread and butter. They'll be in a far better place to monetize that than anyone else. They've got a 1B+ user lead on anyone - and even adding in all LLMs together, they still probably have more query volume than everyone else put together.
I hope they start prioritizing Gemini-cli, as I think they'd force a lot more competition into the space.
> Agreed, Gemini-cli is terrible compared to CC and even Codex.
Using it with opencode I don't find the actual model to cause worse results with tool calling versus Opus/GPT. This could be a harness problem more than a model problem?
I do prefer the overall results with GPT 5.4, which seems to catch more bugs in reviews that Gemini misses and produce cleaner code overall.
(And no, I can't quantify any of that, just "vibes" based)
I wonder what I am missing, because I can use gemini-cli with English descriptions of features or entire projects and it just cranks out the code. Built a bunch of stuff with it. Can't think of anything it's currently lacking.
>> Can't think of anything it's currently lacking.
Speed? The pro models are slow for me
The 3.1 Pro model is good and I don't recognise the GP's complaint of broken tool calls, but I'm only using it via the Gemini CLI harness; sounds like they might be hosting their own agentic loop?
I thought the same for a long time, borderline unusable with loops/bizarre decisions compared to Claude Code and later Codex.
But I picked it up again about a month ago and I have been quite impressed. Haven’t hit any of those frustrating QoL issues yet it was famous for and I’ve been using it a few hours a day.
Maybe it will let me down sooner or later but so far it has been working really well for me and is pretty snappy with the auto model selection.
After cancelling my Claude Pro plan months ago due to Anthropic enshittification I’ve been nervous relying solely on Codex in case they do the same, so I’ve been glad to have it available on my Google One plan.
Google doesn't need to give a shit, because so much of the internet is infested with Google ad trackers and AdWords, and everybody uses Chrome, that they will continue to make billions even without AI. Facebook did the same with their pixel so they could soak up data.
Gemini will be dead in 2 years and there'll be something else, but the ad and search company will remain given that they basically own the world wide web.
Except now, so much of the WWW is filled with AI slop that it breaks the system.
Whichever shitty model they’re using for search is so much better than the free offerings from the other companies. It’s not even close. It’s not going anywhere.
As someone else pointed out... When commits are this cheap, if that's the metric to be gamed, it will be gamed.
You just create 5 GitHub accounts, and spread your Claude Code commits to 5 separate accounts to make it look like there's 5 active contributors.
If anything, we're better off if the fake star economy remains the main thing most people are trying to game, because the signal to noise holds up elsewhere: (at least so far) it seems pretty easy to tell how many REAL active contributors there are.
Though, I should note, 2 heads are not always better than 1.
I'm more interested in a repository that has commits only from two geniuses than a repository that has 100s of morons contributing to it.
> Increased defense spending actually makes the US less, not more, safe.
It just makes us spend more money on defense, which is the entire point.
The industry obviously wants more and more profits.
They are never going to recommend getting rid of $200m F22s and replacing them with 30 $300k drones that would be more effective and cost 5% as much money.
That's 5% as much profit for them. They're not interested.
They are interested in profits, not national security.
And as you pointed out, they'd prefer a LESS secure world that inherently demands more money going to security.
You could spend more on security to actually be more secure. It's just that no one with any power is interested in that world.