LLMs don't lack the virtue of laziness: they have it if you want them to, provided the base prompt matches your intent. I've had good success convincing Claude-backed agents to aim for minimal code changes, make deduplication passes, and follow basically every other reasonable "instinct" of a very senior dev. It's not knowledge the models haven't integrated, just knowledge many of them don't keep at the forefront with default settings. I bet we've all seen the models that over-edit everything, acting like the crazy mid-level dev who fiddles with the entire codebase without caring one bit about anyone else's changes, or about the knowledge lost to all that fiddling.
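For concreteness, here's the kind of base-prompt section I mean, as a hypothetical CLAUDE.md excerpt (the exact wording is illustrative, not a tested recipe):

    ## Change discipline
    - Prefer the smallest diff that satisfies the request.
    - Do not refactor code outside the task's scope unprompted.
    - After the change works, do a separate deduplication pass
      rather than restructuring as you go.
    - When in doubt about scope, ask before editing.

A handful of lines like these are usually enough to shift an agent from over-editing toward the minimal-change instinct.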
And on Jess' comments about validating docs vs. generating them... it's a traditional locking problem, with traditional solutions. And it's not as if the agent can't read git and work out, by convention, which thing was done first in anticipation of the other.
I'm quite senior: in fact, I've been a teammate of a couple of the people mentioned in this article, and I suspect they wouldn't question my engineering standards. And yet I've not seen any of that kind of debt in my LLM workflows: if anything, by most traditional forms of evaluating software quality, the projects I work on are better than what they were 5, 10 years ago, using the same metrics as back then. It's not magic or anything; it's making sure the agents running share those quality priorities. But I am getting work done, instead of spending time looking for attention at conferences.
> if anything, by most traditional forms of evaluating software quality, the projects I work on are better than what they were 5, 10 years ago, using the same metrics as back then.
That side sentence introduces a lot of vagueness. Can you share details so we can get some validation of your claim? Which metrics are you using, and how does your code from 10, 5, and 0 years ago perform on them?
I feel that throwing in a vague claim like that unnecessarily dilutes your message and distracts from the point. But if you do have more to share, I'd be curious to learn more.
Business logic is usually the most substantial part of legacy systems in my experience, so I imagine so.
Not to be too negative, but a lot of modern software complexity is a prison of our own making, one we had time to build because our programs are actually pretty boring CRUD apps with little complex business logic.
I can only assume there's a ton of domain knowledge accrued over those years and beyond baked into the legacy code, that an LLM can just scoop up in a minute.
How much time did it take to verify and validate that large corpus of generated code? Not including the back and forth to get rid of hallucinations and other mistakes.
How does the code look? I'm curious whether there's proper usage of abstractions, or whether the logic is just kind of all over the place.
Some part of me feels like LLM-generated code is great if one cares about the solution, but leaves a lot to be desired if one actually cares about code quality. Then again, maybe I am just bad at using LLMs -- I prefer the chat over letting LLMs do the work for me.
The anecdote the GP is providing there rings true for me too - although I'm not sure I am going to offer better detail.
I'm a proponent of architectural styles like MVC, SOLID, hexagonal architecture, etc, and in pre-LLM workflows, "human laziness" often led to technical debt: a developer might lazily leak domain logic into a controller or skip writing an interface just to save time.
The code I get the LLM to emit is a lot more compliant with those, BUT with a caveat: the LLMs do have a habit of "forgetting" the specific concerns of the given file/package/etc., and I frequently have to remind them.
The "metric" improvement isn't that the LLM is a better architect than a senior dev; it's that it reduces the cost of doing things the right way. The delta between "quick and dirty" and "cleanly architected" has shrunk to near zero, so the "clean" version becomes the path of least resistance.
I'm seeing fewer "temporary" kludges because the LLM almost blindly follows my requests.
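As a sketch of the kind of "leak" I mean (all names here are hypothetical, in TypeScript), the lazy version embeds the domain rule in the controller, while the clean version keeps it behind a domain interface; with an LLM, both cost about the same to produce:

    // Lazy version: the discount policy leaks into the controller.
    function checkoutLazy(cartTotal: number, isMember: boolean): number {
      return isMember && cartTotal > 100 ? cartTotal * 0.9 : cartTotal;
    }

    // Clean version: the rule lives behind a domain interface,
    // and the controller only orchestrates.
    interface DiscountPolicy {
      apply(cartTotal: number, isMember: boolean): number;
    }

    const memberDiscount: DiscountPolicy = {
      apply: (cartTotal, isMember) =>
        isMember && cartTotal > 100 ? cartTotal * 0.9 : cartTotal,
    };

    function checkout(
      cartTotal: number,
      isMember: boolean,
      policy: DiscountPolicy = memberDiscount,
    ): number {
      return policy.apply(cartTotal, isMember);
    }

The second form is what used to get skipped "to save time"; now it's the path of least resistance.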
I don't think I'd like your code. But apparently there's enough implied YAGNI in my CLAUDE.md to prevent the unnecessary interfaces and layers of separation that you apparently like. So I guess there is a flavor for everyone.
I've recently had the interesting experience of working on a Clean Architecture project for the first time. Pre and post-LLM adoption.
It has been... difficult. Services/modules organised by infrastructural layer rather than by feature. A mediator pattern abstracted away the handling of commands. Just in case one day you needed CreateFooCommand to be executed by a different handler, or something, I dunno. It was so hard to figure out how to navigate everything. And it felt like the entire tradeoff was for the purpose of stopping smoothbrains from adding the ORM to an API endpoint - but with the cost of this crushing accidental complexity that made it hard for everyone to hold everything in their heads, not just for me but also for the smart guys on the team.
It turns out that the LLMs also performed extremely poorly. All the heavy abstractions were too hard for them (not to mention most of the developers).
I knew I had no chance of shifting things away from that paradigm. But as luck would have it... we started basically vibe-rewriting it from scratch without the bullshit enterprisey crap, and it's (a) dead simple, (b) has most of the features after one month, and (c) even though the code is questionable, inelegant AI slop with nearly zero regard for proper architectural design, it's way easier to deal with than before.
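For readers who haven't run into the mediator pattern, here's a stripped-down sketch of the indirection being described (CreateFooCommand is from the comment above; everything else is hypothetical TypeScript):

    // The command is a bag of data...
    class CreateFooCommand {
      constructor(public readonly name: string) {}
    }

    // ...its logic lives in a separate handler...
    interface Handler<C, R> {
      handle(command: C): R;
    }

    class CreateFooHandler implements Handler<CreateFooCommand, string> {
      handle(command: CreateFooCommand): string {
        return `created foo: ${command.name}`; // the actual business logic
      }
    }

    // ...and a mediator wires them together at runtime.
    class Mediator {
      private handlers = new Map<Function, Handler<any, any>>();
      register(type: Function, handler: Handler<any, any>): void {
        this.handlers.set(type, handler);
      }
      send<R>(command: object): R {
        const handler = this.handlers.get(command.constructor);
        if (!handler) throw new Error("no handler registered");
        return handler.handle(command);
      }
    }

    // The endpoint never sees CreateFooHandler; it dispatches blindly.
    const mediator = new Mediator();
    mediator.register(CreateFooCommand, new CreateFooHandler());
    mediator.send(new CreateFooCommand("bar"));

Three layers of ceremony to call one function, justified by the day the handler might need swapping.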
> Services/modules organised by infrastructural layer rather than by feature.
That is actually true to the original CA as far as I am aware. The "vertical slices" style of development came after.
> with the cost of this crushing accidental complexity that made it hard for everyone to hold everything in their heads, not just for me but also for the smart guys on the team.
What? You don't like editing 50 files every time a new column is added to a DB table?
> And it felt like the entire tradeoff was for the purpose of stopping smoothbrains from adding the ORM to an API endpoint
There is nothing wrong with this, and I will die on this hill. The entire purpose of CA was for people to make money off book sales, lectures, and consulting. Notice how every single one of the people involved in promoting CA has absolutely nothing noteworthy to their name. In fact, you might be surprised who was one of the consultants on a major failure of a solution... *cough cough*
I regularly prompt and re-prompt the clanker with esoteric terms like "subtractive changes" and "create by removing", and more common phrases like "make the change easy, then make the easy change", "yagni", "vertical slices", and "WET code is desirable".
It mostly works. CC's plan mode creates a plan by cleaning up first, then defining narrow, integrated steps. Mentioning "subtractive" and "yagni" appears to be a reliable enough way for an LLM to choose a minimal path.
To my mind these instructions remain incantations and I feel like an alchemist of old.
Was just listening to the Lenny’s Podcast interview with Simon Willison, who mentioned another such incantation: red/green TDD. The model knows what this means and it just does it, with a nice bump in code quality apparently.
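For anyone who hasn't met that incantation: red/green TDD means writing a failing test first (red), then the minimal code to make it pass (green), then refactoring while it stays green. A minimal sketch, in TypeScript with Node's built-in assert (names hypothetical):

    import assert from "node:assert";

    // Red: write the assertion first; it fails because slugify
    // doesn't exist yet.
    //   assert.strictEqual(slugify("Hello World"), "hello-world");

    // Green: the minimal implementation that makes it pass.
    function slugify(title: string): string {
      return title.toLowerCase().trim().replace(/\s+/g, "-");
    }
    assert.strictEqual(slugify("Hello World"), "hello-world"); // passes

    // Refactor: clean up while the assertion stays green.

Working in that loop gives the model a built-in check on its own output, which is presumably where the quality bump comes from.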
I’m trying out another, what I call the principle of path independence. It’s the idea that the code should reflect only the current requirements, and not the order in which functionality was added — in other words, if you should decide to rebuild the system again from scratch tomorrow, the code should look broadly similar to its current state. It sort of works even though this isn’t a real thing that’s in its training data.
"Coding style should reflect the casual minimalism of expert programmers" used to to bump it up a bit, I guess it had the right activations or whatever, but I haven't bothered as much with this stuff with recent models cause (a) writing "You are a gigachad developer who is a Level 99 Staff Wizard at FAANG" doesn't work any more and (b) More or less they code like what they can see in their context; if you have a shit codebase you're going to get more shit, so theoretically I'd give a baseline coding style in some instructions, but realistically I haven't have enough motivation to bother with the last thingy I was working on.
I often say to Claude "you're doing X when I want Y, how can I get you to follow the Y path without fail" and Claude will respond with "Edit my claude.md to include the following" which I then ask Claude to do.
Not sure this is a great idea. The model only internalized what it was trained on and writing prompts/context for itself isn't part of that. I try to keep my context as clean as possible, mostly today's models seem smart/aligned enough to be steered by a couple of keywords.
And yet the divisions that built those smart speakers have been reduced to almost nothing, because the monetization capabilities were minimal, as their common use cases are rather low value. The devices were priced quite low to try to gain marketshare, but it was a share of a market with minimal value.
The value of IoT that has been unlocked is, at best, minimal convenience. It's not unlike the metaverse: large investments have been made, but there's no killer app. I cannot even begin to imagine anything I'd consider high value that all that home automation could do for me. The best case is like power windows in cars: better than having to turn the handle like back in the day, and nowadays cheap enough to have 100% of the market, but, at best, a commodity, as nobody cares about which power window mechanism is being used.
Given how low the ceiling is, and how annoying an IoT ecosystem's technical problems are, Apple shouldn't touch the market with a ten-foot pole.
The idea that LA is an unbelievably dense location is puzzling. My Spanish hometown is significantly denser than Los Angeles. Even large parts of NYC are not in any way dense by global urban standards.
As for people parking in the street in the US, you will find them in many smaller cities. Look at random pictures of south St Louis: Plenty of neighborhoods built before every house had a 2 car garage, and therefore with a lot of on-street parking used every day. And that's with single family homes. Hell, you find this in deep suburbs too, where someone decides they want 4 cars, and have the garage full of crap. I could take pictures of at least ten cars parked on the curb, and at least 40 outdoors in driveways if I went for a one mile loop around my 4th ring suburb.
Not that this is the main reason Americans still use cars to go everywhere right now, as the rest of the infrastructure around me also makes cars mandatory. Suburbs with houses 3 miles from the nearest business, shops inaccessible on foot, streets that, while supposedly crossable, are extremely unsafe for pedestrians... In a world where, say, we limited each household to one car, my entire suburb would be abandoned and most businesses would collapse, much like a place like Madrid would collapse if it didn't run any public transit for 4 months.
When we look at automobiles, we also see that there were many ways to adapt to them. It's true that there are many parts of the anglosphere where, without one, you are a second-class citizen at best: the lived environment was built so that you could not live without them... but that's not the only choice. I spend part of every year in Spain, and there I might not get into a car once a month. Not because I am any kind of enthusiast, but because in the town where I stay, a car doesn't really help.
The difference, however, is network effects. When we make a place better for cars, we make it worse for pedestrians. Your adoption of the car, and its pressure on my lived environment, has effects on me. Same as, say, people joining Facebook or Twitter. But do LLMs create network effects that are directly harmful, or is it just a matter of making it harder to compete, just as a mechanical watchmaker has less business now that it's so easy to own a reliable clock? Because the first case is a problem, but the second one... that's competition. It's civilization. And then it's not really a matter of cars vs. pedestrians.
I think it does have some network effects. When people are sending you 800-line markdown "planning documents" and "specs", drowning you in slop, it induces demand for LLMs to deflate that content back into something manageable.
That's quite the uncharitable view. Let's try a better one.
Changes in what humans need to remember and do have, for as long as we've had written records, changed the skills humans hone over time. They change our fitness function. Some of those changes are bad for a while, and then get better. Others are just far better at all times. Others might get rejected. Either way, it takes a long time before we know what a technology does to us: see how cheap printing is directly linked to the wars of religion.
So it's not that AI could not be bad in the short run, or even in the long run: it appears to be the kind of technology that cannot be evaluated without significant adoption, and at that point, we are on this rollercoaster for a while whether we want it or not. See social media, or just political innovation, like liberal democracy or communism. We can make guesses, but many guesses made early on look ridiculous in hindsight, like someone complaining about humans relying on writing.
It's not that juniors are replaceable, but that hiring them is a high-variance move. Few, if any, can tell whether a candidate is just memorizing leetcode and is going to be a dud, costing you effort before they get a PIP, or is a very talented individual who will be contributing in 2 weeks.
With seniors, you risk less, just because the track record makes the very worst candidates unlikely.
People often undervalue scaffolding. I was looking at a bug yesterday, reported by a tester. He has access to Opus, but through Amazon Q, looking at a single repo. It provided some useful information, but the scaffolding wasn't good enough.
I took its preliminary findings into Claude Code with the same model. But my setup knows where every adjacent system is, the entire git history, the deployment history, and the state of the feature flags. So instead of pointing at a vague problem, it knew which flag had been flipped in a different service, saw how that changed behavior and how, if the flag were flipped in prod, it would make the service under test cry, and knew which code change would make things work both ways.
It's not as if a modern Opus is a small model; the difference was just a stronger scaffold, along with more CLI tools available in the context.
The issue here in the security testing is to know exactly what was visible, and how much it failed, because it makes a huge difference. A middling chess player can find amazing combinations at good speed when playing puzzle rush: you are handed a position where you know a decisive combination exists, and that it works. The same combination, however, might be really hard to find over the board, because in a typical chess game it's rare for those combinations to exist, and the energy needed to thoroughly check for them, calculating all the way through every possibility, is enormous. This is why chess grandmasters would consider just being able to see the computer's score for a position to be massive cheating: just knowing that the last move was a blunder would be a decisive advantage.
When we ask a cheap model to look for a vulnerability and hand it the right context to actually find it, we are already priming it, versus asking it to find one when there may be nothing there.
Within the US, it's far more common than you think. That's typical senior dev money in a large company in cities like St Louis or KC. What is rare outside of the biggest markets is the whole "enough RSUs to double your salary" thing.
On a topic like cybersecurity, we never win by not looking: one needs top-of-the-line knowledge of how to break a system to be able to protect it. We have that dilemma with human experts already: the same government-sponsored unit that tells you that you need to update your encryption can hold on to that knowledge and exploit it at their leisure.
Given that it's absolutely impossible to stop people not aligned with us (for any definition of us) from doing AI research, the most reasonable way forward is to dedicate compute resources to the frontier, and to automatically send reasonable disclosures to major projects. It could in itself be a pretty reasonable product. Just like you pay for dubious security scans and publish that you are making them, an LLM company could offer actually expensive security reviews with a preview model, and charge accordingly.
As with any other good in the economy, price is always relevant: after all, the price is a key part of any offering. There are $80-100k workstations out there, but most of us don't buy them, because the extra capabilities just aren't worth it versus, say, a $3,000 computer, or even a $500 one. Do I need a top specialist, at $1,000 a visit, to consult for a stomachache? Definitely not at first.
There's a practical difference in how much better certain kinds of results can be. We already see coding harnesses offloading simple things to simpler models because they are accurate enough. Other things get dropped straight into ordinary programs, because those are that much more efficient than letting the LLM do everything.
There will always be problems where money is basically irrelevant, and a model that costs tens of thousands of dollars of compute per answer is seen as a great investment. But as long as there's a big price difference, for most questions, price and time to results are key features that cannot be ignored.