Anyone who would choose 3.7 with a fancy harness has a very poor memory of how dramatically model capabilities have improved between then and now.
I’d be very interested in the performance of 3.7 decked out with web search, context7, a full suite of skills, and code quality hooks against opus 4.5 with none of those. I suspect it’s closer than you think!
Skills don't make any difference beyond having markdown files to point an agent at, with instructions as needed. Context7 isn't any better than telling your agent to use trafilatura to scrape web docs for your libs, and having a linting/static analysis suite isn't a harness thing.
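For reference, the trafilatura route is only a few lines of Python. A rough sketch, assuming the library is installed; the URL is a placeholder, not a real docs page:

```python
# Rough sketch of the "just scrape the docs yourself" approach.
# The URL below is a placeholder, not a real docs endpoint.
import trafilatura

def fetch_doc_text(url: str) -> str | None:
    """Download a documentation page and return its main text content."""
    html = trafilatura.fetch_url(url)   # raw page, or None on failure
    if html is None:
        return None
    # extract() strips navigation/boilerplate and returns readable text
    return trafilatura.extract(html)

print(fetch_doc_text("https://example.com/docs/my-library"))
```

Point the agent at a script like that and it can pull whatever library docs it needs on demand.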
3.7 was kinda dumb, it was good at vibe UIs but really bad at a lot of things and it would lie and hack rewards a LOT. The difference with Opus 4.5 is that when you go off the Claude happy path, it holds together pretty well. With Sonnet (particularly <=4) if you went off the happy path things got bad in a hurry.
I've done this (although not with all these tools).
For a reasonably sized project it's easy to tell the difference in quality between, say, Grok-4.1-Fast (30 on AA Coding Index) and Sonnet 4.5 (37 on AA).
Sonnet 3.7 scores 27. No way I'm touching that.
Opus 4.5 scores 46 and it's easy to see that difference. Give the models something with high cyclomatic complexity or complex dependency chains and Grok-4.1-Fast falls to bits, while Opus 4.5 solves things.
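If you want to find that kind of code in your own repo to test with, the radon library will flag it. A rough sketch; the src/ path and the threshold are arbitrary choices, not anything standard:

```python
# Sketch: flag high-cyclomatic-complexity functions to hand to a model.
# The "src" directory and the threshold of 10 are arbitrary choices.
from pathlib import Path
from radon.complexity import cc_visit

THRESHOLD = 10

for path in Path("src").rglob("*.py"):
    try:
        blocks = cc_visit(path.read_text())  # functions/methods with McCabe scores
    except SyntaxError:
        continue
    for block in blocks:
        if block.complexity >= THRESHOLD:
            print(f"{path}:{block.lineno} {block.name} -> CC {block.complexity}")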
"Captcha" doesn't refer to any specific type of puzzle, but a class of methods for verifying human users. Some older-style captchas are broken, but some newer ones are not.
I'm aware. But I'm also aware that breaking these sorts of systems is quite fun for a lot of nerds. So don't expect anything like that to last for any meaningful amount of time.
Since before LLMs were even an issue, there have been services that use overseas workers to solve them, with the going rate around $0.002 per captcha (and they solve several different types).
This is both true and misleading. It implies captchas aren’t effective due to these services. In practice, though, a good captcha cuts a ton of garbage traffic even though a motivated opponent can pay for circumvention.
That may be true, but it does seem like OP's intent was to learn something about how LLM agents perform on complex engineering tasks, rather than learning about ASCII creation logic. A different but perhaps still worthy experiment.
That might be true for a narrow definition of chatbots, but they aren't going to survive on name recognition if their models are inferior in the medium term. Right now, "agents" are only really useful for coding, but when they start to be adopted for more mainstream tasks, people will migrate to the tools that actually work first.
Claude Code's Plan Mode increasingly does a (small-scale) version of this - it will research your codebase and come back to you with a set of clarifying questions and design decisions before presenting its implementation plan.
Your comment assumes that the purpose of an economic system is to preserve a financial status quo. It isn't, and shouldn't be. Inflation incentivizes people to put their money to productive uses in the economy (capital formation) rather than hoarding resources.
Allowing the financial system to inflate the money supply destroys two of those fundamental qualities.
The fact that it can additionally charge interest on that money funnels the stolen value into its hands.
Interest on money loaned out is the only incentive required for putting money to "productive uses". Nothing about hard money affects that. In fact inflation only causes the people at the top of the pyramid to hoard all of the economic value instead. They are buying up and hoarding the entire world with the wealth they are taking from the people.
Bitcoin was envisioned by its creator to be used as a currency: to buy and sell stuff using it. If you ask today what bitcoin is, you'll be told that it is a store of value. The purpose of money is not to be a store of value. It can be, but that is not its purpose, as the case of bitcoin clearly illustrates.
> Interest on money loaned out is the only incentive required for putting money to "productive uses".
And what is the incentive to loan money in your system?
To become a medium of exchange, it needs to become a unit of account. That will happen as its value stabilises, and that will only happen once it's proved itself as a store of value.
What if Henry Ford envisaged his Model T being used as a temporary alternative for when your horse is unwell? Or as a fairground ride? Bitcoin is what it is.
> And what is the incentive to loan money in your system?
Interest - the age-old solution. Offer me interest that both compensates me for not having use of my money and for the risk of getting it back, and we have a deal.
Value can only stabilise if there's someone in charge adjusting the rate of printing to maintain a stable value. It cannot be done algorithmically, as there's no way to determine the value from inside the system.
Non-deflationary currencies encourage hoarding which leads to wild swings in value. Deflationary currencies do much better. Look at the price chart of BTC vs XMR.
It depends on how you measure value. By stabilise I mean it stops growing in value by 50%/yr with big short-term swings of 80%.
As it matures and gets close to its ultimate value, volatility will naturally reduce.
Once it is used as the unit of account, everything else will fluctuate in value relative to bitcoin, which has more stable fundamentals than anything else on earth (fixed/zero issuance, liquidity, etc). But this will be decades in the future, when its dollar value will be 8 or 9 figures in today's money.
Not at all. It naturally stabilises the closer it gets to its ultimate market cap. The more it stabilises the more it will be used as a medium of exchange.
There is no evidence for this. When gold and gold-backed currencies were used for trade, they fluctuated wildly in value and there were several depressions each decade. After centrally-issued fiat currency was introduced, it had a much more stable value, since it could be issued counter-cyclically.
How are you measuring the value of gold? How are you sure it's not the value of the quote asset that's fluctuating wildly? If everything was priced in gold, do you really think the prices of everything would fluctuate wildly? For what reason? The only reason for any sudden change in the value of gold is demand, which is caused as people move their wealth out of fiat currencies that are collapsing in value.
The prices of things vary due to speculation as well as demand and supply. Gold is an industrial metal which has to be mined, thus making supply uncertain and exposed to the whims of miners. Central bank issued currencies are managed by varying the supply according to economic conditions, which has given economies much better stability as a result.
There's nothing special about gold except that some people think that it's a panacea. It's not, it's still an industrial metal. "Fiat currencies which are collapsing in value" has no basis in fact. It's just mindless ideology at best and no better than a conspiracy theory.
Fiat currencies are collapsing in value. It's a fact.
Price fiat in any hard asset like gold, real estate, bitcoin etc and it's obvious to see. The devaluation can be seen to be almost directly linked to the increase in the supply which approximately doubles each decade (no coincidence that real estate prices do the same).
There is something very special about gold. It has the best monetary qualities of any physical substance. It is only bettered by bitcoin, which essentially dematerialises gold, stripping away the physical attributes that hinder it (in portability, scarcity, verifiability, divisibility, etc).
I can answer that question: you don't have any. Your vague anecdotes don't count as such. You need hard figures and there aren't any showing any such thing.
I explained the evidence very clearly in my last answer. You're clearly not acting in good faith. You try and help people on here and this is what you get every time. GFY.
You should get some help yourself. And "GFY"? Very classy. Clearly demonstrates good faith on your part. Really shows the strength of your argument and your ability to deliver that argument. Only the most intelligent and educated people tell people to "GFY", obviously.
But back to your "argument". Google "anecdotal" and "non sequitur". But to save you time, here's a summary: you're not providing evidence, but instead a story that you created that attempts to support your argument. Then you deliver a series of non sequiturs, where your claims don't support your conclusion.
So again, please provide evidence instead of nonsensical and unsupported claims, then we can have an actual "good faith" discussion that I'm sure will be very helpful to everyone.
I'd like others' input on this: increasingly, I see Cursor, Jetbrains, etc. moving towards a model of having you manage many agents working on different tasks simultaneously. But in real, production codebases, I've found that even a single agent is faster at generating code than I am at evaluating its fitness and providing design guidance. Adding more agents working on different things would not speed anything up. But perhaps I am just much slower or a poorer multi-tasker than most. Do others find these features more useful?
I usually run one agent at a time in an interactive, pair-programming way. Occasionally (like once a week) I have some task where it makes sense to have one agent run for a long time. Then I'll create a separate jj workspace (equivalent of git worktree) and let it run.
I would probably never run a second agent unless I expected the task to take at least two hours; any more than that and the cost of multitasking for my brain is greater than any benefit, even when there are things that I could theoretically run in parallel, like several hypotheses for fixing a bug.
IIRC Thorsten Ball (Writing an Interpreter in Go, lead engineer on Amp) also said something similar in a podcast – he's a single-tasker, despite some of his coworkers preferring fleets of agents.
I've recently described how I vibe-coded a tool to run this single background agent in a docker container in a jj workspace[0] while I work with my foreground agent but... my reviewing throughput is usually saturated by a single agent already, and I barely ever run the second one.
New tools keep coming up for running fleets of agents, and I see no reason to switch from my single-threaded Claude Code.
What I would like to see instead are efforts to make the reviewing step faster. The Amp folks had an interesting preview article on this recently[1]. This is the direction I want tools to be exploring if they want to win me over - help me solve the review bottleneck.
My CTO is currently working on the ability to run several dockerised versions of the codebase in parallel for this kind of flow.
I’m here wondering how anyone could work on several tasks at once at a speed where they can read, review and iterate on the output of one LLM in the time it takes for another LLM to spit out an answer for a different task.
Like, are we just asking things as fast as possible and hoping for a good solution unchecked? Are others able to context switch on every prompt without a reduction in quality? Why are people tackling the problem of prompting at scale as if the bottleneck was token output rather than human reading and reasoning?
If this was a random vibecoding influencer I’d get it, but I see professionals trying this workflow and it makes me wonder what I’m missing.
Code Husbandry is a good term for something I've been thinking about how to implement. I hope you don't mind if I steal it. Think automated "mini agents", each with a defined set of tools and tasks, responding to specific triggers.
Imagine one agent just does docstrings - on commit, build an AST, branch, write/update comments accordingly, push and create a merge request with a standard report template (a rough sketch of that detection step follows below).
Each of these mini-agents has a defined scope and operates in its own environment, and can be customized/trained as such. They just run continuously on the codebase based on their rules and triggers.
The idea is that all these changes bubble up to the developer for approval, just maybe after a few rounds of LLM iteration. The hope is that small models can be leveraged to a higher quality of output and operate in an asynchronous manner.
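A minimal sketch of just the detection half of that docstring agent, assuming Python sources; the LLM call and the branch/merge-request plumbing are deliberately left out:

```python
# Sketch of the docstring mini-agent's trigger step: find public functions and
# classes that lack docstrings, so a small model can be asked to write them.
# The LLM call and the branch/merge-request plumbing are omitted here.
import ast

def missing_docstrings(source: str) -> list[str]:
    """Return names of public defs/classes in `source` without a docstring."""
    offenders = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if ast.get_docstring(node) is None and not node.name.startswith("_"):
                offenders.append(f"{node.name} (line {node.lineno})")
    return offenders

# A commit hook would feed each offender plus its source to the small model,
# apply the suggested docstring on a branch, and open the merge request.
print(missing_docstrings("def add(a, b):\n    return a + b\n"))
```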
My assumption lately is that this workflow is literally just “it works, so merge”. Running multiple in parallel does not allow for inspection of the code, just for testing functional requirements at the end.
Hmm, I haven’t managed to make it work yet, and I’ve tried. The best I can manage is three completely separate projects, and they all get only divided attention (which is often good enough these days).
Do you feel you get a faster/better end result than focusing on a single task at a time?
I can’t help but feel it’s like texting and driving, where people are overvaluing their ability to function with reduced focus. But obviously I have zero data to back that up.
Rather than having multiple agents running inside of one IDE window, I structure my codebase in a way that is somewhat siloed to facilitate development by multiple agents. This is an obvious and common pattern when you have a front-end and a back-end. Super easy to just open up those directories of the repository in separate environments and have them work in their own siloed space.
Then I take it a step further and create core libraries that are structured like standalone packages and are architected like third-party libraries with their own documentation and public API, which gives clear boundaries of responsibility.
Then the only somewhat manual step you have is to copy/paste the agent's notes of the changes that they made so that dependent systems can integrate them.
I find this to be way more sustainable than spawning multiple agents on a single codebase and then having to rectify merge conflicts between them as each task is completed; it's not unlike traditional software development where a branch that needs review contains some general functionality that would be beneficial to another branch and then you're left either cherry-picking a commit, sharing it between PRs, or lumping your PRs together.
Depending on the project I might have 6-10 IDE sessions. Each agent has its own history then and anything to do with running test harnesses or CLI interactions gets managed on that instance as well.
Even with the best agent in plan mode, there can be communication problems, style mismatches, untested code, incorrect assumptions and code that is not DRY.
I prefer to use a single agent without pauses and catch errors in real time.
Multiple agent people must be using pauses, switching between agents and checking every result.
I think this is the UX challenge of this era: how to design a piece of software that aids in sustaining human-level attention across a distributed state without causing information loss or cognitive decline over many tasks. I agree that for any larger piece of work with significant scope, the overhead of ingesting the context into your brain offsets the time savings you get from multitask promises.
My take on this is that as these things get better, we will eventually be able to infer and quantify signals that provide high-confidence scores, letting us conduct a better review with a shorter decision path. This is akin to how compilers, parsers, and linters can give you some level of safety without strong guarantees but are often "good enough" to pass a smell test.
No... I've found the opposite: using the fastest model to do the smallest pieces is useful, and anything where I have to wait 2m for a wrong answer is just in the way.
There's pretty much no way anyone context switching that fast is paying a lick of attention. They may be having fun, like scrolling tiktok or playing a videogame, just piling on stimuli, but I don't believe they're getting anything done. It's plausible they're smarter than me; it is not plausible they have a totally different kind of brain chemistry.
The parallel agent model is better for when you know the high level task you want to accomplish but the coding might take a long time. You can split it up in your head “we need to add this api to the api spec” “we need to add this thing to the controller layer” etc. and then you use parallel agents to edit just the specific files you’re working on.
So instead of interactively making one agent do a large task you make small agents do the coding while you focus on the design.
My context window is small. It's hard enough keeping track of one timeline, I just don't see the appeal in running multiple agents. I can't really keep up.
For some things it's helpful, like having one agent plan changes / get exact file paths, another agent implement changes, another agent review the PR, etc. The context window being small is the point, I think. Chaining agents lets you break up the work, and also give different agents different toolsets so they aren't all taking a ton of MCPs / Claude Skills into context at once.
Right. A computer can make more code than a human can review. So, forget about the universe where you ever review code. You have to shift to almost a QA person and ignore all code and just validate the output. When it is suggested that you as a programmer will disappear, this is what they mean.
>You have to shift to almost a QA person and ignore all code and just validate the output.
The obvious answer to this is that it is not feasible to retry each past validation for each new change, which is why we have testing in the first place. Then you’re back at square one because your test writing ability limits your output.
Unless you plan on also vibe-coding the tests and treating the whole job as a black box, in which case we might as well just head for the bunkers.
Yes, that is exactly what I mean. You ask the Wizard of Oz for something, and you hear some sounds behind the curtain, and you get something back. Validate that, and if necessary, ask Oz to try again.
"The obvious answer to this is that it is not feasible to retry each past validation for each new change"
It is reasonably feasible because the jobs of Product Development and QA have existed; developers just sat in the middle. Now we remove the developer and move them over to the role of combined Product + QA, and all Product + QA was ever able to validate was developer output (which, as far as they were ever concerned, was an actual black box, since they don't know how to program).
The developer disappears when they are made to disappear or decide to disappear. If the developer begins articulating ideas in language like a product developer, and then validates like a QA engineer, then the developer has "decided" to disappear. Other developers will be told to disappear.
The existential threat to the developer is not when the company mandate comes down that you are to be a "Prompt Engineer" now; it is when the mandate comes down that you need to be a Product Designer now (as in, you are mandated not to write a single. line. of. code.). In which case vast swaths of developers will not cut it on a pure talent level.
You haven’t addressed the original question. The point is not whether the QA understands the codebase, but whether the QA understands its own test system.
If yes, the QA is manual-ish (taking manual == not automated by AI) and we’re still bottlenecked, so speeding up the engineer was for nothing.
If no, because QA is also AI, then you have a product with no human eyes on it being tested by another system with no human eyes on it. So effectively nobody knows what it does.
If you think LLMs are anywhere near that level of trust, I don’t know what you’re smoking. They’re still doing things like “fixing” tests by removing relevant non-passing cases every day.
I think for production code this is wildly irresponsible. I’m having a decent time with LLM code generation, but I wouldn’t dream of skipping code review.
I'm with you. The industry has pivoted from building tools that help you code to selling the fantasy that you won't have to. They don't care about the reality of the review bottleneck; they care about shipping features that look like 'the future' to sell more seats.
I have to agree, currently it doesn't look that innovative. I would rather want parallel agents working on the same task, orchestrated in some way to get the best result possible. Perhaps using IntelliJ for code insights, validation, refactoring, debugging, etc.
Completely agree. The review burden and context switching I need to do from even having two running at once is too much, and using one is already pretty good (except when it’s not).
I think the problem is that current AI models are slow to generate tokens, so the obvious solution is 'parallelism'. If they could poop out pages of code instantly, nobody would think about parallel agents.
I wish we'll get a model that's not necessarily intelligent, but at least competent at following instructions and is very fast.
I overwhelmingly prefer the workflow where I have an idea for a change and the AI implements it (or pushes back, or does it in an unexpected way) - that way I still have a general idea of what's going on with the code.
I am extremely excited to use programmatic tool use. This has, to date, been the most frustrating aspect of MCP-style tools for me: if some analysis requires the LLM to first fetch data and then write code to analyze it, the LLM is forced to manually copy a representation of the data into its interpreter.
Programmatic tool use feels like the way it always should have worked, and where agents seem to be going more broadly: acting within sandboxed VMs with a mix of custom code and programmatic interfaces to external services. This is a clear improvement over the LangChain-style Rube Goldberg machines that we dealt with last year.
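To make that concrete, a hedged sketch (every name here, like fetch_sales, is made up for illustration, not any vendor's actual API): the tool is exposed as a function inside the sandbox, so the generated code consumes its output directly instead of the model pasting it back through the transcript.

```python
# Hypothetical illustration of programmatic tool use: the tool is just a
# function inside the model's sandboxed interpreter, so data flows from
# fetch to analysis without being copied through the conversation.
# fetch_sales() is a stand-in for an external data tool; nothing here is a real API.
from statistics import mean

def fetch_sales(region: str) -> list[float]:
    """Placeholder tool call returning raw data directly into the sandbox."""
    return [1200.0, 950.5, 1430.25]

# The model writes ordinary code against the tool instead of round-tripping
# the data through its context window:
sales = fetch_sales("emea")
print({"count": len(sales), "mean": round(mean(sales), 2), "max": max(sales)})
```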
smolagents by Hugging Face tackles your issues with MCP tools.
They added support for the output schema and structured output provided by the latest MCP spec.
This way, the print-and-inspect step is no longer necessary.
https://huggingface.co/blog/llchahn/ai-agents-output-schema
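For anyone who hasn't tried it, the basic smolagents flow looks roughly like this; a sketch only, since the exported class names (e.g. InferenceClientModel) have shifted between versions, so check the current docs:

```python
# Rough sketch of a smolagents CodeAgent: the agent writes Python that calls
# tools directly, so outputs never need to be hand-copied between tool calls.
# Class names have changed across versions; treat this as illustrative.
from smolagents import CodeAgent, InferenceClientModel, tool

@tool
def word_count(text: str) -> int:
    """Count the words in a piece of text.

    Args:
        text: The text whose words should be counted.
    """
    return len(text.split())

agent = CodeAgent(tools=[word_count], model=InferenceClientModel())
print(agent.run("How many words are in 'structured output is neat'?"))
```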
I built an MCP server that solves this, actually. It works like a tool-calling proxy that calls child servers, but instead of serving them up as direct tool calls, it exposes them as TypeScript definitions, asks your LLM to write code to invoke them all together, and then executes that TypeScript in a restricted VM to do the tool calling indirectly. If you have tools that pass data between each other or need some kind of parsing or manipulation of output, like when a tool call returns JSON, it's trivial to transform it. https://github.com/zbowling/mcpcodeserver