OK, if "toy project" isn't the right word, then perhaps, "unethical" or "exclusionary" would be better words to use.
I judge software harshly that could be useful to folks with accessibility needs that don't try to address it (within bounds of their resources and capabilities, obviously lots of OSS just doesn't have the ability to deliver an accessible experience for tiny little throwaway apps). I definitely choose technologies to use based on whether they can be accessible with a little extra effort on my part. I'm not necessarily good at it, it's a complicated topic, but when I get bug reports about an accessibility issue I tend to drop everything else and try to fix it.
I guess a lot of folks consider games exclusively for folks without those accessibility needs, so maybe that's why something like Dear ImGui can live for years in thousands of projects without anyone complaining about accessibility. But, I wouldn't consider it for anything that isn't specifically about graphics and I don't think anyone else should either. (No one has to listen to me, but I think less of them.)
> I guess a lot of folks consider games exclusively for folks without those accessibility needs, so maybe that's why something like Dear ImGui can live for years in thousands of projects without anyone complaining about accessibility. But, I wouldn't consider it for anything that isn't specifically about graphics and I don't think anyone else should either. (No one has to listen to me, but I think less of them.)
Immediate mode UIs are mostly for debug menus, not even gameplay/graphics. It doesn't need to be accessible to anyone except for the developer(s) choosing the library and making the game. (If the developer has different needs, obviously they can choose another library, unlike users who must live with the developer's choice.) The fourth sentence in the linked ImGUI repository explains this intention very clearly.
You can spend all this energy imagining malice and thinking less of others, but doing so does not add merit to your critique. Nor does it advance the cause of software accessibility.
Gemma 4 is competitive with Qwen 3.6. I had vague feelings that Qwen was better at coding tasks, based on anecdotes and public benchmarks, but I've been doing some benchmarking lately, and Gemma 4 31b is consistently beating Qwen 3.6 at the really hard stuff (finding hard security bugs, vision tasks for fixing UI layout or categorizing assets, in particular..and for vision, nothing self-hostable beats Gemma 4 12b, including 31b).
I'm still hoping for a bigger Gemma 4 version, but I think they may be worried about competing with their own hosted models, since Gemma 4 is already better than a lot of Google's proprietary models that are still available in AI Studio.
But, it is a shame that Qwen probably won't be doing more open models going forward. It is really strong for its size.
I added it to my benchmark based on Mythos-reported bugs, and it's better than GLM 5.1, but still behind several other models, maybe most directly comparable to Qwen 3.7 Max. But, several other open models, including small self-hostable ones (Gemma 4 and Qwen 3.6), found the same number of bugs, 3 of 9. Though it also gets partial credit for reporting one bug in the right spot, but kinda misunderstanding the bug. I also added Kimi K2.7-code in the same run, and it did poorly, consistent with 2.6 performance. Anyway, there are better, cheaper, models on this particular benchmark.
(This small benchmark doesn't prove anything. It's a limited data set and each model only gets one shot at each file in the corpus. But, I find it useful for quickly sussing out if a model can reason about pretty complicated problems in code.)
I added this to my benchmark of models looking for Mythos-reported security bugs. Unsurprisingly, it found 0. There is, after all, a lower bound on how small a model can be and still find security bugs. https://swelljoe.com/post/will-it-mythos/
It can seemingly reliably write working Python code though, which is impressive for such a little guy.
Everyone I work with who used Cursor stopped using Cursor when Claude Code came along. They're back to their regular IDE when the need to read code, or they just review it at PR time. I never used Cursor, but Zed is my favorite editor with an agent. It can use Claude Code, among other CLIs, via ACP, so you can use rolling subscription tokens, or it can use OpenRouter or others if you want a broad spectrum of models. And it's crazy fast. It used to be that Copilot Pro was the best deal on agentic coding with several models from several vendors available, but they've really nerfed it, with uselessly restrictive token budgets and only older models are now available from the major labs. These days, might as well just have a Claude or Codex subscription and use the CLI with ACP in whatever editor you prefer.
Most successful scam in history. In awe at the gullibility of the entirety of investment/business media. (And, concerned for the sake of working class folks retirement accounts that are bound to get sucked up into Musk's maelstrom of bullshit, because of the fast track rule change in the Nasdaq 100 and Russell 1000.)
4.7 and 4.8 perform better than 4.6, so why is someone ranting about it being killed? And, Anthropic has 2500 employees, several of whom are higher up on the corporate hierarchy than "the woman who killed Claude". If someone is to blame for some change that happened, the buck doesn't stop with that woman.
So, I'm not reading all that. The man that complained about the woman who killed his AI girlfriend (or whatever he thinks she did) probably doesn't have any opinions I'm interested in.
I am not arguing with a machine. You sound like a crazy person, when you say you are winning an argument with Claude. Claude is not my friend, I don't need it to agree with me, I don't need it to like me (it cannot like or dislike me). I give it instructions or ask it to explain things. That is the sum total of my interaction with Claude. A machine cannot "argue" with me, it doesn't want anything nor does it have beliefs or experiences.
>I give it instructions or ask it to explain things.
And the author's point is that Claude Fable+ is turning those increasingly into arguments, instead of merely following them and being helpful.
>A machine cannot "argue" with me, it doesn't want anything nor does it have beliefs or experiences.
Who cares if the argument is informed by some felt experiences or lived state or not? That's for the philosophers.
If Claude is writing out combative and argumentative responses that's enough to call it "an argument". And that's the problem the author describes. Not whether it's a "real" argument, or a simulated one.
In that sense, and for all intends and purposes, the machine can still argue just fine, since it's programmed to mimick interaction as if it HAD those beliefs and experiences. Same way it can write a poem about love, despite not having loved, or code, despite never having had used a computer. That's basically what it was made for: to act as an conscious person.
After watching Legal Eagle, I asked a legal-ish questions about the Bricks and Minifigs case. Claude was outdated about the case and gave me some outdated info, so I tried to update it with the info I just saw online.
I updated by telling it I saw something in a LegalEagle video. It proceeded to tell me the video doesn't exist and I was hallucinating it, in a quite combative manner.
I provided a link and it insisted it didn't exist, with a quite verbose answer, once again very combative and arguing that I was talking in bad faith.
I provided a transcription from Youtube and it backtracked a bit but said I should have provided a transcription at the beginning of the conversation, since I knew the video existed.
I didn't say much to it, just a few sentences like "video is here: <youtube link>" and "I got its transcription: <pasted text>".
You're misunderstanding what these models do. It is a limitation of LLMs. They don't have memory, they do not learn, they cannot learn. The sooner you let go of your desire to have them learn or remember anything, the sooner you will achieve enlightenment (or, just a peaceful life where there is no possibility of getting into an argument with a machine).
If you want it to synthesize information that is not in its training data (from a few months ago), you can ask it to research the topic. But, arguing with an LLM is like putting lipstick on a pig. Only the machine is incapable of becoming annoyed. It has infinite patience to continue being wrong forever.
Your mental model of what Claude is and does is the problem here. Short of a revolutionary breakthrough in AI techniques, the LLMs will continue to do matrix math across a huge bunch of weights that cannot change based on anything you say.
This is also a change in specifically Opus 4.8 / perhaps Fable 5 (I didn't really get enough of a baseline to see it there as much), where it's much more skeptical. For my purposes, this is fabulous - one of my pat addendums to most prompts is "challenge my assumptions and check the evidence empirically", and boy does it.
They did not misunderstand anything. All of the behaviour is not inherent in raw base model and has been planted by the agressive, secretive reinforcement learning they do for benchmaxxing, "safety" and all other things. Claude begins any other sentence with "honestly". That is not how LLMs work, that is how they work after being RLed to the brink.
>Your mental model of what Claude is and does is the problem here. Short of a revolutionary breakthrough in AI techniques, the LLMs will continue to do matrix math across a huge bunch of weights that cannot change based on anything you say.
Sorry, but your mental model is wrong.
LLMs do matrix math across "a huge bunch of weights that cannot change based on anything you say", but the matrix math and results are informed (key concept here) by what you said, including the memory of what you said earlier in the discussion (and in some setups, even across discussions).
That's what a bloody prompt does.
It's entirely logic for the parent to want the LLM's matrix math + model + internal prompt, to accepts its prompt about LegalEagle and work with that, instead of arguing and giving him shit about it.
Especially since the earlier version of the model consistently worked like he wanted, and the new one consistently doesn't. He's not asking for some new unforeseen capability unknown to LLMs.
You need to think this thought through all the way to the end. What it has said also influences what it will say. If it has consistently made combative responses, then the most likely thing to do is to continue to be combative.
I don't think there is any way back after the conversation takes a turn like that so there is no point in arguing with it. The only thing you can do is to fork the conversation before it made the first mistake and give it more context or tell it to look things up.
> "The only thing you can do is to fork the conversation before it made the first mistake and give it more context or tell it to look things up."
This is a key detail that many folks don't seem to understand about LLMs in general. The generation of a response happens based on the model weights and the context window (the system prompt + everything it's fed about the conversation thus far + any additional data included as part of the overall prompt). Each response technically stands alone and is generated entirely from only that context given to it and the model's existing "token space" weights. The illusion of an ongoing conversation is maintained "behind the scenes" by keeping that "context window" updated with the current state of the conversation as context for the next prompt, but the next response is technically an entirely new generation of text.
What it all means in a TL;DR sense is that the fix for a refusal is not to continue the "argument", but simply to remove that entire interaction from the conversation entirely as if it never happened and try a different tack with new / updated / more complete context to get the response you're expecting / seeking.
But unless you're using the API, it's not just a model.
I asked Gemini Flash 3.5 through the Gemini app something that followed a similar pattern. I asked about something, it replied with outdated info, I said that's outdated, it did a web search and apologized for being wrong, then proceeded to give me good info.
That wasn't just a bare model, that was a model wrapped in a harness, driving the model and allowing for web searches for example.
GPT in Codex is even more aggressive, I often see it proactively do web searches to ensure it's not feeding me wrong info.
You seem to be making a lot of assumptions about how I interacted in the messages to Claude.
You also seem to be making a lot of assumptions about my understanding of the models, especially considering I just told a story :)
I never said anywhere I want it to learn or remember, or that I argued with it.
I just provided additional information to it (in the form of a dozen or so words, tops, per message) and it accused me of hallucinating and trying to gaslight it.
My messages never went beyond a dozen words or so.
No, I mean the actual prompt and its output. "I said this and it did that" is just a recall of your own memory, not an example. I don't want to argue with you, I'm interested in real stuff.
These machines do not think and they do not have a mind. We may build such a thing in the future but these do not possess those qualities. It seems as if the majority of people do not understand this, which is why the public is so confused about why they produce output like they do.
>These machines do not think and they do not have a mind
Well, they do think, in that they produce output that is indistinguisable from thinking. If a person produced the same output to the same questions, we'd considered them thinking, maybe dumb sometimes, or paranoid at others, but still a thinking person.
We can argue about the quality and depth of the thinking that LLMs do (and we can say it's much cruder than a human thinking architecture, and of course not real time), but an LLM quacks like a thinking duck and looks like a thinking duck.
Indistinguishable output does not mean thinking occurred. It simply means you have the appearance of thinking. I believe thinking requires agency, which the LLM does not possess. As in, it has zero stakes.
It does not receive dopamine as a result for a good answer, and a split second after finishing your answer the very same GPU is probably translated french or something for someone in another state. This is a language generator which has a corpus of information and has been tuned to appear correct.
What then is your LLM "thinking" about between answers? The answer is nothing. Your definition of thinking does not match the one humans normally use.
>That how we know another person is thinking too. By their output. We don't put a debugger into their brain.
We know thoughts exist in their brain between the ones they choose to verbalize. Avoiding the distraction of solipsism.
For the LLM the "thinking" phase is just a preamble output for creating the answer. It just gets appended to the context window. Remove the context windows from your models and you will see how much of a mind they truly have. None.
>What then is your LLM "thinking" about between answers? The answer is nothing.
Between answer it's thinking something else, somebody else asked :) You think that hardware sits idle?
That aside, what is a human thinking while unconscious? Does having been unconscious (e.g. for an operation, or fainting or whatever) means somebody doesn't think in general?
>We know thoughts exist in their brain between the ones the choose to verbalize
And we also know that if we run an LLM in a loop, didn't give it a cutoff for stopping their output, and didn't force it to print everything in the end, thoughts would exist in their "brain" too between the ones they chose to verbalize.
In fact, that's exactly how some LLMs in "thinking mode" appear.
It's really just all mathematics and physics. There's no metaphysical anything about LLMs or how they do what they do. It's all just a bunch of fancy math "behind the curtain". An LLM can actually explain a lot of how it works "under the hood" if you ask it just the right questions in just the right ways. ;)
>There's no metaphysical anything about LLMs or how they do what they do. It's all just a bunch of fancy math "behind the curtain".
That's my point, but about the human brain as well. It's just a bunch of fancy math, just ones expressed with chemicals and electrical activations instead of, well, logic gates and electrical activations.
Well, I mean... Yes and no? An LLM doesn't really "think", and what mathematical fakery it does pass off as "thinking" stops the instant the text completion request finishes doing all it's math and outputting the results (as a text completion based on a simulation of a text chat most commonly). When you send it another comment or question, it starts all that math all over again, but with your new question or comment added into it's context window. It's kinda like instant amnesia each time, and behind the scenes, the software that's running the model refills it's "memory" and adds in anything new that's been added since the last prompt. But it's "memory" consists of only the "context window" it's able to handle plus the model "weights" (huge list of numbers that encode language "tokens" into a mathematical "vector space"). It never really learns anything new.
A human brain on the other hand is constantly processing 24/7 (even while you sleep), and always learning / changing until the day it dies. An LLM never changes (under the hood it's weights stay the same) unless you outright alter it's weights somehow (training / download an updated version of the model / etc). If you could somehow get an LLM to run constantly, in training mode, and give it ridiculous amounts of RAM and ultra-fast storage, and a series of fancy realtime inputs (audio, camera, etc) and maybe wheels so it could explore, and hands so it could do stuff, and access to it's own code so it could improve itself, it might eventually learn to closely approximate a really good simulation of actual thinking, but that's a bit of a scary road to go down. So many Sci-Fi movies and books end up going so very badly when the lead character starts playing in that particular sandbox. I doubt reality would go a whole lot better. ;)
On an electroencephalogram we basically see signals moving around in different brain regions. We have no way to probe actual thought or consciousness in themselves.
That's only because we hardcoded their weights in our implementation.
Aside from the cost, nothing about an LLM prevents feeding recent stimuli in and using it to update the models/retrain.
One can even do it in a makeshift way without modifying the weights, just keeping a complete version of any prompt + vector search on disk memory of it.
The point of the article stands: if providing more info than the model can access causes it to turn argumentative and refuse to comply, then it's a worse performance and a waste of money.
The comment you’re replying to never implied that they think or have a mind. They merely stated that they respond in a dismissive way and not following instructions.
Basically the complaint is about how Claude is being trained.
> "These machines do not think and they do not have a mind."
You're so totally 1000% right about that, but they're really good at faking it, to such a degree that entirely too many people (even including some so-called "experts" in the field) have been utterly fooled by the mathematical "trickery" that performs the illusion of "intelligence".
I think these models have been trained to not accept 'new facts', so they don't take in user input (or the far more problematic search engine, untrusted tool input) and have that change their world view.
However, that doesn't apply when they are told to roleplay a scenario, so its easier to get it to accept and create output with the idea that this true fact you've seen is part of a fictional scenario, than for it to output the same words within the context of the fact being real.
As an aside, I don't that I have to personify AI in explanations and that all discussions revolve around anecdotes, but I only know enough about the maths behind it to be dangerous, not useful. Does anyone else feel this way?
I've seen exactly this behavior on claude.com with no system prompt with Opus 4.8 specifically, especially around chronic illness stuff where there's established mainstream medicine dogma and reddit / internet communities with alternate causality theories and treatment approaches (PMDD and MCAS-adjacent illness). 4.6 is happy to analyze and consider them, 4.8 really doesn't like the alternate theories and treatments.
> programmed to mimick interaction as if it HAD those beliefs and experiences
We spend far too much time debating the essential nature of consciousness when it doesn't matter if it's real (whatever that means) or simulated.
I get far better results in my projects by encouraging the model to argue, to push back, to poke holes in the design, to think creatively about corner cases, to be a devil's advocate, to do lateral web search to find alternatives, to challenge assumptions, to passionately advocate for what it believes is right.
But I don't want to engage all these assholes myself, so I spin them all up as critic subagents with another subagent to listen patiently and be the judge/arbiter.
If I have to choose between sycophancy and assholery, I think assholery gets far better results.
It's a marketplace of ideas where I don't have to suffer through all the unpleasant and overly confident know-it-alls.
> "I get far better results in my projects by encouraging the model to argue, to push back, to poke holes in the design, to think creatively about corner cases, to be a devil's advocate, to do lateral web search to find alternatives, to challenge assumptions, to passionately advocate for what it believes is right."
> "But I don't want to engage all these assholes myself, so I spin them all up as critic subagents with another subagent to listen patiently and be the judge/arbiter."
This is the way...
No, seriously. That "sycophancy" you mention immediately after this part drove me nuts before I really understood how these things work (it's taken me a while and a lot of [painful; I hate math] research, but well worth the learning effort), but after a better understanding of the "nuts and bolts" of it all, it's fairly easy to get exactly the kinda results one should expect outta these things. If not, then "you're just holding the tool wrong". ;)
I have never gotten a response from Claude that is anything other than blandly polite, including with Fable, which makes me assume that anyone finding themself getting argumentative responses is doing something very weird.
> If Claude is writing out combative and argumentative responses that's enough to call it "an argument".
That also sounds crazy. I've never seen it become combative or argumentative. It is just a bland sort of polite about everything I've ever asked or told it to do. But, even if it disagrees with me, WTF do I care? It's a machine. Its opinions are irrelevant to me. It can talk about the world's information and teach me about all sorts of things, and that's wonderful, but it doesn't get a vote in what I'm doing, and it's never avoided actually implementing anything I've ever asked of it. I feel like there's a whole world of ways people are using AI that are entirely foreign to me. And, while I'm hesitant to just say, "those people are wrong", I kinda want to say, "those people are wrong". What kinda freak shit are y'all getting up to that Claude is going, "now hold on a minute there, buddy."
I have managed to make self-hosted Qwen 3.6 get combative, though, when asked about Uyghurs. And, I guess Fable is intentionally broken for security work, which is a shame. But, even there, I'm not going to try to argue with it. Anthropic says they don't want my money for doing security work with Fable, so I guess I won't give it to them. I'm not going to argue with a damned machine about it.
The only point of "arguing" with an LLM is wholly for your own benefit, e.g. to check your biases or assumptions. But since they are easy to make turn around on their own statements it has limited utility.
Unless you are sparring with the Chipotle customer service bot trying to score a free burrito or something.
With 4.8 Claude has begun refusing to ground, leaking destabilizing injections into the web interface (in XML for some reason), and being generally argumentative.
By arguing he means trying to get a result that 4.6 just did and it was fun. You have to laboriously re-align 4.8 over incredibly dumb shit, especially if you're working on AI. And it's not meaningfully better at anything, the distribution is perturbed but net , net it's just shrinkflation.
It's basically identical to when GPT 5.1 went full corpo shill, something about the RLHF gradient necessary to do whatever IPO adjacent manipulation they need makes these things nasty and argumentative in general.
The problem the article is about is that suddenly even those of us who refuse to argue with a machine are being dragged into it.
I've had simple prompt engineering tasks that cause 4.8 to clamp down. In the past "browbeating" it (read: a sentence telling it not to read the task in bad faith) was enough.
Now it digs in and starts ranting about why it won't capitulate, I'm actually wrong, etc.
Extremely frustrating, and it became a problem with Opus 4.7 because they're trying to make up for the downgrade in parameter count with more RL, but RL does relatively poorly with non-trivially verified things like nuance in instructions.
I'm staying in a hotel right now and the TV is locked in hospitality mode and was blocking me from just installing Plex. It (Opus 4.8) gave me this whole jeremiad about how I need to be careful and it probably won't work and I should just watch on my laptop, but it did give me the service menu code. But man, it was such a downer.
Gemini gave it and clearly explained how best to get in, and then troubleshooted a few other weird issues that cropped up, without the moralizing.
My system prompt tells it to first challenge my assumptions, and to feel free to be a dick about it where it thinks I'm off on something, or have assumed facts that aren't actually facts. I sometimes wonder how much of my total spend boils down to forcing LLMs to argue with me, but I do feel like it's yielded better outputs than letting it implement things incorrectly because I told it to.
It's a completely dispassionate exchange tho, because you're absolutely right -- there's no winning or losing here, there's only efficiency to be gained or lost, and I'd prefer to lose some up front to gain it back later than the other way around.
It _can_ be tedious those 9 times, or especially when it pushes back on something that it thinks is wrong but isn't wrong but it actually has nothing to do with the issue at hand.
But yeah, overall I'm fairly certain that it saves me more significantly more time than it wastes.
I used Fable a lot in the brief time it was available. It did seem to want to push back on some of my instructions, but it was easy to say “I’ve decided we’re doing this” and that was the end of it.
I could see how some people would be offended by another party even questioning anything they say. For people who have come to view Claude as an another human conversation partner this questioning can be aggravating. For these people I suggest utilizing the features to set your own prompt instructions. If you want an unquestioning yes-man you can have it with a few sentences added to your system prompt.
I would also suggest learning to not humanize the LLM. It’s just words chained together. There is no social order to establish and no offense to be taken. Nothing is a “confrontation”. Just tell it what to do and move on.
There have been may times where AI takes a position based on limited criteria and defended it tooth and nail, where I have had to outline additional relevant criteria/details and push it hard to include that information and reformulate it's position. You very much need to critically argue with it as it's pretty dumb and intellectually lazy by default (after all it just regurgitates it doesn't formulate).
So you take action and put in more effort to cater to the LLM to get it to do what you want, but it's not arguing because there's no record of it in the chat? Presumably you put in what you would have written in the counter-argument into the new chat, just ahead of the LLM refusal? And this isn't arguing?
> but it's not arguing because there's no record of it in the chat?
Yes? Arguing implies I have to convince someone to believe something. I don't think anyone would consider it winning an argument if you do so by causing amnesia.
My job is to get work done, not argue with an LLM, if it refuses twice, it is time for a /clear.
100% of the time, the issue is resolved after a /clear.
It often start going into circles when you have the chat open for medium-long, and starts getting even easily-verifiable tasks wrong, cutting corners, hallucinating APIs, things like that.
Cleaning the prompt and starting from scratch often does the trick.
Of course someone will arrive and say the problem is my CLAUDE.md or whatever it is.
I agree that never having the argument take place textually is important for LLM performance and behavior. I still think we’re investing the same time and intellectual energy arguing with the model, in going back and restructuring context and prompting to head off / pre-answer a refusal.
Right but the difference is there is inertia you have to fight in an argument. By using /clear you remove all of the context that has built up to energize the argument from the LLM's side.
Look at it this way. I can either, keep trying to poke holes in the LLM's context with more prompts with no real guarantee that it won't be enough to remove the argument inertia that has built up in context on its side, or I can /clear and it is over in one turn because the inertia for the argument is all gone.
Back when I first started working with coding agents last year I fell into this arguing with the LLMs trap. I've found that it is a total waste of time because /clear ends the argument immediately. You don't even need to spend time trying to preempt it's views. Just re-prompt and 100% of the time, the LLM will just do the work.
It's incredibly funny that a large chunk of the messages here are "you need to argue or you're doing it wrong" and another large chunk is "I stopped reading, OP is an idiot for arguing".
People are polarised about how you should talk to a machine !!!
How difficult it is to resist "someone is wrong on the internet" is a perennial joke. Turns out it doesn't really matter who/what is on the other side if they seem human-like.
That the AIs where trained on what humans wrote on the internet forms is increasingly sowing as they incresingly mirror all the bad things which are so common on such forums, like:
- non stop, non productive discussions
- gaslighting
- valuing "winning the argument" over correctness
- ignoring of context/ignoring the actual questions/instructions etc.
- bad faith argumentation methods
- etc.
the problem is in a forum you can just decide to ignore "most users", but LLMs tend to copy "most users" more then "a few high quality answers" and you have only one per model type more or less...
If you don't have the capacity to have your mind changed through friction and disagreement with a SOTA LLM and feel compelled to frame those who do to through absurdly reductive statement like "insane arguing with a machine" then that says more about your limitation and lack of understanding than the OP's or Claudes.
> A machine cannot "argue" with me, it doesn't want anything nor does it have beliefs or experiences.
Yup I thought that too when reading TFA but then...
It gets really tiring when you see it making glaringly obvious mistakes which you point out because you don't want it to keep making the same mistakes only to be met with an answer that begins with "The point is ...".
I'm not shitting you: Anthropic models shall happily begin a sentence with "The point is ...", when it's not the point and it's just wrong.
Now, to me it's not an issue in that I can change its tone (if anything I can ask another LLM to rewrite me not the code but the english sentences any model spouts out to something nicer) but it is an issue in that you lose time: you just want it to acknowledge its errors so that it stops doing them.
That this thing "argues" (even if we know it doesn't argue) is representative of the fact that it is wrong and refuses to "admit" it (by that I mean: do not consider it important and hence shall keep making the same kind of mistakes).
Once it's in this loop, Opus 4.8 digs in so aggressively it's structurally incapable of conceding a provided detail as correct, even if it's conceded and agreed with everything backing that detail. Like actually, structurally incapable. I've even baited it into arguing with itself when I've "conceded" its original concern tolling hard, and then the model needs to continue to be the "voice of reason" and it will argue against its original concern because I, the user, said it.
Opus in recent versions is fine beyond 100k, but I usually do try to keep it under 200k.
But, this is also why so-called "memory" systems are usually a mistake that make the models dumber. They don't have memory, they only have context, and every irrelevant fact you shove into the context is less context for the problem. Less distractions, better results.
The way to have the agent remember things is to have it document its work, like a human developer would do if they wanted their project to be friendly to other developers working on it. Good developer docs with an index page and a good plan with checklists, in concise Markdown files, checked in to the repo is the ideal memory for models and the ideal docs you need to figure out WTF the model has been up to. Helps with code review, too, whether by humans or another model. There's no down side.
At least for me, Opus keeps writing stuff to memories, only to consistently forget checking those memories before doing the same mistake again. This ("remember to check memories!") is of course then again written as a memory... Clearly not a very well working system, yep.
Yeah, I see it write stuff to memory pretty regularly, maybe it works sometimes, but for things I want it to stop doing or always do, I make it impossible to do otherwise via lint or some style enforcement, or via a test that fails if code shows up that violates the constraint.
But, it does a good job following existing conventions in a codebase, as long as they're really consistent. So the more actively you enforce that consistency the more likely it is to do the right thing without memories or prompting.
I don't like "never do" or "always do" type rules in AGENTS.md or in memory, as it often over-interprets them and ties itself in knots trying to satisfy an impossible set of goals.
In my own multi agent framework I use cheap models to check the responses of the expensive models, as well as using multiple expensive models adversarially in debate. The cheap models are great at spotting eg the model getting stuck in the alternate between two broken ideas or not following code conventions or missing a step in the skill and so on. I’m currently working on making them detect user corrections and police that going forward to intervene when the expensive models forget the thing you just corrected them about etc.
I've explicitly banned Opus from creating memories unprompted, as it would often save info that's incorrect and which would then be propagated to future sessions until caught. Ugh x 10.
I judge software harshly that could be useful to folks with accessibility needs that don't try to address it (within bounds of their resources and capabilities, obviously lots of OSS just doesn't have the ability to deliver an accessible experience for tiny little throwaway apps). I definitely choose technologies to use based on whether they can be accessible with a little extra effort on my part. I'm not necessarily good at it, it's a complicated topic, but when I get bug reports about an accessibility issue I tend to drop everything else and try to fix it.
I guess a lot of folks consider games exclusively for folks without those accessibility needs, so maybe that's why something like Dear ImGui can live for years in thousands of projects without anyone complaining about accessibility. But, I wouldn't consider it for anything that isn't specifically about graphics and I don't think anyone else should either. (No one has to listen to me, but I think less of them.)
reply