A professional tool is something that provides reliable and replicable results. LLMs offer none of this, and A/B testing is just further proof.
The author's complaint doesn't really have anything to do with the LLM aspect of it though. They're complaining that the app silently changes what it's doing. In this case it's the injection of a prompt in a specific mode, but it could be anything really. Companies could use A/B tests on users to make Photoshop silently change the hue a user selects to be a little brighter, or Word could change the look of document titles, or a game could make enemies a bit stronger (fyi, this does actually happen - players get boosts on their first few rounds in online games to stop them being put off playing).
The complaint is about A/B tests with no visible warnings, not AI.
There's a distinction worth making here. A/B testing the interface (button placement, hue of a UI element, title styling) is one thing. But you wouldn't accept Photoshop silently changing your #000000 to #333333 in the actual file. That's your output, not the UI around it. That's what LLMs do. The randomness isn't in the wrapper, it's in the result you take away.
It’s an assistant, answering your question and running some errands for you. If you give it blind permission to do a task, then you’re not worrying about what it does.
That's not what they're doing; they're trying to use plan mode to plan out a task. I don't know where you got the idea that they were blindly doing anything.
Honestly I find it kind of surprising that anyone finds this surprising. This is standard practice for proprietary software. LLMs are very much not replicable anyway.
This is in no way standard practice for proprietary software, WTF is with you dystopian weirdos trying to gaslight people? Adobe's suite incl. Photoshop does not do this, Microsoft Office incl. Excel does not do this, professional video editing software does not do this, professional music production software does not do this, game engines do not do this. That short list probably covers 80-90% of professional software usage alone. People do this when serving two versions of a website, but doing this on software that runs on my machine is frankly completely unacceptable and in no way normal.
Maybe then, it's just my expectation of what they would be doing. What else is all the telemetry for? As a side note, my impression is that this is less of a Photoshop situation and more of a website situation, in that most of the functionality is input and response to/from their servers.
Telemetry is, ideally, collected with the intention of improving software, but that doesn't necessitate doing live A/B tests. A typical example: report hardware specs whenever the software crashes. Use that to identify some model of GPU or driver version that is incompatible with your software and figure out why. Ship a fix in the next update. What you don't do with telemetry is run live experiments on your users' machines at random and possibly induce more crashing.
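The crash-telemetry workflow described above can be sketched as a simple offline aggregation. This is a minimal illustration, not any vendor's pipeline; the report fields and the over-representation threshold are made up for the example:

```python
from collections import Counter

def flag_suspect_gpus(crash_reports, install_share, ratio=3.0):
    """Flag GPU models that are over-represented in crash reports
    relative to their share of the overall install base."""
    crashes = Counter(r["gpu_model"] for r in crash_reports)
    total = sum(crashes.values())
    suspects = []
    for gpu, n in crashes.items():
        crash_share = n / total
        # Flag GPUs that crash far more often than their install share predicts.
        if crash_share > ratio * install_share.get(gpu, 0.0):
            suspects.append(gpu)
    return suspects

# Hypothetical data: VendorX has 10% of installs but 80% of crash reports.
reports = [{"gpu_model": "VendorX 9000"}] * 8 + [{"gpu_model": "VendorY 200"}] * 2
install_base = {"VendorX 9000": 0.10, "VendorY 200": 0.90}
print(flag_suspect_gpus(reports, install_base))  # ['VendorX 9000']
```

The point is that this analysis happens after the fact, on aggregated reports; no experiment ever runs on a user's machine.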
Regarding the latter point, the Claude Code software controls what is being injected into your own prompt before it is sent to their servers. That is indeed the only reason the OP could discover it -- if the prompt injection was happening on their servers, it would not be visible to you. To be clear, the prompt injection is fine and part of what makes the software useful; it's natural the company does research into what prompts get desirable output for their users without making users experiment[1]. But that should really not be changing without warning as part of experiments, and I think this does fall closer to a professional tool like Photoshop than a website given how it is marketed and the fact that people are being charged $20~200/mo or more for the privilege of using it. API users especially are paying for every prompt, so being sabotaged by a live experiment is incredibly unethical.
[1] That said, I think it's an extremely bad product. A reasonable product would allow power users to configure their own prompt injections, so they have control over it and can tune it for their own circumstances. Having worked for an LLM startup, our software allowed exactly that. But our software was crafted with care by human devs, while by all accounts Claude Code is vibe coded slop.
I have no idea what you're talking about or why you think I got any information from asking Claude anything. The telemetry comment was about software in general, Photoshop etc., since the person I was replying to was asking why all proprietary software collects telemetry so often if not for A/B tests. That things are injected into your prompt before sending it to their servers is trivially verified by inspecting your own outgoing packets.
Anthropic have done a lot of things that would give me pause about trusting them in a professional context. They are anything but transparent, for example about the quota limits. Their vibe coded Claude Code CLI releases are a buggy mess too. Also the model quality inconsistency: before a new model release, there’s a week or two where their previous model is garbage.
A/B testing is fine in itself, you need to learn about improvements somehow, but this seems to be A/B testing cost saving optimisations rather than to provide the user with a better experience. Less transparency is rarely good.
This isn’t what I want from a professional tool. For business, we need consistency and reliability.
I’m a huge user of AI coding tools but I feel like there has been some kind of a zeitgeist shift in what is acceptable to release across the industry. Obviously it’s a time of incredibly rapid change and competition, but man there is some absolute garbage coming out of companies that I’d expect could do better without much effort. I find myself asking, like, did anyone even do 5 minutes of QA on this thing?? How has this major bug been around for so long?
“It’s kind of broken, maybe they will fix it at some point,” has become a common theme across products from all different players, from both a software defect and service reliability point of view.
I mean it's like, really they don't even need agentic AI or whatever, they could literally just employ devs and it wouldn't make a difference
like, they'll drop $100 billion on compute, but when it comes to devs who make their products, all of a sudden they must desperately cut costs and hire as little as possible
to me it makes no sense from a business perspective. Same with Google, e.g. YouTube is utterly broken, slow and laggy, but I guess because you're forced to use it, it doesn't matter. But still, if you have these huge money stockpiles, why not deploy it to improve things? It wouldn't matter anyways, it's only upside
I don’t think they’re even saving much on vibe coding it, given how many tokens they claim they’re using. I know the token cost to them is much, much lower than the token cost to us, but it still has a cost in terms of gpus running.
Plus it’s not something we can replicate since we don’t have access to infinite tokens, so it’s not even a good dogfooding case study.
Any tool that auto-updates carries the implication that behavior will change over time. And one criterion for being a skilled professional is having expert understanding of one's tools. That includes understanding the strengths and weaknesses of the tools (including variability of output) and making appropriate choices as a result. If you don't feel you can produce professional code with LLMs then certainly you shouldn't use them. That doesn't mean others can't leverage LLMs as part of their process and produce professional results. Blindly accepting LLM output and vibe coding clearly doesn't consistently produce professional results. But that's different than saying professionals can't use LLMs in ways that are productive.
Replicability is a spectrum, not a binary, and if you bake in enough eval harnessing plus prompt control you can get LLMs shockingly close to deterministic for a lot of workloads. If the main blocker for "professional" use was unpredictability, the entire finance sector would have shut down years ago from half the data models and APIs they limp along on daily.
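A minimal sketch of the kind of harness meant above: validate the model's output against a schema and retry on failure. `call_model` here is a stand-in for a real LLM call, rigged to fail once so the retry path is exercised:

```python
import json

def call_model(prompt, attempt):
    # Stand-in for a real LLM call; returns junk on the first try
    # to simulate non-deterministic output, then valid JSON.
    return "oops, not JSON" if attempt == 0 else '{"status": "ok", "items": [1, 2, 3]}'

def run_with_harness(prompt, required_keys=("status", "items"), max_attempts=3):
    """Retry until the output parses as JSON and contains the required keys."""
    for attempt in range(max_attempts):
        raw = call_model(prompt, attempt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: retry
        if all(k in data for k in required_keys):
            return data  # passed validation
    raise RuntimeError("model never produced valid output")

result = run_with_harness("Summarize the invoice as JSON.")
print(result["status"])  # ok
```

The model itself stays stochastic; the harness around it is what makes the workload behave deterministically from the caller's point of view.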
Yeah, I've been using Copilot to process scans of invoices and checks (with a pen laid across the account information) converted to PDFs 20 at a time. It's pretty rare for it to get all 20, but it's still much faster than opening them in batches of 50, re-saving using the Invoice ID, and then using a .bat file to rename them (and remembering to quit Adobe Acrobat after each batch, so I don't hit the bug where it stops saving files after a couple of hundred have been opened and re-saved).
This is very different from the A/B interface testing you're referring to. What LLMs enable is A/B testing the tool's own output — same input, different result.
Your compiler doesn't do that. Your keyboard doesn't do that. The randomness is inside the tool itself, not around it. That's a fundamental reliability problem for any professional context where you need to know that the same input produces the same output, every time.
It’s exactly the same as A/B testing an interface. This is just testing 4 variants of a “page” (the plan), measuring how many people pressed “continue”.
You've grouped LLMs into the wrong set. LLMs are closer to people than to machines. This argument is like saying "I want my tools to be reliable, like my light switch, and my personal assistant wasn't, so I fired him".
Not to mention that of course everyone A/B tests their output the whole time. You've never seen (or implemented) an A/B test where the test was whether to improve the way e.g. the invoicing software generates PDFs?
jfc. I don't have anything to say to this other than that it deserves calling out.
> You've never seen (or implemented) an A/B test where the test was whether to improve the way e.g. the invoicing software generates PDFs?
I have never in my life seen or implemented an A/B test on a tool used by professionals. I see consumer-facing tests on websites all the time, but nothing silently changing the software on your computer. I mean, there are mandatory updates, which I do already consider to be malware, but those are, at least, not silent.
Why are you calling it out? You are interpreting the statement too literally. The point is probably about behavior, not nature. LLMs do not always produce identical outputs for identical prompts, which already makes them less like deterministic machines and superficially closer to humans in interaction. That is it. The comparison can end here.
They actually can, though. The frontier model providers don't expose seeds, but for inferencing LLMs on your own hardware, you can set a specific seed for deterministic output and evaluate how small changes to the context change the output on that seed. This is like suggesting that Photoshop would be "more like a person than a machine" if they added a random factor every time you picked a color that changed the value you selected by +-20%, and didn't expose a way to lock it. "It uses a random number generator, therefore it's people" is a bit of a stretch.
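The seed point above can be illustrated with a toy sampler: once the seed is pinned, "random" sampling from a next-token distribution is perfectly repeatable. This is a stand-alone sketch, not any provider's API:

```python
import random

def sample_tokens(dist, n, seed):
    """Sample n tokens from a toy next-token distribution.
    Pinning the seed makes the 'random' sampling fully deterministic."""
    rng = random.Random(seed)  # private RNG, unaffected by global state
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return [rng.choices(tokens, weights=weights)[0] for _ in range(n)]

dist = {"apple": 5.0, "orange": 3.0, "pear": 1.0}
run_a = sample_tokens(dist, 10, seed=42)
run_b = sample_tokens(dist, 10, seed=42)
print(run_a == run_b)  # True: same seed, identical output
```

Real inference stacks add complications (batching, floating-point nondeterminism on GPUs), but the sampling step itself is exactly this kind of seeded pseudo-randomness, not something inherently human.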
You are right, I was wrong. I think anthropomorphizing LLMs to begin with is kind of silly. The whole "LLMs are closer to people than to machines" comparison is misleading, especially when the argument comes down to output variability.
Their outputs can vary in ways that superficially resemble human variability, but variability alone is a poor analogy for humanness. A more meaningful way to compare is to look at functional behaviors such as "pattern recognition", "contextual adaptation", "generalization to new prompts", and "multi-step reasoning". These behaviors resemble aspects of human capabilities. In particular, generalization allows LLMs to produce coherent outputs for tasks they were not explicitly trained on, rather than just repeating training data, making it a more meaningful measure than randomness alone.
That said, none of this means LLMs are conscious, intentional, or actually understanding anything. I am glad you brought up the seed and determinism point. People should know that you can make outputs fully predictable, so the "human-like" label mostly only shows up under stochastic sampling. It is far more informative to look at real functional capabilities instead of just variability, and I think more people should be aware of this.
What other tool can I have a conversation with? I can't talk to a keyboard as if it were a coworker. Consider this seriously, instead of just letting your gut reaction win. Coding with claude code is much closer to pair programming than it is to anything else.
You could have a conversation with Eliza, SmarterChild, Siri, or Alexa. I would say surely you don't consider Eliza to be closer to a person than a machine, but then it takes a deeply irrational person to have led to this conversation in the first place, so maybe you do.
Not productive conversations. If you had ever made a serious attempt to use these technologies instead of trying to come up with excuses to ignore it, you would not even think of comparing a modern LLM coding agent to some gimmick like Alexa or ELIZA. Seriously, get real.
Not only have I used the technology, I've worked for a startup that serves its own models. When you work with the technology, it could not be more obvious that you are programming software, and that there is nothing even remotely person-like about LLMs. To the extent that people think so, it is sheer ignorance of the basic technicals, in exactly the same way that ELIZA fooled non-programmers in the 1960s. You'd think we'd have collectively learned something in the 60 years since but I suppose not.
I really don't care where you've worked, to seriously argue that LLMs aren't more capable of conversation than ELIZA, aren't capable of pair programming even, is gargantuan levels of cope.
I didn't make any claims about their utility. I said that they are not like people. They are machines through and through. Regular software programs. Programs that are, I suppose, a little bit too complex for the average human to understand, so now we have the Eliza effect applying to an entirely new generation.
"I had not realized ... exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people." -- Eliza's creator
I doubt that they are just “regular software programs”, as explainable AI (or other statistical tracing) has been lagging far behind.
If that is the case and the latest models can be explained through their weights and settings, please link it. I would like to see explainable AI up and coming.
Did you also notice the evolution of average developers over time? I mean, if you take code from a developer ten years ago and compare it with their output now, you can see improvement.
I assume that over time, the output improves because of the effort and time the developer invests in themselves. However, LLMs might reduce that effort to zero — we just don't know how developers will look after ten years of using LLMs now.
Still, if you have 30 years of experience in the industry, you should be able to imagine what the real output might be.
> Did you also notice the evolution of average developers over time? I mean, if you take code from a developer ten years ago and compare it with their output now, you can see improvement.
This makes little sense to me. Yes, individual developers get better. I've seen little to no evidence that the average developer has gotten better.
> However, LLMs might reduce that effort to zero — we just don't know how developers will look after ten years of using LLMs now.
It might reduce that effort to zero from the same people who have always invested the bare minimum of effort to hold down a job. Most of them don't advance today either, and most of them will deliver vastly better results if they lean heavily on LLMs. On the high end, what I see experienced developers do with LLMs involves a whole lot of learning, and will continue to involve a whole lot of learning for many years, just like with any other tool.
After 30 years in front of the desktop, we are processing dopamine differently.
When I speak about 10 years from now, I’m referring to who will become an average developer if we replace the real coding experience learning curve with LLMs from day one.
I also hear a lot of tool analogies — tractors for developers, etc. But every tool, without exception, provides replicable results. With LLMs, however, repeatable results are highly questionable, so it seems premature to me to treat LLMs the same way as any other tool.
Right, I've seen a lot of facile comparisons to calculators.
It may be true that a cohort of teachers were wrong (on more than one level) when they chastised students with "you need to learn this because you won't always have a calculator"... However, calculators have some essential qualities which LLMs don't, and if calculators lacked those qualities we wouldn't be using them the way we do.
In particular, being able to trust (and verify) that it'll do a well-defined, predictable, and repeatable task that can be wrapped into a strong abstraction.
> if we replace the real coding experience learning curve with LLMs from day one.
People will learn different things. They will still learn. Most developers I've hired over the years do not know assembly. Many do not know a low-level language like C. That is a downside if they need to write assembly, but most of them never do (and incidentally, Opus knows x86 assembly better than me, knows gdb better than me; it's still not good at writing large assembly programs). It does not make them worse developers in most respects, and by the time they have 30 years experience the things they learn instead will likely be far more useful than many of the things I've spent years learning.
> But every tool, without an exception, provides replicable results.
This is just sheer nonsense, and if you genuinely believe this, it suggests to me a lack of exposure to the real world.
> if you take code from a developer ten years ago and compare it with their output now, you can see improvement.
really? it depends on the type of development, but ten years ago the coder profession had already long gone mainstream and massified, with a lot of people just attracted by a convenient career rather than vocation. mediocrity was already the baseline ("agile" mentality to at the very least cope with that mediocrity and turnover churn was already at its peak) and on the other extreme coder narcissism was already en vogue.
the tools, resources, and environments have undoubtedly improved a lot, though at the cost of overhead and overcomplexity. higher abstraction levels help but promote detachment from the fundamentals.
so specific areas and high end teams have probably improved, but i'd say average code quality has actually diminished, and keeps doing so. if it weren't for qa, monitoring, auditing and mitigation processes it would by now be catastrophic. cue in agents and vibe coding ...
as an old school coder that nowadays only codes for fun i see llm tools as an incredibly interesting and game changing tool for the profane, but that a professional coder might cede control to an agent (as opposed to use it for prospection or menial work) makes me already cringe, and i'm unable to wrap my head around vibe coding.
Thanks for providing the context! "car is an Audi Q6 e-tron Performance" — I'm wondering who names a car model like a spaceship destroyer.
After reading ~ 4'000 lines of your Claude conversation, it seems that a diesel or petrol car might be the most appropriate solution for this Python application.
Yeah, anyone who’s used LLMs for a while would know that this conversation is a lost cause and the only option is to start fresh.
But, a common failure mode for those that are new to using LLMs, or use it very infrequently, is that they will try to salvage this conversation and continue it.
What they don’t understand is that this exchange has permanently rotted the context and will rear its head in ugly ways the longer the conversation goes.
I’ve found this happens with repos over time. Something convinces it that implementing the same bug over and over is a natural next step.
I’ve found that keeping one session open and giving progressively less polite feedback when it makes that mistake sometimes bumps it out of the local maximum.
Clearing the session doesn’t work because the poison fruit lives in the git checkout, not the session context.
I don't think it's intended as that kind of binary. It's more like "yeah, it's flawed in that way, and here's how you can get around that". If someone's claiming the tool is perfect, they're wrong; but if someone's repeatedly using it in the way that doesn't work and claiming the tool is useless, they're also wrong.
Nobody said that. But as you say, it's just a tool. Tools need to be used correctly. If tools are unintuitive, maybe that's due to the nature of the tool or due to a flaw in it's design. But either way, you as the user need to work around that if you want to get the maximum use out of the tool.
I find myself wondering about this though. Because, yes, what you say is true. Transformer architecture isn’t likely to handle negations particularly well. And we saw this plain as day in early versions of ChatGPT, for example. But then all the big players pretty much “fixed” negations and I have no idea how. So is it still accurate to say that understanding the transformer architecture is particularly informative about modern capabilities?
I use an LLM as a learning tool. I'm not interested in it implementing things for me, so when it shows its seemingly frantic desire to write code, I ignore the request and prompt it along other lines. It will still enthusiastically burst into code.
LLMs do not have emotions, but they seem to be excessively insecure and overly eager to impress.
This is because LLMs don't actually understand language, they're just a "which word fragment comes next machine".
Instruction: don't think about ${term}
Now `${term}` is in the LLM's context window. The attention mechanism will amplify the logits related to `${term}` based on how often `${term}` appeared in the chat. This is just how text gets transformed into numbers for the LLM to process. The relational structure of transformers will similarly amplify tokens related to `${term}`, since that is what training is about: you said `fruit`, so `apple`, `orange`, `pear`, etc. all become more likely to get spat out.
The negation of a term ("do not under any circumstances do X") generally does not work unless the model has received extensive training and fine-tuning to ensure a specific "Do not generate X" will influence every single downstream weight (multiple times), which providers often do for writing style and specific (illegal) terms. So for drafting emails or chatting, it works fine.
But when you start getting into advanced technical concepts & profession specific jargon, not at all.
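The mechanism described above can be caricatured in a few lines: scoring candidate next tokens by similarity to the context, there is no operation by which "don't" can subtract "fruit". This is a toy similarity model with made-up embeddings, not an actual transformer:

```python
# Toy embedding space: related words point in similar directions.
EMBED = {
    "fruit":  [1.0, 0.0],
    "apple":  [0.9, 0.1],
    "orange": [0.8, 0.2],
    "piano":  [0.0, 1.0],
}

def next_token_scores(context_words, candidates=("apple", "orange", "piano")):
    """Score candidate next tokens by dot-product similarity to the context.
    Everything here only adds; negation words have no way to cancel 'fruit'."""
    scores = {}
    for cand in candidates:
        scores[cand] = sum(
            sum(a * b for a, b in zip(EMBED[cand], EMBED[w]))
            for w in context_words if w in EMBED
        )
    return scores

# The negation words aren't in the (toy) vocabulary, but 'fruit' is,
# so fruit-related candidates get boosted despite the "do not".
scores = next_token_scores(["do", "not", "think", "about", "fruit"])
print(max(scores, key=scores.get))  # apple
```

Real models do learn to handle negation through training, as the comment above notes; the toy just shows why mentioning a term raises the salience of related tokens by default.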
But they must have received this fine-tuning, right?
Otherwise it's hard to explain why they follow these negations in most cases (until they make a catastrophic mistake).
I often test this with ChatGPT with ad-hoc word games, I tell it increasingly convoluted wordplay instructions, forbid it from using certain words, make it do substitutions (sometimes quite creative, I can elaborate), etc, and it mostly complies until I very intentionally manage to trip it up.
If it was incapable of following negations, my wordplay games wouldn't work at all.
I did notice that once it trips up, the mistakes start to pile up faster and faster. Once it's made a serious mistake, it's like the context becomes irreparably tainted.