I was thinking about inhouse model inference speeds at frontier labs like Anthropic and OpenAI after reading the "Claude built a C compiler" article.
Having higher inference speed would be an advantage, especially if you're trying to eat all the software and services.
Anthropic offering 2.5x makes me assume they have 5x or 10x themselves.
In the predicted nightmare future where everything happens via agents negotiating with agents, the side with the most compute, and the fastest compute, is going to steamroll everyone.
LLM APIs are tuned to handle a lot of parallel requests. The upshot is that overall token throughput is higher, but individual requests are processed more slowly.
The scaling curves aren't that extreme, though. I doubt they could tune the knobs to get individual requests coming through at 10X the normal rate.
This likely comes from having some servers tuned for higher individual request throughput, at the expense of overall token throughput. It's possible that it's on some newer generation serving hardware, too.
This makes no sense. It's not like they have a "slow it down" knob; they're probably parallelizing your request so you get a 2.5x speedup at 10x the price.
All of these systems use massive pools of GPUs, and allocate many requests to each node. The “slow it down” knob is to steer a request to nodes with more concurrent requests; “speed it up” is to route to less-loaded nodes.
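A minimal sketch of what that routing "knob" could look like, purely to make the point concrete. The pool names, node counts, and batch caps here are all made up, not anything the providers have published:

```python
from dataclasses import dataclass

@dataclass
class Node:
    pool: str        # "fast" or "standard" (hypothetical tiers)
    max_batch: int   # how many requests share this node's decode batch
    active: int = 0  # requests currently running on the node

# A small fast pool with low batch caps, a large standard pool with high ones.
NODES = ([Node("fast", max_batch=8) for _ in range(4)]
         + [Node("standard", max_batch=64) for _ in range(16)])

def route(tier: str) -> Node:
    """Send the request to the least-loaded node in its pool.

    "Slowing down" just means landing on a node that already has many
    concurrent requests; "speeding up" means landing on a lightly loaded
    node whose batch is capped lower. No artificial delay required.
    """
    candidates = [n for n in NODES if n.pool == tier and n.active < n.max_batch]
    if not candidates:
        # Pool saturated: in a real system the request would queue here.
        candidates = [n for n in NODES if n.active < n.max_batch]
    if not candidates:
        raise RuntimeError("every node is full; the request has to wait")
    node = min(candidates, key=lambda n: n.active)
    node.active += 1
    return node
```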
But it's actually not so difficult, is it? The simplest way to make a slow pool is to have fewer GPUs and queue requests from the non-premium users. Dead simple engineering.
Oh, of course. That’s just conspiratorial thinking. Paying to be in a premium pool makes sense, all of this “they probably serve rotten food to make people pay for quality food” nonsense is just silly.
What they are probably doing is speculative decoding, given they've mentioned identical distribution at 2.5x speed. That's roughly in the range you'd expect for that to achieve; 10x is not.
It's also absolute highway robbery (or at least overly aggressive price discrimination) to charge 6x for speculative decoding, by the way. It is not that expensive and (under certain conditions, usually a very cheap drafter and a high acceptance rate) can actually decrease total cost. In any case, it's unlikely to be even a 2x cost increase, let alone 6x.
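For anyone unfamiliar, the greedy version of speculative decoding is roughly the loop below. This is a toy sketch with a made-up `propose`/`next_tokens` interface, not anything from Anthropic's stack; the realized speedup depends entirely on how often the cheap drafter's guesses get accepted:

```python
def speculative_decode(target_model, draft_model, prompt, k=4, max_tokens=256):
    """Toy greedy speculative decoding.

    Assumed interfaces (hypothetical):
      draft_model.propose(ctx, k)   -> k cheap guessed next tokens
      target_model.next_tokens(ctx) -> from one forward pass, the target's
                                       greedy pick at each of the last k+1
                                       prefix positions of ctx

    The output matches what the target model alone would produce, but each
    expensive forward pass can emit several tokens when the guesses are right.
    """
    tokens = list(prompt)
    while len(tokens) < max_tokens:
        draft = draft_model.propose(tokens, k)               # k cheap guesses
        verified = target_model.next_tokens(tokens + draft)  # k+1 target picks
        accepted = 0
        for guess, truth in zip(draft, verified):
            if guess != truth:
                break
            accepted += 1
        # Keep the agreed prefix plus one token the target chose itself,
        # so every iteration makes progress even if nothing was accepted.
        tokens += draft[:accepted] + [verified[accepted]]
    return tokens[:max_tokens]
```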
Where on earth are you getting these numbers?
Why would a SaaS company that is fighting for market dominance withhold 10x performance if they had it? Where are you getting 2.5x?
This is such bizarre magical thinking, borderline conspiratorial.
There is no reason to believe any of the big AI players are serving anything less than the best trade off of stability and speed that they can possibly muster, especially when their cost ratios are so bad.
That's also called slowing down the default experience so users have to pay more for the fast mode. I think it's the first time we're seeing blatant speed ransoms in LLMs.
That's not how this works. LLM serving at scale processes multiple requests in parallel for efficiency. Reduce the parallelism and you can process individual requests faster, but the overall number of tokens processed is lower.
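A back-of-the-envelope version of that trade-off, with invented numbers purely to show the shape. Decode steps are mostly memory-bandwidth bound, so serving a bigger batch barely increases the step time:

```python
# Toy model: one decode step reads all the weights once (fixed cost) plus a
# small per-request cost for the batch's KV caches. Numbers are made up.
def step_time_ms(batch_size, weights_ms=20.0, per_request_ms=0.5):
    return weights_ms + per_request_ms * batch_size

for batch in (1, 8, 32, 64):
    t = step_time_ms(batch)
    per_request = 1000.0 / t        # tokens/sec as seen by one caller
    total = batch * per_request     # tokens/sec produced by the whole node
    print(f"batch={batch:2d}: {per_request:5.1f} tok/s per request, "
          f"{total:7.1f} tok/s total")

# Output of this toy model:
# batch= 1:  48.8 tok/s per request,    48.8 tok/s total
# batch=64:  19.2 tok/s per request,  1230.8 tok/s total
# Cutting the batch makes each caller faster but wastes most of the node's
# capacity, which is why a low-batch "fast" tier costs more per token.
```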
They have contracts with companies, and those companies won't be able to change quickly. By the time those contracts come up for renewal it will already be too late, their code having become completely unreadable by humans. Individual devs can move quickly, but companies don't.
Are you at all familiar with the architecture of systems like theirs?
The reason people don't jump to your conclusion here (and why you get downvoted) is that for anyone familiar with how this is orchestrated on the backend it's obvious that they don't need to do artificial slowdowns.
Seriously, looking at the price structure of this (6x the price for 2.5x the speed, if that's correct), it seems to target something like real-time applications with very small context. Maybe voice assistants? I guess that if you're doing development it makes more sense to parallelize over more agents rather than paying that much for a modest increase in speed.
I'm building my take on a low-touch task completion assistant designed to counter distraction and hyper-habituation.
It's starting off as a macOS app because that's the machine I have. I didn't know Swift or SwiftUI when I started. I now know them somewhat, but the entire app has been vibe-coded. This has made it slow going. Very "1 step forward, 2 steps back" until I switched from Claude Code to Codex and GPT-5.
I'm hoping to start an initial beta within the family in the next week or two, and then a wider round in January.
This is already happening, and through a technique that copyright law does not really protect against. Writers of genre fiction are already reporting that their e-books are being run through an LLM to completely rephrase them, and the result sold by somebody else under a different title and author. This is easily automated.
This seems like it would be a copyright violation? The result bears substantial similarity to the input, even though it doesn’t have the particular words in common, right?
Like, if you translated the Spanish version to English, you’d have different words than the official English version, but it would still be a copyright violation to sell that, right?
Likewise if you first had someone do a translation from English to Spanish before you translated it back to English?
If it is based on an existing copyrighted work, bears substantial similarity to it, and competes with the original in the market, I thought copyright handled that?
The key lies in how easily the process is automated. Once a certain amount of freshly published ebooks are getting rephrased and sold by someone else, authors would be playing whack-a-mole with copyright claims, and they might not even become aware of all the copies of their work out there.
Are these numbers full time employees only or total FTEs? Because it mentions Walmart: "Walmart’s full-time employees number remained relatively constant for the last 10 years".
Would revenue / person-hour show a different trend? Because there are a lot of part-time and contract workers out there.
You might be surprised how low a dose you need for an effect. 5-10ug of Ritalin noticeably reduces the "noise floor" for me.
How do you take 5-10ug? Dissolve 10mg in a litre of something, which works out to 10ug per ml. Get a 1ml dosing syringe with 0.1ml markings; 0.5-1ml gives you 5-10ug.
You could start there and increase it until you find what works. Also, if you take very little you can have a break on weekends and not suffer too much while remaining sensitive to lower dosages.
All rights reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the publisher, except as permitted by U.S. copyright law.
“Reproduced” and “electronic” are the relevant terms here.
I remember when gpt-3 came out and you could get it to spit out chunks of Harry Potter and I wondered why no-one was being sued.
The models are built on copyright infringement. Authors and publishers of any kind should be able to opt out of being included in training data and ideally opt-in should be the default.
And I hope one day someone trains a model without the use of works of fiction and we find a qualitative difference in their performance. Does a coding model really need to encode the customs, mores and concerns of Victorian era fictional characters to write a python function?
i've been testing all models that fit the mac studio 512 gb ever since i got it. previously i was mostly focused on getting tool use and chain of thought fine-tuning for coding, around the size of llama 3.2 11b. but even some distill r1s on llama 3 70b run well on macbooks, although quite slow compared to a regular api call to the closed models.
for mac studios i've found the sweet spot to be the largest gemma, up until llama scout was released, which fits the mac studio best. scout, although faster to generate, takes a while longer to fill in the long context, basically getting the same usability speeds as with the qwq or gemma 27b.
the refactoring is a test-driven task that i've programmed to run by itself, think deep research, until it passes the tests or exhausts the imposed trial limits. i've written it by instructing gemini, r1 and claude. in short, i've made gemini read and document proposals for refactoring, based on the way i code and the strict architectural patterns that i find optimal for projects that handle both an engine and some views, such as the react.js views that are present in these vscode extensions.
gemini pro gets it really well and has enough context capacity to maintain several different branches of the same codebase with these crazy long files without losing context. once this task is completed, training a smaller model based on the executed actions (by that i mean all the tool use: diff, insert, replace and, most importantly, testing) to perform the refactoring instructions is fairly easy.
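the driver loop for that kind of task is not much more than the sketch below. `propose_patch` and `apply_patch` are placeholders for whatever tool-use plumbing the model exposes, and `npm test` is just an example test command, not a claim about my actual setup:

```python
import subprocess

def run_tests(cmd=("npm", "test")) -> tuple[bool, str]:
    """run the project's test suite; return (passed, combined output)."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def refactor_until_green(goal, propose_patch, apply_patch, max_trials=10):
    """propose a patch, apply it, run the tests, feed failures back.

    propose_patch(goal, feedback) and apply_patch(patch) are callables
    wrapping the model's tool use (diff, insert, replace); the loop stops
    when the suite is green or the trial budget is exhausted.
    """
    feedback = ""
    for _ in range(max_trials):
        patch = propose_patch(goal, feedback)
        apply_patch(patch)
        passed, output = run_tests()
        if passed:
            return True
        feedback = output   # test failures go back into the next prompt
    return False
```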