I was thinking about inhouse model inference speeds at frontier labs like Anthropic and OpenAI after reading the "Claude built a C compiler" article.
Having higher inference speed would be an advantage, especially if you're trying to eat all the software and services.
Anthropic offering 2.5x makes me assume they have 5x or 10x themselves.
In the predicted nightmare future where everything happens via agents negotiating with agents, the side with the most compute, and the fastest compute, is going to steamroll everyone.
LLM APIs are tuned to handle a lot of parallel requests. The upshot is that overall token throughput is higher, but individual requests are processed more slowly.
The scaling curves aren't that extreme, though. I doubt they could tune the knobs to get individual requests coming through at 10X the normal rate.
This likely comes from having some servers tuned for higher individual request throughput, at the expense of overall token throughput. It's possible that it's on some newer generation serving hardware, too.
This makes no sense. It's not like they have a "slow it down" knob; they're probably parallelizing your request so you get a 2.5x speedup at 10x the price.
All of these systems use massive pools of GPUs, and allocate many requests to each node. The “slow it down” knob is to steer a request to nodes with more concurrent requests; “speed it up” is to route to less-loaded nodes.
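A minimal sketch of what that routing "knob" could look like, purely to make the point concrete. The pool names, node counts, and batch caps here are all made up, not anything the providers have published:

```python
from dataclasses import dataclass

@dataclass
class Node:
    pool: str        # "fast" or "standard" (hypothetical tiers)
    max_batch: int   # how many requests share this node's decode batch
    active: int = 0  # requests currently running on the node

# A small fast pool with low batch caps, a large standard pool with high ones.
NODES = ([Node("fast", max_batch=8) for _ in range(4)]
         + [Node("standard", max_batch=64) for _ in range(16)])

def route(tier: str) -> Node:
    """Send the request to the least-loaded node in its pool.

    "Slowing down" just means landing on a node that already has many
    concurrent requests; "speeding up" means landing on a lightly loaded
    node whose batch is capped lower. No artificial delay required.
    """
    candidates = [n for n in NODES if n.pool == tier and n.active < n.max_batch]
    if not candidates:
        # Pool saturated: in a real system the request would queue here.
        candidates = [n for n in NODES if n.active < n.max_batch]
    if not candidates:
        raise RuntimeError("every node is full; the request has to wait")
    node = min(candidates, key=lambda n: n.active)
    node.active += 1
    return node
```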
But it's actually not so difficult, is it? The simplest way to make a slow pool is to have fewer GPUs and queue requests from the non-premium users. Dead simple engineering.
Oh, of course. That’s just conspiratorial thinking. Paying to be in a premium pool makes sense, all of this “they probably serve rotten food to make people pay for quality food” nonsense is just silly.
What they are probably doing is speculative decoding, given they've mentioned identical distribution at 2.5x speed. That's roughly in the range you'd expect for that to achieve; 10x is not.
It's also absolute highway robbery (or at least overly aggressive price discrimination) to charge 6x for speculative decoding, by the way. It is not that expensive and (under certain conditions, usually a very cheap drafter and a high acceptance rate) can actually decrease total cost. In any case, it's unlikely to be even a 2x cost increase, let alone 6x.
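For anyone unfamiliar, the greedy version of speculative decoding is roughly the loop below. This is a toy sketch with a made-up `propose`/`next_tokens` interface, not anything from Anthropic's stack; the realized speedup depends entirely on how often the cheap drafter's guesses get accepted:

```python
def speculative_decode(target_model, draft_model, prompt, k=4, max_tokens=256):
    """Toy greedy speculative decoding.

    Assumed interfaces (hypothetical):
      draft_model.propose(ctx, k)   -> k cheap guessed next tokens
      target_model.next_tokens(ctx) -> from one forward pass, the target's
                                       greedy pick at each of the last k+1
                                       prefix positions of ctx

    The output matches what the target model alone would produce, but each
    expensive forward pass can emit several tokens when the guesses are right.
    """
    tokens = list(prompt)
    while len(tokens) < max_tokens:
        draft = draft_model.propose(tokens, k)               # k cheap guesses
        verified = target_model.next_tokens(tokens + draft)  # k+1 target picks
        accepted = 0
        for guess, truth in zip(draft, verified):
            if guess != truth:
                break
            accepted += 1
        # Keep the agreed prefix plus one token the target chose itself,
        # so every iteration makes progress even if nothing was accepted.
        tokens += draft[:accepted] + [verified[accepted]]
    return tokens[:max_tokens]
```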
Where on earth are you getting these numbers?
Why would a SaaS company that is fighting for market dominance withhold 10x performance if they had it? Where are you getting 2.5x?
This is such bizarre magical thinking, borderline conspiratorial.
There is no reason to believe any of the big AI players are serving anything less than the best trade off of stability and speed that they can possibly muster, especially when their cost ratios are so bad.
That's also called slowing down the default experience so users have to pay more for the fast mode. I think it's the first time we're seeing blatant speed ransoms in LLMs.
That's not how this works. LLM serving at scale processes multiple requests in parallel for efficiency. Reduce the parallelism and you can process individual requests faster, but the overall number of tokens processed is lower.
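A back-of-the-envelope version of that trade-off, with invented numbers purely to show the shape. Decode steps are mostly memory-bandwidth bound, so serving a bigger batch barely increases the step time:

```python
# Toy model: one decode step reads all the weights once (fixed cost) plus a
# small per-request cost for the batch's KV caches. Numbers are made up.
def step_time_ms(batch_size, weights_ms=20.0, per_request_ms=0.5):
    return weights_ms + per_request_ms * batch_size

for batch in (1, 8, 32, 64):
    t = step_time_ms(batch)
    per_request = 1000.0 / t        # tokens/sec as seen by one caller
    total = batch * per_request     # tokens/sec produced by the whole node
    print(f"batch={batch:2d}: {per_request:5.1f} tok/s per request, "
          f"{total:7.1f} tok/s total")

# Output of this toy model:
# batch= 1:  48.8 tok/s per request,    48.8 tok/s total
# batch=64:  19.2 tok/s per request,  1230.8 tok/s total
# Cutting the batch makes each caller faster but wastes most of the node's
# capacity, which is why a low-batch "fast" tier costs more per token.
```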
They have contracts with companies, and those companies won't be able to change quickly. By the time those contracts come up for renewal it will already be too late, their code having become completely unreadable by humans. Individual devs can move quickly, but companies don't.
Are you at all familiar with the architecture of systems like theirs?
The reason people don't jump to your conclusion here (and why you get downvoted) is that for anyone familiar with how this is orchestrated on the backend it's obvious that they don't need to do artificial slowdowns.
Seriously, looking at the price structure of this (6x the price for 2.5x the speed, if that's correct), it seems to target something like real-time applications with very small context. Maybe voice assistants? I guess that if you're doing development it makes more sense to parallelize over more agents rather than paying that much for a modest increase in speed.
I'm building my take on a low-touch task completion assistant designed to counter distraction and hyper-habituation.
It's starting off as a macOS app because that's the machine I have. I didn't know Swift or SwiftUI when I started. I now know them somewhat, but the entire app has been vibe-coded. This has made it slow going. Very "1 step forward, 2 steps back" until I switched from Claude Code to Codex and GPT-5.
I'm hoping to start an initial beta within the family in the next week or two, and then a wider round in January.
This is already happening, and through a technique that copyright law does not really protect against. Writers of genre fiction are already reporting that their e-books are being run through an LLM to completely rephrase them, and the result sold by somebody else under a different title and author. This is easily automated.
This seems like it would be a copyright violation? The result bears substantial similarity to the input, even though it doesn’t have the particular words in common, right?
Like, if you translated the Spanish version to English, you’d have different words than the official English version, but it would still be a copyright violation to sell that, right?
Likewise if you first had someone do a translation from English to Spanish before you translated it back to English?
If it is based on an existing copyrighted work, bears substantial similarity to it, and competes with the original in the market, I thought copyright handled that?
The key lies in how easily the process is automated. Once a certain amount of freshly published ebooks are getting rephrased and sold by someone else, authors would be playing whack-a-mole with copyright claims, and they might not even become aware of all the copies of their work out there.
Are these numbers full time employees only or total FTEs? Because it mentions Walmart: "Walmart’s full-time employees number remained relatively constant for the last 10 years".
Would revenue / person-hour show a different trend? Because there are a lot of part-time and contract workers out there.
You might be surprised how low a dose you need for an effect. 5-10ug of Ritalin noticeably reduces the "noise floor" for me.
How do you take 5-10ug? Dissolve 10mg in a litre of something, which works out to 10ug per ml. Get a 1ml dosing syringe with 0.1ml markings; 0.5-1ml gives you 5-10ug.
You could start there and increase it until you find what works. Also, if you take very little you can have a break on weekends and not suffer too much while remaining sensitive to lower dosages.
All rights reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the publisher, except as permitted by U.S. copyright law.
“Reproduced” and “electronic” are the relevant terms here.
I remember when gpt-3 came out and you could get it to spit out chunks of Harry Potter and I wondered why no-one was being sued.
The models are built on copyright infringement. Authors and publishers of any kind should be able to opt out of being included in training data and ideally opt-in should be the default.
And I hope one day someone trains a model without the use of works of fiction and we find a qualitative difference in their performance. Does a coding model really need to encode the customs, mores and concerns of Victorian era fictional characters to write a python function?
i've been testing all models that fit the mac studio 512 gb ever since i got it. previously i was mostly focused on getting tool use and chain of thought fine-tuning for coding, around the size of llama 3.2 11b. but even some distill r1s on llama 3 70b run well on macbooks, although quite slow compared to a regular api call to the closed models.
for mac studios i've found the sweet spot to be the largest gemma, up until llama scout was released, which fits the mac studio best. scout, although faster to generate, takes a while longer to fill in the long context, basically getting the same usability speeds as with the qwq or gemma 27b.
the refactoring is a test-driven task that i've programmed to run by itself, think deep research, until it passes the tests or exhausts the imposed trial limits. i've written it by instructing gemini, r1 and claude. in short, i've made gemini read and document proposals for refactoring, based on the way i code and the strict architectural patterns that i find optimal for projects that handle both an engine and some views, such as the react.js views that are present in these vscode extensions.
gemini pro gets it really well and has enough context capacity to maintain several different branches of the same codebase with these crazy long files without losing context. once this task is completed, training a smaller model based on the executed actions (by that i mean all the tool use: diff, insert, replace and, most importantly, testing) to perform the refactoring instructions is fairly easy.
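the driver loop for that kind of task is not much more than the sketch below. `propose_patch` and `apply_patch` are placeholders for whatever tool-use plumbing the model exposes, and `npm test` is just an example test command, not a claim about my actual setup:

```python
import subprocess

def run_tests(cmd=("npm", "test")) -> tuple[bool, str]:
    """run the project's test suite; return (passed, combined output)."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def refactor_until_green(goal, propose_patch, apply_patch, max_trials=10):
    """propose a patch, apply it, run the tests, feed failures back.

    propose_patch(goal, feedback) and apply_patch(patch) are callables
    wrapping the model's tool use (diff, insert, replace); the loop stops
    when the suite is green or the trial budget is exhausted.
    """
    feedback = ""
    for _ in range(max_trials):
        patch = propose_patch(goal, feedback)
        apply_patch(patch)
        passed, output = run_tests()
        if passed:
            return True
        feedback = output   # test failures go back into the next prompt
    return False
```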