Hacker News | zozbot234's comments

> The K-V cache of a model is intertwined with the model's configuration.

I don't think this is true if you simply quantize the model or run it with fewer active experts? The underlying weights would stay the same. You could also play further tricks with skipping some of the model's middle layers outright, which works surprisingly well due to how skip connections are used.
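As a toy illustration of the layer-skipping point (a sketch, not any real model's code): in a residual stack each layer only adds a delta to the running hidden state, so dropping a layer leaves the state with the same shape and the later layers still run. The names `run_residual_stack` and `toy_layer` are made up for this example.

```python
from typing import Callable, Iterable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def run_residual_stack(x: Vector, layers: List[Layer], skip: Iterable[int] = ()) -> Vector:
    """Apply x <- x + layer(x) for each layer, leaving out the indices in `skip`."""
    skipped = set(skip)
    for i, layer in enumerate(layers):
        if i in skipped:
            # A skipped layer acts as the identity: the residual path carries x through.
            continue
        delta = layer(x)
        x = [xi + di for xi, di in zip(x, delta)]
    return x


def toy_layer(scale: float) -> Layer:
    """Hypothetical stand-in for a transformer block: returns a small delta."""
    return lambda x: [scale * xi for xi in x]


layers = [toy_layer(0.1) for _ in range(6)]
full = run_residual_stack([1.0, 2.0], layers)
pruned = run_residual_stack([1.0, 2.0], layers, skip={2, 3})  # drop two middle layers
```

This only shows why skipping is structurally possible. How much quality survives in a real model is an empirical question that the toy does not answer.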


That would be about 300 tok/s over 72 hours at Claude Fable output token prices? I'm not sure that this passes a sanity test.
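For anyone who wants to redo this kind of sanity check, here is a minimal sketch of the arithmetic. The spend and the per-million-token price are parameters you have to supply yourself; nothing here assumes an actual price for any model.

```python
def total_tokens(tok_per_s: float, hours: float) -> float:
    """Tokens produced at a sustained rate over a wall-clock duration."""
    return tok_per_s * hours * 3600


def implied_tok_per_s(spend_usd: float, usd_per_million_tokens: float, hours: float) -> float:
    """Sustained output rate needed to spend `spend_usd` within `hours`."""
    tokens = spend_usd / usd_per_million_tokens * 1_000_000
    return tokens / (hours * 3600)


# 300 tok/s sustained for 72 hours:
tokens_72h = total_tokens(300, 72)
```

At 300 tok/s, 72 hours comes to 77.76 million output tokens, which is the figure that has to look plausible for the claim to hold.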

Subagents are a helluva drug.

> First, there's been a huge amount of widely accepted research that shows what the most accessible way to design an interface is.

Focus-group-based UX research was a lot more intense in the 1990s than it is today, and late 1990s UIs are still among the best available.


DeepSWE has been heavily criticized though. https://github.com/datacurve-ai/deep-swe/issues/21 Putting GPT 5.5 on top is the obviously correct part, but everything else about it makes very little sense.

The DS4 author has demoed upcoming work on Strix Halo that makes it roughly competitive with the Apple Silicon equivalent (i.e. Pro models with similar memory bandwidth figures, not Max or Ultra). Maybe even a bit faster for prefill, and with further potential for running small batches in parallel (since the GPU clearly has some amount of compute headroom during decode).

As far as I can tell you'll have a context limit of about 64k, which is also prohibitive for serious work. (My benchmark maxes out at 90k in context when running, so I'm giving the self-hosted models 128k to leave plenty of wiggle room.)

But, still, it's cool that the work is happening. For some classes of problem it might be an option, and when the 192GB Strix Halo comes out, DS4 will probably become a real contender for self-hosting champ, as that leaves enough memory for a big context.


> As far as I can tell you'll have a context limit of about 64k

Source? The author has demoed a 100k ctx already, and I can't think of a reason why more wouldn't be supported. RAM is a bit tight but that only matters with really long contexts on DeepSeek V4, and proper support for SSD streaming would address this anyway.

BTW, the official support is now merged too.


OK, I just tried it with the new mainline ROCm and MTP support, and it is faster, but still uncomfortably slow for interactive coding agent use. It does about 14-15 t/s, which is faster than the 10-11 t/s I was seeing before, but still a crawl. I set it loose on a small 300-line Perl file, and it's still chewing several minutes later.

So, it's super cool that such a solid model can run locally and it's probably useful for batched work overnight. But, I'm not going to sit around twiddling my thumbs while working. I think I can write code by hand faster than this. I'll gladly pay for a cloud model so I don't have to wait (especially since DeepSeek models are so cheap).


Well, that performance figure seems consistent with memory bandwidth on that machine (and its upcoming successor Gorgon Halo; Medusa Halo is projected to be faster) and even on DGX/RTX Spark. You'd get the same outcome on Apple Silicon Mn Pro (not Max or Ultra) if there were one with enough memory capacity. It's likely possible to raise aggregate tok/s on Strix Halo or DGX/RTX Spark (not realistically on Apple Silicon, at least not on a single machine) by batching multiple inference flows together, but that's admittedly a bit fiddly to implement and not what you're interested in anyway.

It seems that you'll want either top-of-the-line Apple Silicon (Max/Ultra) or cloud inference, which will always be super competitive if your focus is on low latency.
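The bandwidth argument above can be sketched roughly as follows, assuming decode is purely memory-bound and every active weight is read once per generated token. This ignores KV cache traffic and compute limits, and the input numbers in the usage line are illustrative placeholders, not measured figures for any specific machine or model.

```python
def decode_tok_per_s(
    bandwidth_gb_s: float,
    active_params_billion: float,
    bytes_per_param: float,
    efficiency: float = 1.0,
) -> float:
    """Upper-bound single-stream decode rate for a memory-bound model.

    bandwidth_gb_s        -- memory bandwidth in GB/s
    active_params_billion -- parameters touched per token (active experts for MoE)
    bytes_per_param       -- e.g. 0.5 for 4-bit weights, 2.0 for fp16
    efficiency            -- fraction of peak bandwidth actually achieved (assumption)
    """
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 * efficiency / bytes_per_token


# Illustrative only: 256 GB/s, 20B active params, 4-bit weights.
estimate = decode_tok_per_s(256, 20, 0.5)
```

With those placeholder inputs the ceiling is 25.6 tok/s, and real-world efficiency below 1.0 pulls it down from there. Batching helps aggregate throughput because the same weight reads are shared across several streams until compute becomes the limit.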


No source, just back of the envelope math. 100k seems optimistic, but I guess I'll try it and see. That would be usable for at least a few use cases, including the security scanning work I'm focused on at the moment (at least, so far, the peak token usage has been 90k, which would make 100k tight but probably fine).

It's picking strange tasks that don't really play to GPT-Pro's strengths (that model is roughly comparable to Mythos, intended for very hard reasoning and research-level problems) and then completely ignoring quite a few cases where GPT-Pro actually got some things more correct than DeepSeek did. The auto-AI ranking is just not reliable for this stuff.

The expensive tokens are output, not input. A useful rule of thumb is that a million tokens per day means roughly 10 tok/s on a 24/7 basis (1,000,000 / 86,400 s ≈ 11.6).
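The rule of thumb is just a unit conversion; a minimal sketch:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def sustained_tok_per_s(tokens_per_day: float) -> float:
    """Average rate needed to emit `tokens_per_day` when running around the clock."""
    return tokens_per_day / SECONDS_PER_DAY


def tokens_per_day(tok_per_s: float) -> float:
    """Inverse: daily volume for a constant rate."""
    return tok_per_s * SECONDS_PER_DAY


rate = sustained_tok_per_s(1_000_000)  # about 11.57 tok/s
```

So "several million tokens daily" corresponds to a few tens of tok/s sustained around the clock.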

Even then, I highly doubt any sort of automation is producing on the order of several million tokens daily. The issue I see with the org in the parent comment seems to stem from management and not from any sort of token repricing.

I can't say more. But it is totally possible.

Absolutely. The cost comparison is roughly between DeepSeek and Haiku (assuming a reputable Western provider, not DeepSeek's own API) whereas the average capabilities sit comfortably above Sonnet.

Not a great example: Qwen 9b is a tiny model that outputs barely coherent text in a casual chat, nowhere near comparable to Haiku. But the broader point stands.

I am not sure if you are testing Qwen 3.5 9b. I would also verify that you are running it correctly. Qwen 3.5 9b is actually a very capable coding model that can do agentic coding, although it's obviously not as good as Opus.

You can look up the benchmarks on that model as well. Your experience does not align with mine.


It's a lot easier to ride recklessly on a motorcycle than an ordinary bike. I suppose mopeds/motor scooters (especially electric ones) are the sensible middle-of-the-road option.
