Hacker News | c0rruptbytes's comments

I don't know about good, I use a lot of local models and they're still pretty painful to run locally

You have dense models (qwen 27b, gemma 31b) who are pretty smart, but pretty slow

You have MoE models (gemma 26b, qwen 35b, north mini code 30b) who are pretty fast, but make a lot of mistakes

You need a lot of memory to run these well. Quantization makes tool calling weaker, so most people run 4-bit quants and wonder why it kinda sucks; that's because you've essentially lobotomized the model (I recommend unsloth quants, 6-bit for MoEs and 5-bit for dense)

So you need a lot of compute to make the pre-fill fast, you need bandwidth to make the decode fast, you need a lot of memory to hold everything - lot of ifs

On top of that, your laptop becomes a loud hot churning machine, it's uncomfortable to work with.

So are they good? not really. Do they work? yes
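To put rough numbers on the bandwidth point above: during decode, each generated token has to read every active weight once, so memory bandwidth divided by the bytes of active weights gives a ceiling on tokens per second. A minimal Python sketch, assuming a hypothetical 400 GB/s of memory bandwidth, taking the "A3B" in the MoE model names to mean about 3B active parameters, and ignoring KV cache reads and other overhead (real throughput will be lower):

```python
def decode_tokens_per_second(active_params_billion: float,
                             bits_per_weight: float,
                             bandwidth_gb_per_s: float) -> float:
    """Upper bound on decode speed if every active weight is read once per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_per_s / gb_read_per_token


BANDWIDTH_GB_PER_S = 400  # hypothetical machine, not a measured figure

# Dense 27B at 5 bits: all 27B weights are touched for every token.
dense = decode_tokens_per_second(27, 5, BANDWIDTH_GB_PER_S)
# MoE with ~3B active parameters at 6 bits: only the routed experts are read.
moe = decode_tokens_per_second(3, 6, BANDWIDTH_GB_PER_S)

print(f"dense ceiling: {dense:.1f} tok/s")  # dense ceiling: 23.7 tok/s
print(f"MoE ceiling:   {moe:.1f} tok/s")    # MoE ceiling:   177.8 tok/s
```

This only models decode. Prefill is bound by compute rather than bandwidth, so it needs a separate estimate.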


This is basically my experience as well. I have a moderately recent but high spec desktop (Radeon 6900 XT with 16 GB VRAM, Ryzen 9 7900X 12-core, 64 GB system RAM), and I tried out some recommended models with ollama a month or two ago. Anything not geared specifically towards coding seemed to struggle with actually making tool calls instead of just stating the actions they would take without making them (and trying to get help from them to explain what I needed to configure to change that behavior was useless; qwen refused to believe that it was running in ollama and insisted that it was running from the Alibaba cloud without access to my local system), and the models intended for coding were barely thinking faster than I could type (if they had any ability to show thinking at all).

The best "free" experience I've found is using OpenCode with Big Pickle. It's not especially smart, so it often won't produce the correct result the first time, but the free tier is generous enough that I don't think I've hit the limit more than twice over around a month with frequent multi-hour sessions. If running locally is truly the goal, it's not going to fit the bill, but if the goal is just "get the best experience without having to pay for a sub or tokens", it's the least bad option I've found so far.


> The best "free" experience I've found is using OpenCode with Big Pickle.

I have absolutely zero interest in free. I honestly don't think I'm even remotely in the same demographic as people using free tiers / models.

I want to pay. I don't want my data used for training. I want it to be open. I want it to be consistently up (more than Claude!). I want it to be fast. I don't want it to be subsidized as that's just an excuse for shitty quality. Deepseek flash knocks it out of the park on all of these except that your data is used in training. I'm fine with it being hosted since there's no way I'm using it 24/7, but data MUST be private.

Basically I want Hetzner and OVH to run open model clouds. I'm convinced this is going to happen eventually when everyone realizes this is a commodity.


A median laptop is no bueno for running a reliable model (which will be qwen 27b as per my reading here and r/localllama). Powerful Macs may be prevalent in certain areas of the world, but in the rest of the world personal machines aren't always that powerful.

Maybe we shouldn't be running these models on laptops with their thermally constrained form factor, and we shouldn't expect quick inference on a par with a large cloud-based platform either, at least not for near-SOTA model quality. It's still worth it to avoid becoming massively reliant on centralized services.

I have a 5070 12 GB laptop GPU and can hit 72 tokens per second in the first couple thousand tokens before dropping to mid-high 50s after about 15k context

This setup is extremely optimized down to the last flag. Changing any param from temp and below craters performance.

  # 1,257 tokens 17s 72.18 t/s

  $env:CUDA_DEVICE_SCHEDULE = "SPIN"
  cd D:\src\llama.cpp\
  .\build\bin\Release\llama-server.exe `
    --port 8080 `
    --host 127.0.0.1 `
    -m "D:\LLM\Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf" `
    -fitt 2048 `
    -c 98304 `
    -n 32768 `
    -fa on `
    -np 1 `
    --kv-unified `
    -ctk q8_0 `
    -ctv q8_0 `
    -ctkd q8_0 `
    -ctvd q8_0 `
    -ctxcp 64 `
    --mlock `
    --no-warmup `
    --spec-type draft-mtp `
    --spec-draft-n-max 2 `
    --spec-draft-p-min 0.1 `
    --chat-template-kwargs '{\"preserve_thinking\": true}' `
    --temp 0.6 `
    --top-p 0.95 `
    --top-k 20 `
    --min-p 0.0 `
    --presence-penalty 0.0 `
    --repeat-penalty 1.0

Can you comment on the quality and accuracy of it? People have managed to run Gemma 26b without GPU on old CPUs but I don't think quality is anywhere close to what Gemma 12b offers.


That’s useless without describing WHY you chose those flags, and how you did the optimisation…

I get over 100 tok/s sustained on my M4 Max and M5 Max MacBook Pros, with LM Studio + MLX.

With Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf?

Also, funny lumping the M4 "and" the M5, I find them 15% to 45% different performance, depending.

And for a good deal of work, an M3 Studio Ultra outpaces the M4 and ties the M5 on one job at a time, and outpaces both when running multiple jobs at once.


That's a quant 4 which the thread OP specifically called out as rubbish.

The Q4_K_XL bit for those not in the know.


IMO running local models "well" still requires an expensive hardware investment. You really want 96GB of VRAM on a modern Blackwell arch to run these models with decent KV cache. Trying to run them on a unified memory Mac, an AI Max AMD processor, or a DGX Spark-alike is really just asking for trouble. Prefill kills perf.

If you throw the right GPUs at the problem, they become much better - but still not quite in the realm of Sonnet or DeepSeek 4 Flash, let alone Opus / DeepSeek Pro or Mythos/Fable/GPT-5.5.

Given enough budget, power, and cooling, you can run some pretty good data pipelines, but for code, I think it still makes sense to shell out to an API provider most of the time.


FWIW I think it might be both.

Ultimately if you skip over the opportunity to play with these models on your own machine you are losing out on a lot of really interesting educational opportunities — it helps make a lot of stuff feel more concrete in a way that only tinkering can.

But then I think once I had an idea of something that I was building against Gemma 4 or Qwen 3.6 I would be looking at openrouter etc., to stabilise it for the next tier of experimentation (and to get back a kind of multi-device access without tailscale/lm link etc.).

Are they good enough to replace what people seem to want to do with Claude? Maybe not. But it's an unparalleled learning opportunity.


Not really, Qwen 27b offloads to a decent gaming GPU (RTX 4090 in my case) without needing tons of RAM.

can you give more info? llama.cpp vs vllm? config? i wanna try specifically this model

Gemma 4 is particularly good at pipeline/automation tasks.

It outperforms all the Qwen models (even 100B+) for rule following/automation style tasks in my experience. Its image interpretation is also very good, and out-benchmarks Opus.

Qwen seems to ignore instructions and consistently outputs incorrect formats (when token generation format is not explicitly constrained)

But yes, on the DGX Spark Gemma 31B Q4 with MTP runs around 20 tok/s and Gemma 26B A4B around 60 tok/s. Still quite slow. But on a high end Nvidia card would run significantly faster and still fit in memory.

I'd recommend that anyone getting into local models focus on memory bandwidth over RAM capacity. Models under 100B parameters are now sufficient and hugely useful for automation.

I agree that for coding/creation use cases, there's still not a compelling argument for local models.

But e.g. if you want to scan a list of stocks and interpret news/high pass filtering, interpreting logs, interpreting screenshots, the local models are more than sufficient already.


This is not my experience at all. Even the Nous Research guys have stated that "Qwen3.6-27B is the canonical local model to use Hermes Agent with" [https://old.reddit.com/r/LocalLLaMA/comments/1sz2y76/ama_wit...]. I am finding the same when used with Pi and OpenCode.

Gemma will just stop mid-tool call. It's been slower and I've had to reduce context size to run it. Qwen3.6 27b has been rock solid using club 3090's single card setup for agentic use -- https://github.com/noonghunna/club-3090/blob/master/docs/SIN...


I'm talking about automation generally, not agent loops.

E.g. prompt A to achieve X, output in format Y. Use Y to do something in prompt B.

Agentic loops will underperform deterministic control flow pipelines (with non-determinism constrained to LLM calls).

Agents are more general, which is the main advantage. But inherently a more general solution will waste context on unnecessary reasoning.

Try asking Qwen to output a JSON in a specific format. It basically can't do it consistently with a moderately sized prompt unless you constrain the token generation via GGML or are extremely repetitive and specific about it. (Thinking disabled)

Gemma 4 will do it correctly pretty much 100% of the time. (Thinking disabled)

Applies to other rule following as well in my experience.

Qwen may be better at toolcalling
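The "prompt A, output in format Y, feed Y into prompt B" pattern above can be sketched as ordinary control flow where the only non-deterministic part is the model call. This is a minimal Python sketch, not anyone's actual pipeline: `call_model` is a hypothetical stand-in for whatever client you use, and the `ticker` / `sentiment` / `summary` fields are an invented example of format Y. It validates after the fact and retries; constraining generation with a grammar in the inference engine is the stricter alternative mentioned above.

```python
import json
from typing import Callable

# Hypothetical "format Y": field name -> required Python type.
REQUIRED_FIELDS = {"ticker": str, "sentiment": str, "summary": str}


def parse_format_y(raw: str) -> dict:
    """Parse model output and check it matches format Y; raise ValueError if not."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    if set(data) != set(REQUIRED_FIELDS):
        raise ValueError(f"unexpected keys: {sorted(data)}")
    for key, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data[key], expected_type):
            raise ValueError(f"{key} must be {expected_type.__name__}")
    return data


def run_step(call_model: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call the model, validate its output, and retry a fixed number of times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_format_y(call_model(prompt))
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {last_error}")
```

The validated dict from one step is what gets interpolated into the next prompt, so a malformed reply fails loudly at the step boundary instead of leaking into prompt B.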


On a 5090, gemma4 26B runs at 350TPS with the command below [1] and gemma4 31B is around 150TPS with a similar command.

I'm really surprised how much slower a DGX spark is for the same price.

1. Here's my command.

  PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \
    --dtype auto \
    --gpu-memory-utilization 0.95 \
    --kv-cache-dtype fp8 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 \
    --reasoning-parser gemma4 \
    --max-num-batched 16000 \
    --max-model-len 64000 \
    --max-num-seqs 12 \
    --speculative-config '{"model": "./gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 4}'


Yes, I'd recommend a 5090 over the DGX Spark if your goal is general automation.

You can run multiple instances of these models in parallel on the DGX Spark which somewhat mitigates the difference if your task is parallelizable.

But I'd take the simplicity of a single thread and higher throughput personally.

Overall of course still better to wait for next gen devices if you can.


In my mind it’s a question of knowing what you want to build and how to divide the project into tasks your local setup can handle.

If you don’t need the machine to respond instantly (or explain your own business model to you) everything can be local and it’s been like that for a few years now.


Depends on what you mean by "local". On your Macbook, large dense models like Qwen 3.6 27B will be slow, sure. On a local workstation with a dedicated RTX card you can get > 100 tps, which is more than good enough to work with it, and faster than cloud models in many cases.

But how smart is it? All the people running local models never seem to mention that they are way dumber than cloud models.

I don't care how many tokens per second of nonsense it can generate.


Qwen 3.6 35b a3b is about as good as sonnet 4.5. It varies but it's at that level.

Quantized Gemma 4 26B is as smart or better than GPT 5 in most of my testing. Granted GPT 5 is nearly a year old at this point, but I can run Gemma 4 on a ~6 year old consumer GPU (RTX 3090) and get 140 t/s.

It is smart enough that I use it for all my coding tasks, and a lot of other mundane tasks.

It is probably not smart enough for "design this whole architecture of this complex system from scratch, make no mistakes", but that is not something I want from a coding tool anyway. I want a model that I can point to a file and tell it to make some changes to the file and related files. Or that I can ask to review a PR with regards to certain aspects.

My suggestion is to simply try it and see what it feels like.


It's not going to be as good as Claude, but if you know what you're doing, it may be good enough to get your work done.

This is task dependent.

I find devstral (even though it’s weak generally) much better at writing and documentation than Opus. I’m actually now delegating all documentation to devstral and away from Claude, which makes a mess.


A highly skilled carpenter may be able to 'get work done' by banging nails in with a heavy-bottomed cocktail glass, doesn't mean it's not painful to do so when it is continuously breaking and leaving shards of glass all over the workshop for you to find every day for the rest of your life until you clean up the mess you made using the wrong tool for the job.

More like, a highly-skilled carpenter can work miracles with a $6 hammer from the hardware store, while the pros on the commercial crew are using fancy compressed-air tools.

The carpenter has to get up close and personal with the wood. He can't match the crew's throughput, but maybe that's not what he's trying to do.


I'm talking about the common use case that I think hacker news people have:

you get a macbook for work, you run the macbook

they're not going to start giving GPUs to employees to run local models


What counts as a lot of memory? What could someone do with 16 GB of RAM?

Not much, the capable models won't fit unless you go with very low quantization but that leads to a lot of loss.

You generally want to run q8 or some kind of "6bit" quantization at least.

40GB of VRAM is the entry-point in my experience, you can run qwen 3.6 35b a3b with full context or qwen 27b with about 92k of context.

Before you get fully discouraged, you don't need 1 gpu with 40GBs you can use multiple cards, with minimum impact on performance.
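A back-of-the-envelope way to check what fits: weights take roughly parameters times bits per weight divided by 8, and a plain transformer KV cache adds 2 (keys and values) times layers times KV heads times head dimension times context length times bytes per element. The Python sketch below uses hypothetical layer and head counts, not the real configuration of any Qwen or Gemma model, and it ignores runtime overhead and any architecture-specific savings.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in (decimal) GB."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_element: float) -> float:
    """Approximate KV cache size in GB for a plain transformer with full attention."""
    elements = 2 * layers * kv_heads * head_dim * context_tokens  # 2 = keys + values
    return elements * bytes_per_element / 1e9


# 27B dense model at a 6-bit quant.
weights = weights_gb(27, 6)

# Hypothetical architecture: 48 layers, 8 KV heads, head_dim 128,
# 92k tokens of context, roughly 1 byte per element for an 8-bit KV cache.
cache = kv_cache_gb(layers=48, kv_heads=8, head_dim=128,
                    context_tokens=92_000, bytes_per_element=1)

print(f"weights: {weights:.2f} GB")           # weights: 20.25 GB
print(f"kv cache: {cache:.2f} GB")            # kv cache: 9.04 GB
print(f"total: {weights + cache:.2f} GB")     # total: 29.29 GB
```

Under these assumptions the total lands under 40 GB but well over 16 GB, which is consistent with the sizing advice above.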


Not a ton. I'd say 64 GB minimal to play, 96-128 GB better.

Nah, you can run the 24b - 35b class with between 90k and 256k of context with about 40GB and they are pretty good. Especially the MOE variants fit neatly in 40GB.

Modern inference engines can stream in weights from SSD in order to save on RAM, but this makes inference very slow, especially for the trivial single-session case. (Jury is still out on whether batching multiple sessions together can mitigate this well enough, but even then that's mostly helpful for the "running lots of inferences overnight and getting fresh results first thing in the morning" case. Which is interesting (the big third-party suppliers don't really offer a way of doing this at reasonable cost) but a bit of a niche.)

Gemma e2b, Gemma e4b. It's made for smartphones basically. You can run e2b with 8GB RAM.

gemma 12B 4bit quant; try something with MTP and an AWQ quant

gemma runs pretty well

4 bit unsloth quants are good if you never ask for more than 20k context, use it as autocomplete on steroids, and never delegate serious questions to it

They are good if you were clever enough to buy a powerful enough rig before memory went up. For everyone else I say just wait. M1 Ultra 128GB and higher is sufficient to run gemma4:31b-mlx or qwen3.6:35b-mlx with subagents. It’s only slow if you don’t know how to plan your work effectively.

maybe painful if you are using it like a chatbot. you are sitting there waiting for response. vs ambient ai like automatically classifying your family pics and discarding random things like parking floor number pic.

i use it for use cases like the latter and they are fine.


Minimax M3 too, and huawei claims to be releasing non-nvidia dependent training software too. openPangu 2.0 could be a shake-up if it holds up as a good model

China may not care about open source, but they know they will personally fund AI through government investments while US relies on private investments, best way to scare private investments is a free capable alternative for everyone

Add on the fact that they actually invested in energy infrastructure and can offer AI very cheap to their citizens and you can get a population well versed in AI to reduce menial tasks and focus on more productive things (if we're to believe the claims of the technology)


i’m running m4 pro 48gb right now

omlx + gemma 12b 6 bit + pi

it’s feasible for sure

MoEs for speed (qwen 35b, cohere 30b, gemma 26b)

Dense for more methodical work (qwen 27b [reigning champ], gemma 31b, gemma 12b)

MoE i recommend 5bit+

Dense i think 4 bit is okay

Play with your context size, you don’t really need that much, have lazy loading for tools and mcps

my pi extensions for anyone looking for a skinny quick setup, i use `--no-skills` right now too:

    "npm:pi-codex-goal",
    "npm:pi-simplify",
    "npm:pi-mcp-adapter",
    "git:github.com/elpapi42/pi-minimal-subagent",
    "npm:@wierdbytes/pi-statusline",
    "npm:@aliou/pi-guardrails",
    "npm:pi-lens",
    "npm:@juicesharp/rpiv-todo",
    "npm:pi-hashline-readmap",
    "npm:@mrclrchtr/supi-review",
    "npm:pi-cmux",
    "npm:@mrclrchtr/supi-context",
    "npm:pi-tool-search"

think of local models as "zero sugar" models and that's where we're at right now. I think it's crazy how good these models are compared to last year's frontier models

it does not result in great results left unattended, it’ll start creating slop or hardcoding solutions

but over time, if you adjust your verification rubric, it’s not too bad, gets pretty good. if you do make it do TDD, it gets kinda crazy and you’ll have 2000-3000 tests after a while, or in my common case, 6000-7000 lines of code in single files (i usually have a cron to audit files for decomposition and create tickets)

i wouldn’t use it at my job yet, but it’s been fun to use for personal projects - it’s like modded minecraft automation or factorio


Static analysis can help here! Add CI checks for duplicated code or file length.

For test growth, maybe use a coverage tracker and remove redundant tests?
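A file-length check like the one suggested above needs very little code. This is a minimal Python sketch using only the standard library; the 500-line limit, the `.py` suffix filter, and the function names are illustrative assumptions, not part of any existing tool.

```python
from pathlib import Path


def oversized_files(root: str, max_lines: int = 500,
                    suffixes: tuple = (".py",)) -> list:
    """Return (path, line_count) pairs for source files longer than max_lines."""
    offenders = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            with path.open("r", encoding="utf-8", errors="replace") as handle:
                line_count = sum(1 for _ in handle)
            if line_count > max_lines:
                offenders.append((str(path), line_count))
    return offenders


def main(root: str = ".", max_lines: int = 500) -> int:
    """Print offenders and return a process exit code (1 if any file is too long)."""
    offenders = oversized_files(root, max_lines)
    for path, count in offenders:
        print(f"{path}: {count} lines (limit {max_lines})")
    return 1 if offenders else 0
```

In CI, ending the script with `raise SystemExit(main())` makes the job fail whenever a file grows past the limit, which catches the single 6000-line file before it gets merged.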


There's `honcho` for memory, i'm starting to play with it now, but I feel like I've seen a lot of projects pop up for it

I like Zed...

but AI dev workflows get complicated fast

you start with claude code or codex and it's cute, but then you realize - hmm configuration is cheap, the AI can do it!

then you start looking into MCPs and skills, fuck it, oh-my-pi looks awesome!

wait a second? I can just have AI make my own personal AI harness! Next thing you know, you're writing the 5th version of "little-coder" or similar using the Pi library

ahh shit, you just read an article that `tools` are actually crazy important for AIs, using `sed` is dumb when `hashline` + ASTs are way better, lets just start writing our own tools!!

...anyway I just use Zed, simple agent on the left, code on the right

i have some pretty complicated automated workflows that use `linear` + an orchestrator -> implementer -> reviewer -> releaser workflow, but it's less a dev stack and more an AI factory


seems more like a culture problem, i have my calendar very public, all my junior devs know ill get on a zoom with no hesitation and they actually seem to enjoy the screen sharing, every zoom is recorded with AI summary/transcript so they’re more focused on asking questions instead of taking notes (and i think they’re really solid juniors and actually go back and watch)

there’s the whiteboard element but i’ve gotten pretty good at excalidraw and zoom annotating

add in that being remote makes it kinda easy to not be distracting in meetings, so i can easily DM them context on the side to get them ramped up easier


Tossing in my two cents here to agree with you. I worked remotely on and off from about 2014 onward until post-COVID RTO brought me into an office for 18 months before I became remote again. During that time (and across a bunch of companies) I went from desktop support to senior sysadmin to security on the cusp of senior security engineer.

In my experience the biggest factor in teams usually came down to the middle management layer. If their "style" was "watch over your shoulder, butts in seats" type of micromanagement then juniors didn't tend to progress unless they were self motivated to seek it out.


I'm sure this is quite a personal thing. I much prefer being in-person for that kind of interaction, and I don't think it's about efficiency as such, I just prefer being around people despite not being an extrovert - hybrid working is perfect for me.


zoom settings fucking suck to set up full AI summary / transcript btw. i know it's a one time cost but it's across every engineer


we hired a few juniors at our fully remote company - no issue

this is ft trying to help their real estate portfolio


AI doesn’t have a model or agent availability problem to be fair, it does have a positive outreach problem and pewdiepie can do extremely well there

just my 2c


okay i watched his video on it, i can definitely abide.


ideally if ternary models work, the math is extremely easy for computers (addition/subtraction vs 16 bit multiplication)


Not quite as I understand it. The ternary approach bonsai uses leverages a FP16 scaling factor that each value in the ternary maps to. You're still using 16 bit multiplication, it's just that the weights are far more compressed.


fair, i think i was referring more to 1.58 bit architecture in general since the original paper (Figure 3) shows that we eliminate FP16 multiplication and addition just for INT8 addition. I need to dive deeper into bonsai overall if it differs

https://arxiv.org/pdf/2402.17764
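Both points in this exchange can be seen in a toy matrix-vector product: with weights restricted to -1, 0, and +1, the inner loop only adds or subtracts activations, and a floating-point scale is applied once per output instead of once per weight. This is a plain Python sketch for illustration only; the single per-matrix `scale` is an assumption and may not match how bonsai or any particular implementation groups its scaling factors.

```python
def ternary_matvec(weights, activations, scale):
    """Multiply a ternary weight matrix by an integer activation vector.

    weights:     list of rows, each entry is -1, 0, or +1
    activations: list of ints (e.g. quantized activations)
    scale:       one float applied to every output (assumed per-matrix here)
    """
    outputs = []
    for row in weights:
        acc = 0
        for w, x in zip(row, activations):
            if w == 1:
                acc += x      # +1 weight: add the activation
            elif w == -1:
                acc -= x      # -1 weight: subtract the activation
            # 0 weight: contributes nothing, no arithmetic at all
        outputs.append(acc * scale)  # the only multiplication per output
    return outputs


weights = [[1, 0, -1],
           [-1, -1, 1]]
activations = [3, 5, 2]
print(ternary_matvec(weights, activations, scale=0.5))  # [0.5, -3.0]
```

So the multiplication does not disappear entirely, but it drops from one per weight to one per output value.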

