Hacker Newsnew | past | comments | ask | show | jobs | submit | plufz's commentslogin

I think you should consider the possibility that people who ”don’t mention it” does so because we don’t see the science pointing in that direction.

Which exact model are you using? And with which parameters and quant? And on what hardware? Are you using any specific MCPs or other tools to optimize performance like context-mode or dynamic context pruning? I’ve used local models a reasonable amount before but I’m just starting out with opencode. Haven’t had great results yet but really want this to work for simpler tasks. My opencode newly installed is also having iterm on 100% cpu in idle. :/

I'm running Qwen3.6:27b Q4 KM on a 4090 and similarly fast CPU and I think 32GB of RAM. Make sure the context window is set to be big enough otherwise the conversation will keep compacting. No special MCP tools set up yet. Qwen is able to do web search out-of-the-box although I think it is getting blocked by anti-bot firewalls--I still need to figure out if I can fix that.


here's a simple setup to get you started on an Apple M1 Max from 2021 with 32GB VRAM. it will download 20GB of models to `~/.cache/huggingface/hub`, which you can delete when you're done.

  /Users/gcr/llama.cpp/build/bin/llama-server
      -hf unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M
      --no-mmproj-offload
      --fit on
      -c 65536 # edit to taste
      --reasoning on --chat-template-kwargs '{"preserve_thinking": true}'
      --sleep-idle-seconds 90 # very aggressive: purge model from vram after this long
      -ctk q8_0 -ctv q8_0 # Optional. Lower memory use, but lower speed. Omit if you can.
I don't recommend ollama or lm-studio. Ollama's in the process of switching from their llama-cpp backend anyway, but their new go framework frequently OOMs and crashes on my hardware. I also don't recommend MLX-based inference backends on this hardware; I've found them to consistently reduce performance, contrary to what I've read online. I've tried all the llama-cpp metal forks, but right now, MTP, TurboQuant, MLX, etc etc etc are too new and just slow things down. It's all dust in the wind still.

For agent harnesses, opencode is okay, as is pi or even Zed's built in agent panel. Claude code "works" with ANTHROPIC_BASE_URL=http://localhost:8080/v1, but is very chatty (the default system prompt burns 20k tokens). Crush (from the charm-bracelet folks) is particularly nice when starting out. I've personally converged on pi-agent under an otherwise-mostly-default setup. You can ask qwen to customize pi or write you an extension which helps a little.

You'll need to add `http://localhost:8080/v1` as an OpenAI-compatible model provider in your coding harness with any API key (doesn't matter) and any model identifier (doesn't matter with llama-cpp).

Note that pi doesn't have permissions. Everything is permitted. The hundred hungry ghosts you've trapped in a jar WILL find a way to delete your home folder someday. That's what Man gets for summoning demons without casting a circle of protection first. Flying too close to the sun etc etc etc

Take backups and then go have fun. Hope this helps.


I have a 5070TI (16gb VRAM) with 32GB system ram and a 16 core AMD cpu. I am considering buying a second used videocard, probably the same model, but not for months yet. This hardware setup is new-for-me in that a buddy gave me most of it and I bought the TI card.

Are there any resources to help me figure out how to best optimize my runtime paramaters for a given model, based on a given task, similar to what you've shown?

I've been a little... irritated? that hooking vscode up to my company LLM subscription seems so much more out-of-the-box capiable than what I can get to work. My assumption at the moment is that I need to create a lot of... I think they're called harnesses? agents? workflows? integrations? (not sure) by hand. Is that accurate?

Right now I have ollama running an nvidia nano model and I can poke it with a stick over a web interface I installed. It works, initial token response is slow, after that it seems fine enough.

I can't seem to get a good handle on how much context I've used, when context usage starts to degrade response accuracy, or in general how to mirror the results I get (not in terms of accuracy or speed, just features) from the company github copilot + vscode integration.

I was also trying to get a plugin called qodeassist working via qtcreator, mixed results there as well.

I've been keeping up with this space since the jump, never paid for a sub, work gave me a sub a handful of weeks ago, so the actual useage is all new to me.

I can't say I'm super impressed with any of it relative to the hype, but I found it neat to be able to point vscode at a c++ codebase and say "enable wextra, build the code, tell me if there is any low-hanging fruit I can clean up" and get a useful response.

I also asked my local model to turn a picture of my dog into a picture of an otter, got a blank picture back, which the thinking bit told me it would do. The whole thing was actually kind of funny. "I am allowed to edit pictures, I can't edit pictures, I am allowed to edit pictures, I'll tell the user I did and send a blank picture back because I can't edit pictures, but I am allowed to."


Can you elaborate more on the differences in running ollama or lmstudio? Do they actually slow down the speed of the inference and if so why? Or is it just a preference thing?

Ollama and LM-Studio are fine. Their main advantage is that they have a nice way to browse models -- LMStudio from huggingface and Ollama from their own curated list. Both are great ways of getting started. Pick LM-Studio if you'd like a nice GUI frontend to mlx-lm or llama-cpp; pick ollama if you'd like a nice command line interface and don't need non-default parameters.

LM-Studio doesn't support certain parameter combinations. For instance, LM-Studio supports KV quantization....but if you're using the MLX backend, you can't set the context length when KV quantization is used? Why? Running a model with certain settings requires keeping a little SAT solver going in your head. I found that overwhelming, so I just stopped using it.

The Ollama devs want to offer a central curated experience, but I perceive their approach as "playing fast and loose." They've re-implemented unique code for every model they support in their own Go runtime, so certain parameter choices aren't supported. On my hardware, their MLX backend just doesn't work at all without segfaulting the server process for example. It doesn't smack as vibe coded the way oMLX does, but it also doesn't smack as professional or battle-tested.

Ultimately, just dropping down to llama-cpp's GGUF model support and asking for default settings has provided faster inference speeds than anything I've been able to benchmark with them, but everything's within 10% of each other anyway so it's not a huge deal for me.


Thank you, that makes a lot of sense

Thanks a million!

Yeah Apple is smart enough to know that large conspiracies tend to leak sooner or later. Your solution would absolutely have been my choice if I wanted to slow down old iPhones, and I’m quite sure the leadership at Apple are smarter than me.

But wasn’t that just a setting changing it to default to fn? It was some time since I last used them…

> But wasn’t that just a setting changing it to default to fn

They relented and added this settings two years in, IIRC. It wasn't there from the start


Don’t use those expensive escrow services, much cheaper to keep your own crows!

I used to be mostly at high/xhigh but now at medium I think it actually performs quite well both on results and token usage.


Does gemma work better than qwen3 in your experience?


Not in mine. I see a lot of people talking about Gemma on here but in my circles pretty much everyone else is running qwen.


You and I have very different ideas of dystopia.


Personally I enjoy the basic human rights of privacy and freedom of speech which are deeply lacking in the UK system.


That I can agree on. The right for children to by energy drinks is something different for me.


Other people enjoy their children not being shot.


Both systems can be bad for different reasons. I'm not making any comparisons.


I also have things I want to change in gatekeeper, but that feature is not one of them. Just gut feeling but I would say 110% of all users, would just click ”start” on every unsigned app if it was that easy.


Bingo. I know I would.

I am the king of knowing immediately when I have fucked up.

“Undo” has made us far too comfortable with mistakes.


they could do it like they do it for accessibility settings. you have to opt in for an app and you need to know damn well if it is a reputable app before giving those controls over. there's enough friction in that that it is not done by many apps but not hard enough that it's a huge ask to whitelist the app.


So have a buried option that power users can flip one time to add an allow button to opening untrusted apps.


But that's exactly what `sudo spctl --master-disable` does! You'll still see a warning dialog on first launch.


So you don't lose any of the protections, just are allowed the option of running anyway (or backing out and NOT running it after getting the warning)?


I don’t understand what you mean by “protection”. The “protection” offered by Gatekeeper is that you aren’t able to run unsigned software without going into System Preferences. That’s it. There isn’t some other secret sauce.

Without Gatekeeper, macOS will instead pop up a dialog warning you that the application was downloaded from the internet, and provide an option to run it anyway, on first launch.


That’s good to know, but the spelling of the command is incredibly user hostile, even by modern apple standards.


> the spelling of the command is incredibly user hostile

Well the command is spctl, so I assume it stands for (s) Security (p) Policy (ctl) Control.

I agree that "ctl" for "control" is a bit weird but it's a pretty typical Unix convention: pfctl, networkctl, systemctl, etc.


See previous days articles. Agentic coding. Going from 1b annual commits to estimated 14b or more from one year to another.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: