
“Expert” that does not know what a Terraform is. lol, lmao even

You need to set sampling parameters for the LLM. I had the same issue with Qwen3.5 when I first started. You can usually grab them off the model card page.

From the Qwen3.6 page:

Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

Instruct (or non-thinking) mode: temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
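If it helps, here is a rough sketch of how those presets might be wired into requests against an OpenAI-compatible endpoint (e.g. a local vllm server). The preset values are straight from the model card above; the model name "qwen3.6" and the payload shape are illustrative assumptions, and real clients would pass the non-OpenAI fields (top_k, min_p, repetition_penalty) via extra_body:

```python
# Hypothetical sketch: sampling presets from the model card, bundled into
# a chat-completions request payload. Names/endpoint are placeholders.
PRESETS = {
    "thinking_general": dict(temperature=1.0, top_p=0.95, top_k=20,
                             min_p=0.0, presence_penalty=0.0,
                             repetition_penalty=1.0),
    "thinking_coding": dict(temperature=0.6, top_p=0.95, top_k=20,
                            min_p=0.0, presence_penalty=0.0,
                            repetition_penalty=1.0),
    "instruct": dict(temperature=0.7, top_p=0.80, top_k=20,
                     min_p=0.0, presence_penalty=1.5,
                     repetition_penalty=1.0),
}

def build_request(prompt, mode="thinking_coding"):
    """Return a /v1/chat/completions payload using the chosen preset.
    top_k, min_p and repetition_penalty are not part of the OpenAI
    schema, so they go under extra_body (a vllm convention)."""
    p = PRESETS[mode]
    return {
        "model": "qwen3.6",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": p["temperature"],
        "top_p": p["top_p"],
        "presence_penalty": p["presence_penalty"],
        "extra_body": {k: p[k] for k in
                       ("top_k", "min_p", "repetition_penalty")},
    }

req = build_request("Write a binary search.", mode="thinking_coding")
print(req["temperature"], req["extra_body"]["top_k"])  # 0.6 20
```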


min_p author here. min_p is strictly better than top_p and top_k. The big labs don't know shit about sampling, and give absolutely nuts recommendations like this.

Set min_p to something like 0.3, ignore top_p and top_k, and you'll be fine.

There are better samplers now, like top N sigma, top-h, P-less decoding, etc., but they're often not available in your LLM inference engine (e.g. vLLM).


I’m wondering though, what does extra creativity in code generation actually look like? How is the creativity expressed in code? Does the LLM reach for Bubble Sort instead of Quicksort? Maybe it decides that sorting only the first 10 elements of an array is enough? Funny variable names? Cursing in comments?

In this case we are not arguing that min_p is better for "creative code" (you really don't want high temperature anywhere near your code generation, despite the "turning up the heat" framing of our paper) - at least not in my post above claiming min_p is strictly better than top_p.

We are instead arguing that min_p is better at truncating tokens that are likely to lead to degeneration/looping, because it is partially distribution aware. Fully distribution-aware samplers like the ones I mentioned above (e.g. P-less decoding) are strictly superior, since they use the whole distribution to decide the truncation at every time step.

Code hallucinations, like many LLM hallucinations, can be seen as an accumulation of small "sampling errors".
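To make "partially distribution aware" concrete, here's a minimal sketch of the min_p idea (not the paper's reference implementation): keep any token whose probability is at least min_p times the probability of the most likely token, then renormalize.

```python
import math

def min_p_filter(logits, min_p=0.1):
    """Sketch of min_p truncation: keep tokens whose probability is at
    least min_p * p(top token), then renormalize the survivors."""
    m = max(logits)
    probs = [math.exp(x - m) for x in logits]  # softmax, numerically stable
    total = sum(probs)
    probs = [p / total for p in probs]
    threshold = min_p * max(probs)             # threshold scales with confidence
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]

# A peaked distribution keeps only the top token; a flat one keeps all of
# them - the truncation adapts to the model's confidence at each step.
peaked = min_p_filter([5.0, 1.0, 0.5, 0.1], min_p=0.1)
flat = min_p_filter([1.0, 0.9, 0.8, 0.7], min_p=0.1)
print(sum(p > 0 for p in peaked), sum(p > 0 for p in flat))  # 1 4
```

That adaptivity is the point: a fixed top_k or top_p cutoff ignores how confident the model actually is at each step.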


Cool, I am mostly a plumber for these things, but do you have any sort of reading I can go through to wrap my head around it to some degree?

Yes, I have tried all of these (as per the docs). Have you actually tried them? Because I have tried all 3 configurations you mentioned with agentic coding and get the same result - loops.

I've used only Qwen3.5 so far for work and was, after initial struggles, successful with a GPU setup, no mlx. Ngl, the fact that they are using `presence_penalty: 0` and no `max_tokens` is weird, because that exact setup caused my "initial struggles", but I've set up a simple docker-compose with vllm and qwen3.6 just now to test it out and it worked perfectly fine for me.

Gist with the compose and an example of the output: https://gist.github.com/meaty-popsicle/f883f4a118ff345b430c3...


Qwen3.5-27B with a 4-bit quant can be run on a 24G card with no problem. With 2 Nvidia L4 cards and some additional vllm flags, I am serving 10 developers at 20-25 tok/sec; off-peak is around 40 tok/sec. Developers are OK with that performance, but of course they requested more GPUs for added throughput.


What would be these additional vllm flags, if you don't mind sharing?


This is from an example from my Nomad cluster with two A5000s, which is a bit different from what I have at work, but it should mostly apply to any modern 24G VRAM Nvidia GPU.

"--tensor-parallel-size", "2" - spread the LLM weights over the 2 available GPUs

"--max-model-len", "90000" - I've capped the context window from ~256k to 90k. It allows us to have more concurrency, and for our use cases it is enough.

"--kv-cache-dtype", "fp8_e4m3" - On an L4 this cuts the KV cache size in half without a noticeable drop in quality; it does not work on an A5000, which has no native FP8 support. Use "auto" to see what works for your GPU, or try "tq3" once the vllm people merge it into the nightly.

"--enable-prefix-caching" - Improves time to first output.

"--speculative-config", "{\"method\":\"qwen3_next_mtp\",\"num_speculative_tokens\":2}" - Speculative multi-token prediction, a Qwen3.5-specific feature. In some cases it provides a speedup of up to 40%.

"--language-model-only" - does not load the vision encoder, since we are using just the LLM part of the model. Frees up some VRAM.
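Put together, the flags above would look roughly like this as a plain CLI invocation (the model name and any paths are placeholders, not my exact prod config):

```shell
# Hypothetical equivalent of the Nomad job args as a vllm serve command.
vllm serve Qwen/Qwen3.5-27B \
  --tensor-parallel-size 2 \
  --max-model-len 90000 \
  --kv-cache-dtype fp8_e4m3 \
  --enable-prefix-caching \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
  --language-model-only
```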


> "--speculative-config",

Regarding that last option: speculation helps max concurrency when it replaces many memory-expensive serial decode rounds with fewer verifier rounds, and the proposer is cheap enough. It hurts when you are already compute-saturated or the acceptance rate is too low. It's a good idea to benchmark your workload with and without speculative decoding.


Thank you!

Just curious, what's your setup like? How do the devs interact with the model?


OpenWebUI with postgres, vllm for inference, searxng for web search, and a few other MCPs for tools.


question: why not use something like Claude? is it for security reasons?


Some people would rather not hand over all of their ability to think to a single SaaS company that arbitrarily bans people, changes token limits, tweaks harnesses and prompts in ways that cause it to consume too many tokens, or too few to complete the task, etc.

I don't use any non-FLOSS dev tools; why would I suddenly pay for a subscription to a single SaaS provider with a proprietary client that acts in opaque and user hostile ways?


I think we're seeing very clearly that the problem with the Cloud (as usual) is that it locks you into a service that only functions as long as the Cloud provides it.

But further, as we're seeing with Claude: your workflow, or backend, or both, aren't going anywhere if you're building on local models. They don't suddenly become dumb, stop responding, claim censorship, etc. Things are non-deterministic enough that exposing yourself to the business decisions of cloud providers is just a risk-reward nightmare.

So yeah, privacy, but also knowing you don't have to constantly upgrade to another model, forced by a provider, when whatever you're doing is perfectly suitable - that's an untold amount of value. Imagine the early npm ecosystem, but driven now by AI model FOMO.


We do make Claude and Mistral available to our developers too. But, like you said, security. I personally do not understand how people in tech put any amount of trust in businesses that are working in such a cutthroat and corrupt environment. But developers want to try new things, and it is better to set up reasonable guardrails for when they want to use these things, via an internal gateway and a set of reasonable policies.

And the other thing is that I want people to be able to experiment and get familiar with LLMs without being concerned about security, price, or any other factor.


Because it's a great tool and the second it's not we can just do what you're doing :)


What a great write-up, and a video too! Even though the Minecraft stuff ofc was a bit of a bait, it would be interesting to see the answer to "Can it run Doom?".


From the article:

Only 40,960 words of memory. That’s only 90kb total memory to split between our code and the memory it needs at runtime.

Looking at a copy of Doom on the Internet Archive (https://ia800404.us.archive.org/view_archive.php?archive=/15...), DOOM.EXE is about 709k, and DOOM.WAD is about 11159k.

I think that's a pretty solid no.

Also, it's a 250kHz CPU. Not megahertz. Kilohertz. It's slower than the 1MHz 8-bit home computers like the Apple ][ or C64.

"Running" Doom might be possible with some insane hack that offloads storage and/or processing to more modern hardware crammed into the UNIVAC case but given that this is one of two UNIVACs in the entire world, and the only one that actually runs, I don't think the museum is gonna let anyone cram a Raspberry Pi up in there.


> a bit of a bait

"a bit" is doing a lot of work there. It was absolute nonsense. They were no closer to running a Minecraft server than I am to running UKGOV.


They hosted a program that allowed minecraft clients to connect... I'd class that as a minecraft server, even if it wasn't a very good one


> They hosted a program that allowed minecraft clients to connect...

Connect in the sense of receiving a login packet and saying "yes". That's it. Steps 1, 2, 3, 9, 10 of [0] (they didn't mention encryption or compression, I'm assuming they didn't implement it.)

They didn't mention anything about any of the steps past 10 - again, assuming they didn't implement them.

It's a trivial thing they've implemented - good work, sure, but a Minecraft server? Absolutely not.

[0] https://minecraft.wiki/w/Java_Edition_protocol/FAQ#What's_th...?


Not enough dedotated wam for all that.


Yeah, my thought exactly - the execution lacked, but I do admire the attempt.


Not Doom, but a ZMachine interpreter might run with:

- Zork I-III

- Calypso

- Tristam Island

- All the Z3 machine games at IF archive

- The rest of Infocom's proprietary games

https://www.ifwiki.org/List_of_Z-machine_interpreters

Also: https://ifdb.org/viewgame?id=lkr2jf03np19ieix

Now, if the game were libre software, it could be improved and ported to PunyInform (a 'lite' version of Inform6 tuned for smaller machines), creating a really small Z3 file playable on everything from the PDP-10 and 8-bit microcomputers to anything from today: smartphones, PDAs, GNU/Linux with Frotz, WinFrotz, Lectrote, and Fabularium for Android/Mac/iOS.

So, 'does it run Doom'? Man, you can play Zork on a pen with writing detection. How cool is that?


It could probably run the code for Doom, once recompiled for the RISC-V emulator, but given that the only output is a paper teletype, displaying it would be a problem.


> but given that the only output is a paper teletype, displaying it would be a problem

You are in a maze of twisty passages, all alike. A cacodaemon floats by, hissing.


I wonder which would be faster: computing a frame, or printing it? If you could print one frame at a time, you could make a flip-book animation.


And given the NES emulator example, take half an hour per frame.


Feels kind of like when Usagi Electric got "Doom" running on a vacuum tube computer with a teletype interface without support for even ASCII - but it was just an imitation of the background music.

Anything for the thumbnail.


What a load of e-waste. Nightmare.



Words have many meanings, and we don't retire old ones just because someone used them in a dirty context.

https://en.wikipedia.org/wiki/Docking_and_berthing_of_spacec...


Haha, I am not a native speaker, as you can imagine. The idea here is more of a spaceship/rocket docking at a base.


Hehe, yeah, I understand, just something that popped up. Great project!


The pleasant thing about routers is that it's so simple to build one after learning the basics of networking, and pretty much any OS or distro can act as one. There are obvious choices like OPNsense/pfSense, OpenWRT, DD-WRT, FreshTomato, but literally any PC with a single Ethernet port can act as one. My favorite setup was a laptop running Ubuntu where the whole router setup was a single netplan file + dnsmasq for DHCP.

Edit: And ofc the best cheap device imo is the OrangePi R1 LTS and whatever USB wifi dongle. Came in clutch a few times, such a nice little device.
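For the curious, the laptop-as-router idea boils down to very little. This is only a sketch with made-up interface names (eth0 for the LAN side, wlan0 for the upstream wifi) and a made-up subnet, not my actual netplan file:

```shell
# Hypothetical minimal router: forward packets, NAT out the wifi side,
# and let dnsmasq hand out DHCP leases on the wired LAN.
sysctl -w net.ipv4.ip_forward=1
ip addr add 192.168.10.1/24 dev eth0
iptables -t nat -A POSTROUTING -o wlan0 -j MASQUERADE
dnsmasq --interface=eth0 --dhcp-range=192.168.10.50,192.168.10.150,12h
```

In practice you'd persist the addressing in netplan and the NAT rule in nftables/iptables-persistent, but the moving parts are just these four.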


Been using DD-WRT for years. Current setup is a $50 Dell Optiplex i5 from ebay running x86 DD-WRT. I put an intel 4x 1Gbit NIC in it, and it's been an excellent router for years.


I remember having a "Chuck" plugin installed on our Jenkins back in the mid-2010s. Gave me a Chuckle every time I forgot it was there.


It's mostly OmniOS/SmartOS and other Illumos (descendant of OpenSolaris) distributions. All the Solaris 11 deployments I was aware of in the mid-late 2010s have now either migrated to some sort of container setup or are running on OmniOS.


Look into vxray; it works for my wife's family. AmneziaVPN worked for me during my last visit too.


To my surprise, even sophisticated means of traffic masking like amnezia and vxray get disrupted frequently, requiring hopping between self-hosted solutions and updating one's setup periodically. That's waaay beyond what most people are capable of. I am fortunate to have a tech worker acquaintance who lives next to my family members; otherwise there'd be no way for me to, for example, guide them through setup and re-configuration remotely. Still, this setup gets disrupted every month or so, requiring manual intervention.


Try to get a middle hop somewhere in a Russian datacenter. Sometimes these have the DPI censorship boxes disabled (?) -- I know one that lets me forward plain Wireguard from mobile routers to an EU server with a few SNAT/DNAT rules, even though ordinarily that would get blocked on sight.

(Sadly, it's just Mikrotik gear that can't use any fancy censorship evasion protocols).


If it's an ARM-based Mikrotik, then you could run containers with fancier protocol wrappers.


I have 3x-ui installed in the Netherlands and everything works fine so far.

But sure, they are trying their worst to block every channel of data exchange they can.


I would say they are trying to block every public VPN. And if some VPN tries to hide behind Cloudflare, thinking that it has taken all Cloudflare sites hostage, then the whole of Cloudflare gets nuked, and the hostages don't save the VPN from being blocked.

