
> The ai tooling reverses this where the thinking is outsourced to the machine and the user is borderline nothing more than a spectator, an observer and a rubber stamp on top.

I find it a bit rare that this is the case, though. Usually I have to carefully review what it's doing and guide it, either with specific suggestions or with specific tests. I treat it as a "code writer" that doesn't necessarily understand the big picture. So I expect it to fuck up, and correcting it feels far less frustrating if you consider it a tool you are driving rather than letting it drive you. It's great when it gets things right, but even then it's you who is confirming this.


This is exactly what I said in the end. Right now you rely on it fucking things up. What happens to you when the AI no longer fucks things up? Sorry to say, but your position is no longer needed.

Additionally, it's not like you're constrained to write it in bash; you could use Python or any other language. The author talks about how you're now redeveloping a shitty CI system with no tests? Well, add some tests for it! It's not rocket science. Yes, your CI system is part of your project and something you should be including in your work. I drew this conclusion way back in the days when I was writing C and C++ and had days where I spent more time on the build system than on the actual code. It's frustrating, but at the end of the day having a reliable way to build and test your code is no less important than the code itself. Treat it like a real project.

> The "dump" on their end was to use this as marketing bait and a way to inflate their valuation.

Maybe a bit different but I think it's worth pointing out how this parallels the state of the job market right now.

It is so hard to get hired, and there are so many fast-moving and diverse frameworks, libraries, and technologies you are expected to know, that it's almost impossible to keep up and stand out.

The only way to do it is to develop "projects" that demonstrate your abilities in each target domain, and in these days of vibe coding they need to be more than sketches: full-fledged applications that can draw real attention to you and, if you're lucky, land on the front page somewhere.

And with vibe coding it can be done relatively quickly.

So we're in this state of new projects, very impressive looking projects, getting posted every day, all the time, and about 1% of them will see any kind of longevity because the vast majority will be dumped as soon as the author gets a job.

This makes it increasingly difficult to select dependencies for downstream work.


> Experts can be swapped in and out of VRAM for each token.

I've often wondered how much this happens in practice. What does the per-token distribution of expert selection actually look like during inference? For example, does it act like a uniform random variable, or does it stick with the same 2 or 3 experts for 10 tokens in a row? I haven't been able to find much info on this.

Obviously it depends on what model you are talking about, so some kind of survey would be interesting. I'm sure this must be something that the big inference labs are knowledgeable about.

Although, I guess if you are batching things, then even if only a subset of experts is selected for a single query, over the batch the selection may appear completely random, which would destroy any efficiency gains. Perhaps it's possible to intelligently batch queries that are "similar" somehow? It's quite an interesting research problem when you think about it.
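As a rough illustration of that batching intuition, here's a quick simulation with made-up numbers (expert count, top-k, and batch size are all assumptions, not any specific model):

    import random

    NUM_EXPERTS = 128   # assumed expert count for one MoE layer
    TOP_K = 8           # experts activated per token (assumed)
    BATCH_SIZE = 64     # concurrent sequences each decoding one token (assumed)
    TRIALS = 1000

    total = 0
    for _ in range(TRIALS):
        hit = set()
        for _ in range(BATCH_SIZE):
            # pretend routing is roughly uniform-random across tokens
            hit.update(random.sample(range(NUM_EXPERTS), TOP_K))
        total += len(hit)

    print(f"avg distinct experts touched per layer: {total / TRIALS:.1f} of {NUM_EXPERTS}")

With numbers in this ballpark nearly every expert gets touched in every layer, so per-batch expert swapping buys very little once the batch is large.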

Come to think of it, how does it work then for the "prompt ingestion" stage, where it likely runs all experts in parallel to generate the KV cache? I guess that would destroy any efficiency gains due to MoE too, so the prompt ingestion and AR generation stages will have quite different execution profiles.


The model is explicitly trained to produce as uniform a distribution as possible, because it's designed for batched inference with a batch size much larger than the expert count. In that regime all experts are constantly activated and latency is determined by the highest-loaded expert, so you want to distribute the load evenly to maximize utilization.
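Concretely, this is usually encouraged with an auxiliary load-balancing loss during training. A minimal Switch-Transformer-style sketch (names and shapes are illustrative, not this particular model's code):

    import torch
    import torch.nn.functional as F

    def load_balancing_loss(router_logits, top1_idx, num_experts):
        # router_logits: [tokens, num_experts]; top1_idx: [tokens] chosen expert per token
        probs = F.softmax(router_logits, dim=-1)
        # f_i: fraction of tokens dispatched to expert i
        f = torch.bincount(top1_idx, minlength=num_experts).float() / top1_idx.numel()
        # P_i: mean router probability assigned to expert i
        P = probs.mean(dim=0)
        # minimized when both are uniform at 1/num_experts, i.e. perfectly balanced routing
        return num_experts * torch.sum(f * P)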

Prompt ingestion is still fairly similar to that setting, so you can first compute the expert routing for all tokens, load the first set of expert weights and process only those tokens that selected the first expert, then load the second expert and so on.
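In code, that prefill pattern looks roughly like this (a schematic sketch with top-1 routing and made-up names, not any particular engine's API):

    import torch

    def prefill_moe_ffn(hidden, router_weight, cpu_expert_weights, device="cuda"):
        # hidden: [tokens, d_model] on the GPU; router_weight: [d_model, num_experts]
        expert_idx = (hidden @ router_weight).argmax(dim=-1)   # route all prompt tokens up front
        out = torch.zeros_like(hidden)
        for e, w_cpu in enumerate(cpu_expert_weights):          # w_cpu: [d_model, d_model]
            mask = expert_idx == e
            if mask.any():
                w = w_cpu.to(device, non_blocking=True)         # load this expert's weights once
                out[mask] = hidden[mask] @ w                    # process only its tokens
        return out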

But if you want to optimize for single-stream token generation, you need a completely different model design. E.g. PowerInfer's SmallThinker moved expert routing to a previous layer, so that the expert weights can be prefetched asynchronously while another layer is still executing: https://arxiv.org/abs/2507.20984
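The prefetch idea is roughly the following (an illustrative sketch assuming pinned host memory, not SmallThinker's actual implementation):

    import torch

    copy_stream = torch.cuda.Stream()

    def prefetch_expert(w_cpu_pinned):
        # start the host-to-device copy on a side stream so it overlaps with compute
        with torch.cuda.stream(copy_stream):
            return w_cpu_pinned.to("cuda", non_blocking=True)

    # In the forward pass, very roughly:
    #   next_w = prefetch_expert(cpu_experts[next_expert_idx])  # routing known one layer early
    #   y = current_layer(x)                                    # runs while the copy is in flight
    #   torch.cuda.current_stream().wait_stream(copy_stream)    # sync before using next_w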


Thanks, really interesting to think about these trade-offs.

I don't know what the pros are doing, but I'd be a bit shocked if it isn't already done this way in real production systems. And it doesn't feel like porting the standard library is necessary for this; it's just some logic.

Raw CUDA works for the heavy lifting but I suspect it gets messy once you implement things like grammar constraints or beam search. You end up with complex state machines during inference and having standard library abstractions seems pretty important to keep that logic from becoming unmaintainable.
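For example, grammar-constrained decoding roughly means carrying a per-sequence state machine and masking the logits at every step. A minimal sketch (the `allowed_tokens`/`advance` interface is hypothetical, not a real library's API):

    import torch

    def constrained_step(logits, grammar_states):
        # logits: [batch, vocab]; each sequence carries its own grammar FSM state
        mask = torch.full_like(logits, float("-inf"))
        for i, state in enumerate(grammar_states):
            mask[i, list(state.allowed_tokens())] = 0.0   # only grammar-legal tokens survive
        next_ids = (logits + mask).argmax(dim=-1)         # greedy pick for simplicity
        for i, state in enumerate(grammar_states):
            state.advance(next_ids[i].item())             # advance each sequence's FSM
        return next_ids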

I was thinking mainly about the standard AR loop. Yes, I can see that grammars would make it a bit more complicated, especially when considering batching.

I've thought about saving my prompts along with project development and even done it by hand a few times, but eventually I realized I don't really get much value from doing so. Are there good reasons to do it?


For me it's increasingly the work. I spend more time in Claude Code going back and forth with the agent than I do in my text editor hacking on the code by hand. Those transcripts ARE the work I've been doing. I want to save them in the same way that I archive my notes and issues and other ephemera around my projects.

My latest attempt at this is https://github.com/simonw/claude-code-transcripts which produces output like this: https://gisthost.github.io/?c75bf4d827ea4ee3c325625d24c6cd86...


Right, I get that writing prompts is "the work", but if you run them again you don't get the same code. So what's the point of keeping them? They are not 'source code' in the same sense as a programming language.


That's why I want the transcript that shows the prompts AND the responses. The prompts alone have little value. The overall conversation shows me exactly what I did, what the agent did and the end result.


> shows me exactly what I did

I get that, but I guess what I'm asking is, why does it matter what you did?

The result is working, documented source code, which seems to me to be the important part. What value does keeping the prompt have?

I'm not trying to needle, I just don't see it.


It's like issues in that it helps me record why I had the agent solve problems in a particular way.

It's also great for improving my prompting skills over time - I can go back and see what worked.


It's not for you. It's so others can see how you arrived at the code that was generated. They can learn better prompting for themselves from it, and also see how you think. They can see which cases got considered, or not. All sorts of good stuff that would be helpful for reviewing giant PRs.


Sounds depressing. First you deal with massive PRs and now also these agent prompts. Soon enough there won't be any coding at all, it seems. Just doomscrolling through massive prompt files and diffs in hopes of understanding what is going on.


I suspect this future will not play out. Mitchell is definitely leaning to one side of this debate.

To me, quality code is quality code no matter how it was arrived at. That should be the end of it.


Using them for evals at a future date.

I save all of mine, including their environment, and plan to use them for iterating on my various system prompts and tool instructions.


If the AI generated most of the code based on these prompts, it's definitely valuable to review the prompts before even looking at the code. Especially in the case where contributions come from a wide range of devs at different experience levels.

At a minimum it will help you to be skeptical about specific parts of the diff so you can look at those more closely in your review. It can also inform test scenarios, etc.


Reminds me of ColdFusion. Don't recall having a great time using it, though I was very young at the time so maybe my memory is distorted on this.


I remember Cold Fusion quite well. You might have PTSD.


CF was the first thing I thought of when I read the title too.


Heh, at least this wouldn't spread emojis all over my readmes. Hm, come to think of it I wonder how much tokenization is affected.

Another thought that just occurred to me when thinking about readmes and coding LLMs: obviously this model wouldn't have any coding knowledge, but I wonder if it would be possible to combine it somehow with a modern LLM in such a way that it does have coding knowledge but renders out all the text in the style and knowledge level of the 1800s model.

Offhand I can't think of a non-fine-tuning trick that would achieve this. I'm thinking back to how the old style transfer models used to work, where they would swap layers between models to get different stylistic effects applied. I don't know if that's doable with an LLM.


Just have the models converse with each other?


Something I wonder is: what happened to asm.js? It got killed by WASM. In a way this is good, since WASM is a "better" solution, being a formal bytecode machine description. On the other hand, asm.js would not have the same limitations, e.g. with respect to DOM interaction or the debates on how to integrate garbage collection; since you stay squarely in the JS VM, you get these things for free.

Basically in some ways it was a superior idea: benefit from the optimizations we are already doing for JS, but define a subset that is a good compilation target and for which we know the JS VM already performs pretty optimally. So apart from defining the subset there is no extra work to do. On the other hand I'm sure there are JS limitations that you inherit. And probably your "binaries" are a bit larger than WASM. (But, I would guess, highly compressible.)

I guess the good news is that you can still use this approach. Just that no one does, because WASM stole the thunder. Again, not sure if this is a good or bad thing, but interesting to think about... for instance, whether we could have gotten to the current state much faster by just fully adopting asm.js instead of diverting resources into a new runtime.


Which only existed because Mozilla was against adopting PNaCl.


I find it really interesting that it uses a hybrid of Mamba and Transformer layers. Is it the only significant model right now using (at least partially) SSM layers? This must contribute to lower VRAM requirements, right? Does it impact how KV caching works?

