Hacker News | new | past | comments | ask | show | jobs | submit | zhangchen's comments

this tracks with what i've seen too. gemini tends to 'overthink' tool calls - it'll reason about whether to use a tool instead of just using it. in my experience the models that are best at agentic tasks are the ones that commit to a tool call quickly and recover from failures, not the ones that deliberate forever and sometimes bail. would be interesting to see if the benchmark captures retry behavior since that's where cost-effectiveness really diverges
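to make the retry point concrete, here's roughly the loop i mean: commit to the call, recover on failure (toy sketch, not from any real framework):

```python
def run_with_retries(call, max_attempts=3):
    """Commit to the tool call immediately; on failure, retry rather than
    re-deliberating. Each extra attempt is where the cost diverges."""
    last_err = None
    for _ in range(max_attempts):
        try:
            return call()
        except RuntimeError as err:
            last_err = err
    raise last_err

# a flaky "tool" that fails twice, then succeeds
attempts = {"n": 0}
def flaky_tool():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"
```

a benchmark that only scores the final result would call `flaky_tool` a success and never notice it cost three attempts.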

this lines up with what pruning papers have been finding, the middle layers carry most of the reasoning weight and you can often drop the outer ones without much loss. cool to see the inverse also works, just stacking them for extra passes.


Has anyone tried implementing something like System M's meta-control switching in practice? Curious how you'd handle the reward signal for deciding when to switch between observation and active exploration without it collapsing into one mode.


> Curious how you'd handle the reward signal for deciding when to switch between observation and active exploration without it collapsing into one mode.

If you like biomimetic approaches to computer science, there's evidence that we want something besides neural networks. Whether we call such secondary systems emotions, hormones, or whatnot doesn't really matter much if the dynamics are useful. It seems at least possible that studying alignment-related topics is going to get us closer than any perspective that's purely focused on learning. Coincidentally quanta is on some related topics today: https://www.quantamagazine.org/once-thought-to-support-neuro...


The question is: does this eventually lead us back to genetic programming, and can we adequately avoid the hardware-specific over-fitting problems that tended to crop up in the past?


Or possibly “in addition to”, yeah. I think this is where it needs to go. We can’t keep training HUGE neural networks every 3 months, throwing out all the work we did and the billions of dollars in gear and training just to use another model a few months later.

That loop is unsustainable. Active learning needs to be discovered / created.


if that's the argument for active learning, wouldn't it also apply in that case? it learns something and 5 minutes later my old prompts are useless.


I don't think old prompts would become useless. A few studies have shown that prompt crafting is important because LLMs often misidentify the user's intent. Presumably an AI that is learning continuously will simply get better at inferring intent, so any prompts that were effective before will continue to be effective; it will simply grow its ability to infer intent from a larger class of prompts.


That depends on the goals of the prompts you use with the LLM:

* as a glorified natural language processor (like I have done), you'll probably be fine, maybe

* as someone to communicate with, you'll also probably be fine

* as a *very* basic prompt-follower? Like, natural language processing-level of prompt "find me the important words", etc. Probably fine, or close enough.

* as a robust prompt system with complicated logic in each prompt? Yes, it will begin to fail catastrophically, especially if you need it to be repeatable.

I'm not sure that the general public is that interested in perfectly repeatable work, though. I think they're looking for consistent and improving work.


the Squelch primitive for mathematical forgetting is really interesting. most memory systems I've worked with treat forgetting as an afterthought, just TTL-based eviction or manual deletion. having it built into the algebra itself is a much cleaner approach for agents that need to update beliefs over time without accumulating stale context.
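for contrast, the TTL-as-afterthought approach i mean looks something like this (hypothetical store, nothing to do with the paper's algebra):

```python
import time

class TTLMemory:
    """Naive forgetting: each entry gets a deadline and is dropped on read.
    Forgetting lives outside the data model, bolted on as eviction."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def put(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # eviction happens as a side effect of access
            return None
        return value
```

nothing about the entry's meaning changes as it ages, it just vanishes at a cliff, which is exactly what makes belief updates awkward.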


Thinking about data in terms of energy basins, frequencies, and SNR is different than what I’ve done before.

But pushing a signal below the noise floor is analogous to tombstoning a tuple in a database.
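the analogy maps pretty directly; a toy sketch (not any real database, and "squelch" here is just my name for the attenuation step):

```python
NOISE_FLOOR = 0.01

class SignalStore:
    """'Delete' by attenuation: a value pushed below the noise floor reads
    as absent, like a tombstoned tuple that still physically exists."""
    def __init__(self):
        self._amplitudes = {}  # key -> amplitude

    def write(self, key, amplitude):
        self._amplitudes[key] = amplitude

    def squelch(self, key):
        # attenuate instead of deleting; the tuple is still there
        self._amplitudes[key] = NOISE_FLOOR / 2

    def read(self, key):
        amp = self._amplitudes.get(key, 0.0)
        return amp if amp >= NOISE_FLOOR else None  # below floor == tombstoned
```

like a tombstone, the record survives until some later compaction pass, but every read treats it as gone.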


this is way too broad, RAG works fine in the 10k-1M doc range if your chunking and retrieval pipeline are tuned properly. the failure mode is usually bad embeddings or naive chunking, not RAG itself.
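even the chunking step alone has knobs that decide success; a bare-bones overlapping chunker (sizes are illustrative, real pipelines split on semantic boundaries rather than raw characters):

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping character windows so a fact straddling
    a boundary still lands wholly inside at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

get the overlap wrong and retrieval looks broken even when the embeddings are fine, which is the failure mode people blame on "RAG".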


the steerability point is interesting. have you tried using task-specific prompts for cross-modal retrieval though? like searching images with text queries. curious whether qwen's prompt-based steering actually helps there or if it mainly improves same-modality tasks. the 3072-dim space seems tight for encoding all those modalities well.


Does well in my tests, limited as they were. It did well in zero-shot tasks in niche domains historically (and possibly still) underrepresented in training data (microscopy).


fwiw the merge rate metric itself might be misleading. most real codebases have implicit conventions and architectural patterns that aren't captured in the issue description, so even if the model writes correct code it might not match what the maintainer actually wanted. imo the bigger signal is how much back-and-forth it takes before merging, not whether the first attempt lands cleanly.


Yeah this matches what we've seen too. The biggest gains we got weren't from switching models, it was from investing in better context, giving the agent a well structured spec, relevant code samples from the repo, and explicit constraints upfront. Without that, even the best models will happily produce working but unmaintainable code. Feels like the whole SWE-bench framing misses this, passing tests is the easy part, fitting into an existing codebase's patterns and conventions is where it actually gets hard.


certainty scoring sounds useful but fwiw the harder problem is temporal - a fact that was true yesterday might be wrong today, and your agent has no way to know which version to trust without some kind of causal ordering on the writes.


You're right, and it's the part that keeps me up. We handle it with versioned writes — each memory has a createdAt, observedAt, and a validUntil that can be set explicitly or inferred from context. Temporal scope gets embedded as metadata: "as of last session" vs "persistent fact."

Causal ordering is harder. Right now we surface both conflicting versions during retrieval with timestamps and let the agent reason about which is authoritative. It's not a complete solution — the agent can still pick wrong without the right reasoning context.

What you're describing is architecturally the right answer. We haven't built proper write-ordering yet. That's probably where the next cycle goes.
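Roughly, a version record and retrieval look like this (heavily simplified sketch, not our actual code):

```python
from dataclasses import dataclass
from typing import Optional
import time

@dataclass
class MemoryVersion:
    value: str
    created_at: float
    observed_at: float
    valid_until: Optional[float] = None  # None = no explicit expiry

class VersionedMemory:
    """Append-only versions per key. Retrieval filters expired versions but
    surfaces every remaining conflict, newest observation first, so the
    agent decides which is authoritative."""
    def __init__(self):
        self._versions = {}  # key -> list[MemoryVersion]

    def write(self, key, value, observed_at=None, valid_until=None):
        now = time.time()
        self._versions.setdefault(key, []).append(
            MemoryVersion(value, created_at=now,
                          observed_at=now if observed_at is None else observed_at,
                          valid_until=valid_until))

    def retrieve(self, key, now=None):
        now = time.time() if now is None else now
        valid = [v for v in self._versions.get(key, [])
                 if v.valid_until is None or v.valid_until > now]
        return sorted(valid, key=lambda v: v.observed_at, reverse=True)
```

The gap is visible right in `retrieve`: sorting by `observed_at` is a heuristic, not a causal order over the writes.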


that's already happening tbh. the real issue isn't hypocrisy though, it's that maintainers reviewing their own LLM output have full context on what they asked for and can verify it against their mental model of the codebase. a random contributor's LLM output is basically unverifiable, you don't know what prompt produced it or whether the person even understood the code they're submitting.


How is that different than before LLMs? You have no idea how the person came up with it, or whether they really understood.

We are inventing problems here. Fact is, an LLM writes better code than 95% of developers out there today. Yes, yes, this is Lake Wobegon, everyone here is in the 1%. But for the world at large, I bet code quality goes up.


It's a lot harder for someone who has no clue what they're doing to write a lot of plausible-but-wrong code.

