
Ah yes, the good old "shot in portrait mode, converted to 16:9 with added black bars, and then displayed under YT shorts in portrait mode again" category on YouTube. This is almost artistic at this point. Sometimes I wonder how small the content of a video can get before people stop watching it. Is there any research on this?

Gradient descent is mathematically the most efficient optimization strategy (save for some special functions) in high dimensions. This goes so far that people nowadays even believe it has to be used in the human brain [1], if only because every other method of updating the brain would be way too energy inefficient. From that perspective, finding the right parameterization was all we ever needed to achieve AI.
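For reference, the update rule in question is just

    \theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)

where \eta is a step size and \nabla L(\theta_t) is the gradient of the loss at the current parameters.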

[1] https://physoc.onlinelibrary.wiley.com/doi/full/10.1113/JP28...


Even in supervised ML, pure gradient descent is not the most efficient optimization strategy. E.g., momentum is ubiquitous, and the updates it induces cannot be expressed as a gradient of some scalar loss. But the rotational non-gradient component of its updates substantially improves performance and convergence on the architectures we use.
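A toy NumPy sketch of the difference (`grad` stands in for any gradient oracle):

    import numpy as np

    def gd_step(theta, grad, lr=0.1):
        # plain gradient descent: the update IS the gradient
        return theta - lr * grad(theta)

    def momentum_step(theta, v, grad, lr=0.1, beta=0.9):
        # heavy-ball momentum: the update is an exponential average of past
        # gradients and, in general, not the gradient of any scalar loss
        v = beta * v + grad(theta)
        return theta - lr * v, v

    theta, v = np.ones(3), np.zeros(3)
    grad = lambda t: 2 * t            # gradient of the toy loss ||t||^2
    theta, v = momentum_step(theta, v, grad)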

The brain probably primarily uses something like TD for task learning, which is also not expressible as a gradient of any objective function. And though the paper mentions Hebbian learning, it's only for very particular network architectures (e.g. a single neuron, or symmetric connections) that you can treat its updates as the gradient of some energy function; those architectures aren't anything close to what we see in the brain.
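For illustration, the tabular TD(0) update looks like this (a sketch of the algorithm, not a claim about any specific biological circuit):

    def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
        # move V(s) toward the bootstrapped target r + gamma * V(s');
        # the target depends on V itself, so this is not the gradient
        # of any fixed objective function
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        return V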


Pure gradient descent is not what happens in either field, but e.g. momentum is just another quantity constructed from past gradients. While it is unlikely that the brain runs backpropagation the way you see it implemented in modern ML (same goes for TD, btw), the core principle kind of needs to be the same from a pure large-scale, high-dimensional network efficiency POV. On top of that, adaptive plasticity is almost by definition about estimating useful directions of change. The key insight here would be that the brain does gradient estimation quite cheaply, and modern ML can probably still learn a thing or two from it.

Taking a quick look at the paper...

Their claim isn't that the brain uses gradient descent, but that the direction of updates has (on average) positive inner product with the gradient. I expect this would also be true for (say) simulated annealing, yet we don't say that simulated annealing is gradient descent.

Also missing is a discussion of loss functions and how they relate to the updates - as far as I know, there's still no great notion of how the brain picks a global loss function, and no mechanism for backprop. In this paper, for a specific learning task you can define a loss function extrinsically, allowing us to talk about the gradient, but how that relates to things happening in the brain is a big, big mystery.


Why would this be true for simulated annealing?

Because it improves the loss!

The (negative) gradient is the direction in which the loss improves fastest. Moving in a direction with a positive dot product with that descent direction just means that you're (locally) improving the loss.
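To first order this is just a Taylor expansion: for a small step \epsilon d,

    L(\theta + \epsilon d) \approx L(\theta) + \epsilon \langle \nabla L(\theta), d \rangle

so whenever \langle d, -\nabla L(\theta) \rangle > 0 the loss goes down, no matter whether d came from gradient descent, simulated annealing, or anything else.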


Hmm, I'm not sure what you mean by "Gradient descent is mathematically the most efficient optimization strategy". Do you mean gradient-based optimization in general? (In other words, do you consider Adam gradient descent?)

Gemma certainly was trained for tool calling, but the implementation in llama.cpp has been troubled because Gemma uses a different chat template format. The processor from the transformers library works fine though.

Oh I must've missed this.

The AI space moves so fast! I'll check it out again.


Don't forget to update the gguf you have, too. The templates in them were updated recently as well.

The simple answer is: because it is not necessary to achieve the same final output. Most LLMs today are trained as autoregressive token predictors. They fundamentally can't work any other way. But we know how to train them really well and they have many applications beyond editing text. Diffusion LLMs exist too, which work a bit closer to what you describe, but they are not yet at the same level of intelligence since training methods are not that mature and they are generally less flexible as well.
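For intuition, the whole generation process is a loop like this (a toy sketch; `fake_logits` is a hypothetical stand-in for a real model):

    import numpy as np

    def fake_logits(tokens, vocab=100):
        # stand-in model: the output distribution depends on the full prefix
        rng = np.random.default_rng(sum(tokens))
        return rng.random(vocab)

    def generate(tokens, n):
        for _ in range(n):
            # one forward pass per new token, always appending at the end -
            # there is no way to natively "edit" earlier positions
            tokens.append(int(fake_logits(tokens).argmax()))
        return tokens

    print(generate([1, 2, 3], 5))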

So predict the tokens of the operational transformation.

I just asked: Write the operational transformation sequence and command to turn “this is really beautiful” to “this is very very beautiful”

and in return got: You can map this out by moving a virtual cursor across the text and telling it what to keep, remove, or add. You start by retaining the first eight characters to keep "this is " untouched. Then you delete the next six characters to remove the word "really". In that exact spot, you insert the nine characters for "very very". You finish the operation by retaining the final ten characters, which preserves the space and the word "beautiful". You can code this specific command sequence as [retain(8), delete(6), insert("very very"), retain(10)].

In a large paragraph of text I would expect it to be way quicker and cheaper to generate “[retain(800), delete(6), insert("very very"), retain(10000)]” than repredict the entire remainder of the unedited text.
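Applying such a sequence is mechanical. A minimal sketch (the tuple encoding of `retain`/`delete`/`insert` here is mine, not any particular OT library's API):

    def apply_ops(text, ops):
        out, pos = [], 0
        for kind, arg in ops:
            if kind == "retain":      # copy `arg` characters unchanged
                out.append(text[pos:pos + arg])
                pos += arg
            elif kind == "delete":    # skip `arg` characters
                pos += arg
            elif kind == "insert":    # splice in new text
                out.append(arg)
        return "".join(out)

    ops = [("retain", 8), ("delete", 6), ("insert", "very very"), ("retain", 10)]
    assert apply_ops("this is really beautiful", ops) == "this is very very beautiful"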


Sounds easy, but isn't in practice. You can look at the edit-text-file tool in VS Code Copilot, for example, to see how complicated that can get: https://github.com/microsoft/vscode-copilot-chat/tree/9e668c...

I have no idea when I'm being lied to anymore, but allegedly Aider and Cursor work the way I described, although Cursor is using a second model to apply the edit.

They all do something similar under the hood. Patching files is not a trivial task when you only have the changed text content and not the actual file structure to work with. It kind of works, but is fundamentally limited by the LLM output architecture.

Cursor has a dedicated merge model. It takes input like this:

    class Foo {
        // ....
        int calculation() {
            return 42;
        }
    
        // more stuff
    }
where the main model emits something that is a sort of casual, under-specified diff format and the merge model figures out how to interpret it as a patch.

One upside to this is that it doesn't use Gemma and instead uses Gemini. So at least for Gemini Nano (apparently called XS internally by Google) it means that the weights are now de facto open and you no longer need a current Android phone to get the latest and best model in this class. This also makes it the only open American frontier-level model right now.

Can you provide any sources for that? I'd like to learn more about this open frontier model.

Sources for what? The Pareto frontier of LLMs? How Google is pretty much on the line with most of their LLM products? Or this particular model? For the first two you need to look for size/cost vs. accuracy charts; there are tons of them floating around. For the latter there is not much official info except what you can infer by analyzing the weights.bin file that Chrome downloads. But it does mention Gemini in there, so it seems pretty obvious that it is from their proprietary line of models.

Just because it's called Gemini doesn't mean that it's automatically at the frontier of small models as well, does it?

All Gemini models sit around the frontier, especially as you go to smaller sizes. Google is actually more invested in efficiency than in size, unlike some of the other big providers.

Do you have any benchmark details on the on-device Gemini models? I haven't found a lot of public information on these.

Sources for your claim that the model being downloaded to Android/Chrome is Gemini instead of Gemma. Other than downloading the bin file myself and analyzing it lol.

How about Google itself?

https://developer.chrome.com/docs/ai/prompt-api

>With the Prompt API, you can send natural language requests to Gemini Nano in the browser.


Thanks. Looks like the current Gemini Nano is actually a separate model with the Gemma 3n architecture that has been distilled from Gemini 2.5 Flash[1].

Also, the next version of Gemini Nano will be based directly on Gemma 4 (so not distilled, not Gemini at all except for the name)[2].

So no, it's not a frontier model. Those don't run on your phone or in your browser.

[1]: https://developer.android.com/blog/posts/ml-kit-s-prompt-api...

[2]: https://android-developers.googleblog.com/2026/04/AI-Core-De...


Oh, now I see your problem. You confused the Pareto frontier with the pure scale frontier. They are very much not the same.

Also, distillation is how most of these smaller models are made from the biggest models. That process largely defines the frontier along most of the curve.


> This also makes it the only open American frontier-level model right now.

I'm not going to keep arguing with you. If you want to keep arguing, go to https://gemini.google.com/. Gemini knows what a frontier model is and it knows that Gemini Nano is fundamentally different from the other Gemini models. For one, it uses the Gemma architecture. And the next version of Gemini Nano is built directly on Gemma 4.

As for your original claim that I quoted, there are other "open American frontier-level models" by your definition. Like Gemma 4.


I'm surprised how you try to evade the facts and even bring in Gemini in a vain attempt to support your argument, when a simple Google search would have already pointed you at things like this:

https://cloud.google.com/blog/topics/developers-practitioner...

or this:

https://arena.ai/leaderboard/text/overall/pareto

The lines in these charts are literally what the frontier is on a technical level. None of what I said has any ambiguous terminology; this is common language in the field. Neither is it ambiguous that Google cares a lot about this. I don't see why you still feel the need to argue about any of this.


Do you think this will not be part of some google product? On top of their normal agenda, this seems perfectly suited for them to push their AI models. So if you use anything from Google via Chrome, I would expect that this will end up on your device sooner or later.

That's just for the cash part. The stock part makes no sense. For this 50/50 deal to work in principle, they'd need to issue around a billion new shares, which would massively dilute the existing ~450M shares. So eBay shareholders would suddenly own ~70% of GameStop after the deal. It's also highly questionable whether investors actually believe the combined stock is worth that much, so the stock price would probably fall and turn those 70% into >90%. At this point it basically becomes a reverse acquisition, plus a large loan for the final company from the cash part of the deal.
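The back-of-the-envelope dilution math, using the rough numbers above (illustrative, not actual filings):

    existing = 450e6      # approximate GameStop shares outstanding
    new_issue = 1000e6    # new shares issued to eBay holders for the stock half
    print(new_issue / (existing + new_issue))   # ~0.69, i.e. ~70% of the combined company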

This is not atypical; smaller company “buys” the larger company with debt on the larger company’s books. The blended shareholder mix is mostly the larger company; management comes from the smaller company.

The one I was most familiar with was the Discovery “acquisition” of Warner Brothers. Though apparently that’s a little complicated because AT&T was divesting itself of Warner.


I wouldn't be too sure about that. The original decompilations of Mario 64 and Ocarina of Time were done mostly by hand because LLMs weren't really around yet, but these kinds of projects seem perfectly suited for handing the gritty work off to AI: There is a clear output (exact binary recreation) and a straightforward path to get there (look at this assembly code and produce some C code from it). The decompilation of Twilight Princess jumped from very little to basically 100% of core code in the past year alone: https://github.com/zeldaret/tp

I have no doubt that this would be possible for MGS2 as well.


I don't think it's impossible, but it would take a lot of time and a lot of money; likely more time than good-enough models have even been commercially available.

I have been working on an incremental decompilation-based reimplementation (basically how OpenRCT2 was done) of Worms Armageddon for the past 2 months with a lot of help from LLM tools, primarily Claude Code and Ghidra MCP. I've worked on it almost every day, hitting Claude Code Max 5x's 5-hour session limit multiple times every day. Suffice it to say, as a software-rendered, sprite-based 90s PC game, Worms Armageddon is several orders of magnitude simpler than MGS2. Despite that, I think it will be 2-3 more months of work before I can compile a fully independent version of the game.

This is despite the game being an almost ideal candidate for automated RE, as it uses deterministic game logic with built-in checksum checks in replays and multiplayer. I've downloaded all the speedruns I could find for the game (as replay files) and I've retrofitted the replay system into a massively parallel test framework, which simulates over 600 games in about 30 seconds. So Claude can port all game logic independently without much need for manual testing; the replay tests can almost guarantee perfect correctness.
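Schematically, the harness boils down to something like this (a hedged sketch; the toy checksum "simulation" and replay format are hypothetical stand-ins for the real game logic):

    from concurrent.futures import ProcessPoolExecutor

    def simulate(inputs):
        # deterministic game logic: identical inputs must yield
        # identical per-frame checksums
        checksum, checksums = 0, []
        for frame in inputs:
            checksum = (checksum * 31 + frame) % (1 << 32)
            checksums.append(checksum)
        return checksums

    def verify(replay):
        inputs, expected = replay
        return simulate(inputs) == expected

    if __name__ == "__main__":
        replays = [(list(range(100)), simulate(list(range(100))))] * 600
        with ProcessPoolExecutor() as pool:
            passed = sum(pool.map(verify, replays))
        print(f"{passed}/{len(replays)} replays bit-exact")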

MGS2 doesn't have anything like that, so every ported function requires extensive manual testing. Even with LLM tools, an accurate decomp could take years (unless you're willing to spend thousands of $currency per month on it).


This is really cool! Your process is compelling, and your choice of game is excellent. I'd like to read a long blog post about your entire journey from the beginning to a working binary once you get there.

For those wondering, there is a public Git repository at https://github.com/paavohuhtala/OpenWA.


As it happens I do have the habit of writing very long blog posts - though none on OpenWA so far. The OpenWA readme file serves as a bit of an introduction, though it's already a month old.

Keep your eyes open for Sonic R too. Sadly, a lot of the online Sonic community has been toxic to the dev for being transparent about using Claude for the majority of the disassembly, even though he's a very talented developer with lots of credits to his name, and the project only took a few weeks compared to the year+ it would have taken fully manually.

Having followed his bsky during his announcement, he started off pre-emptively dissing his haters that... didn't even exist yet. He was constantly posting memes about how everyone was dissing him and how AI was totally superior (and then posting his angry sessions with Claude when it got something wrong), when most other users were just saying "that's cool man". The thing that made him quit bsky was a (now-deleted) thread someone posted criticizing the weird crash-outs. I think if he had been more... normal about the whole thing, people would have received the project quite a bit more positively.

Decompilation to C (and even C++!) has been done automatically for 2-3 decades at least. I am not sure what has changed in recent years other than people playing fast and loose with copyright (and GitHub allowing it, likely because their LLMs also stand to benefit). Introducing LLMs here is only going to introduce errors, delays and likely push you away from a reliable result.

The challenge here is readability. Reading the TP source leak you link, I think it's even behind the current state of the art, as it's barely above assembly. This is where I suspect even the smallest of LLMs may help, since you don't care that much if it introduces errors.


>Decompilation to C (and even C++!) has been done automatically for 2-3 decades at least.

Only in a very rudimentary sense and definitely not in a working compilation (much less binary equivalent) sense. LLMs have turned this from a gimmick for static analysis into something that actually works pretty well for recompilation projects.


> Only in a very rudimentary sense and definitely not in a working compilation (much less binary equivalent) sense.

Working is the easy part; the hard part is getting something that qualifies as readable C. LLMs do not really help reach the "working compilation" part, but they benefit from it.


We are way past "working compilation" when it comes to LLMs. They are already really good at writing readable, compilable code. The big problem with LLMs is making sure the output binary actually does what you wanted it to do. But if you define the goal not merely as instructions in a vague, unspecific human language but rather as recreating a given set of binary instructions after compilation, this big drawback goes away. So in a sense they are better suited for recompilation projects than for developing new applications.

My point is that we were past "working compilation" way before LLMs, and I do not think anything in LLMs helps with it; at best, agents use these tools with the same efficiency. I disagree that they're good at writing compilable code, but agree on the readable part.

Which decompiler reliably produced working, high-level C/C++ from assembly? I would have loved to use this thing you are describing 15 years ago. Compilation is inherently lossy, so any system that could have given you this would have needed pretty heavy LLM-like features anyway.

>I disagree that they're good at writing compilable code

That was never part of the discussion, because as explained several times now it is irrelevant in this case. The existence of the original binary means all you need to do is match things up, which can be automated completely.


I do not understand what is so hard about "generating working code". Even the free version of Hex-Rays was doing it 15 years ago, and I have written one at my company that I have used for over 30 years. It's actually... trivial?

The problem is readability. No one in his right mind would call what they generate "C++". Mine still interjects assembler from time to time (and not the new version that GCC supports, but the older MSVC style).

LLMs absolutely do not help with the "generate working code" part, because this is an exact problem that doesn't need nor benefit from an LLM (other than maybe automating stupid iteration?). They can help with the readability part, because once you already have a working skeleton it doesn't matter that much if they make mistakes, as those are easy to detect.


I already asked, but I guess I'll need to ask again: please show me this tool. Hex-Rays is certainly the wrong answer, because the decompiled C code usually needs tons of manual cleanup, fixing datatypes and reconstructing function prototypes, before you can compile it. And even then you can't be sure about functional (much less binary) equivalence. If anything, all these traditional decompilers focused on readability, not recompilability. But even there they were much worse than LLMs.

If what you said was true, the projects mentioned above wouldn't have needed years of arduous work before the age of LLMs came to be.


I get the point, but note that (custom) datatypes and function prototypes are for readability. They are not required for working or functionally-equivalent code.

My take was more along the lines of: it wouldn't be convincing enough; if anything, it would be too clean and perfect.

Does the TP decomp use AI to achieve its speed?

>A humanoid robot takes roughly 5,000 steps per hour. Each step sends a shock of 2–3× body weight through the leg actuators—forces that would be fine occasionally, but become destructive when repeated thousands of times without pause.

As someone who comes from the world of running and knee problems, I feel this misses the issue. Normal walking should not produce these kinds of shocks unless your gait is really jumpy or otherwise screwed up. You only start to see these forces when running, and that's where technique becomes important even for humans if you want to prevent damage to your joints over long distances. But at least for walking, I suppose that for a fully articulated humanoid with all the degrees of freedom of human gait, this should be mostly a control problem, not a mechanical engineering one.


The force an impulse generates on contact depends on the speed of deceleration. It's just F = m*a.

Slow deceleration leads to low forces. If you have a contact event with a hard substance, like the rigid metal used for accurate kinematics, the deceleration to zero upon contact has to happen almost instantly. That means the deceleration is incredibly high, resulting in extremely high forces for a few milliseconds.

Human bodies are made out of a flexible and impact resistant material: water. When a contact event happens, your body deforms, which means that the deceleration happens over a longer time frame with less force. Not just that, your muscles also have a certain amount of flexibility in them and basically zero internal inertia. All the inertia is in the limb as a whole, whereas for a robot there is a spinning motor and gearbox that needs to slow down as well.
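A rough impulse-momentum sanity check, F = m * Δv / Δt, with illustrative numbers (not figures from the article):

    m, dv = 10.0, 1.0             # a 10 kg limb decelerating from 1 m/s
    for dt in (0.001, 0.05):      # rigid contact vs. compliant tissue
        print(f"dt = {dt * 1000:4.0f} ms -> F = {m * dv / dt:6.0f} N")
    # 1 ms rigid stop -> 10000 N; 50 ms soft stop -> 200 N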

You could solve this as a control problem by adding series elastic actuators, which means you need to change your mechanical design.


The human body goes further than that too. When you're out jogging, as your foot approaches the ground for a stride, you slow your foot's downward velocity so there's less of a sudden deceleration.

Imagine when you throw a tennis ball high in the sky and then catch it on your racket without bouncing by matching its velocity - your feet do the same thing with the ground on a smaller scale.


Then you have several hinges absorbing/dissipating that energy if you're using good form: the foot flexes with a pivot in the arch, the calf/achilles stretches with a pivot in the ankle, and the quad with a pivot in the knee. It should look like an angled, backwards Z at strike, with nothing just straightened out and tanking the impact.

Nobody actually runs perfectly enough to take 100% of the impact out of your joints but good form routes as much as possible into the muscles/ligaments around the joints instead of straight through them. It's a lot of little bitty unconscious nerve endings and muscles so one could expect it will take a while to iron out for robots.

Thinking about it more, maybe the issue here is that there's no self-healing stretchy ligaments involved in robots to begin with, even before the control issue.


That's missing the point. Try jump-walking with your legs locked straight on impact. You'll feel the pressure on your joints pretty quickly. Now try walking normally (i.e. hitting your heels with your legs extended while your center of mass barely moves vertically). The velocity your body accumulates under gravity will be way smaller, and so will be the deceleration force your joints have to produce to compensate.

It is lossy, but it is still enough for verbatim recreations. All of Wikipedia is just 24GB of losslessly compressed text, and all of JK Rowling's work fits into a few MB. So these things would easily be storable verbatim in trillion-parameter models. Reasoning about the training cutoff is also something that the newest models do pretty well, because you can teach them to do so after pre-training using e.g. SFT. With tool use they can then even check actual current sources, which may happen without you even knowing in the normal chat apps, unless you use a controlled API call.
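A quick capacity check on that claim (assuming ~2 bytes per weight; rough numbers from the comment above):

    params = 1e12                         # a trillion-parameter model
    model_bytes = params * 2              # ~2 bytes per weight at fp16/bf16
    wikipedia_bytes = 24e9                # compressed text of all of Wikipedia
    print(model_bytes / wikipedia_bytes)  # ~83x, plenty of headroom in principle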
