n7g's comments

Recursive models like TRM/CTM/UT have created a lot of buzz lately, but they're rarely used outside of static, toy domains - especially in language.

In 2018, we saw "Universal Transformers" try this. However, follow-up work revealed that simple RLMs (recursive LMs) don't yield substantial performance gains w.r.t. FLOPs spent.

In this work, we argue that with a few rather simple tricks, one can unlock large performance gains and make RLMs outperform iso-parameter and iso-FLOP baselines.
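To make the core idea concrete - this is just a toy sketch of weight-tied recursion in general, not our architecture - the point is that a single block's parameters are reused across depth:

```python
# Toy sketch (not any specific paper's architecture): a "recursive" model
# applies the SAME parameterized block T times, so effective depth grows
# without adding parameters, unlike a plain stack of distinct layers.

def make_block(w, b):
    """One shared block with two scalar params: ReLU(w*x + b)."""
    return lambda x: max(0.0, w * x + b)

def recursive_forward(x, block, steps):
    """Weight-tied recurrence: reuse one block for `steps` iterations."""
    for _ in range(steps):
        x = block(x)
    return x

block = make_block(0.5, 1.0)
# Starting from 0.0, iterates 1.0 -> 1.5 -> 1.75 -> 1.875,
# refining towards the fixed point x = 0.5*x + 1  =>  x = 2.
out = recursive_forward(0.0, block, steps=4)
```

With untied layers, 4 steps would need 4 separate (w, b) pairs; the recursive version keeps a single pair no matter how many refinement steps you run.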


I haven't used TinyGrad, but I'm not really sure what its goal is. To be the best autograd framework? To be a minimal one?

I'm glad they've removed the (rather arbitrary, and admittedly stupid) LOC cap. And from the little I know, geohot is focused on TinyGrad having its own internal compiler stack.

As much as I admire geohot, I don't think rolling your own compiler is the best way. It's not that the TinyGrad team isn't smart enough, but a compiler is a huge undertaking that you have to support and maintain for a long time. I'm sure he's well aware of this, but no big lab would touch TinyGrad seriously because of this limitation.

XLA, on the other hand, is under governance separate from Google, and is far more mature - so people trust it.

That said, I don't know much about TinyGrad, so I would appreciate it if someone more knowledgeable could jump in here and outline the differences and key features ¯\_(ツ)_/¯


Hey Patrick, love your work!

I think the biggest, well "con" I've seen is non-technical - the fear of JAX being killed by Google.

I mention in the blog as well [here](https://neel04.github.io/my-website/blog/pytorch_rant/#gover...) how important having an independent governance structure is. I'm sure for many big companies and labs, the lack of a promise of long-term, stable support is a huge dealbreaker.

I'm not sure how much Google bureaucracy would limit this, but have you raised the subject of forming an independent entity to govern JAX, much like PyTorch's? I believe XLA is protected, as it's under TF's governance. But perhaps there could be one for JAX's ecosystem as well, encompassing optax, equinox, flax, etc.


I can personally say, I am not super concerned about it being killed. Google supported TF1 for quite a long time and all these projects have a shelf life.

What concerned me about JAX, at a small company, is that it doesn't benefit from the network effects of almost everyone developing for it. E.g. There is no Llama 3.1 implementation in JAX afaict.

So as long as there is a need to pull from the rest of the world the ecosystem will trump the framework.

Activity in the LLM space is slowing down though, so there is an opportunity to take the small set of what worked and port it to JAX and show people how good that world is.


> with torch.compile the main advantage of jax is disappearing.

Interesting take - I agree here somewhat.

But also, wouldn't you think a framework that has been designed from the ground up around a specific, mature compiler stack would be better able to integrate compilers in a stable fashion, compared to shoe-horning static compilers into a very dynamic framework? ;)


Depends. PyTorch, on the other hand, has a large user base and a well-defined, well-tested API. So it should be doable - and it's already progressing at rapid speed.


So the answer is not JAX?

Because JAX is not designed around a mature compiler stack. The history of JAX is more that it matured alongside the compiler...


> In another timeline AI would have made Lua popular.

I wonder if it'd have been hated more than Python is - especially with the 1-based indexing...
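(A toy illustration of the trap, sketched in Python since the Lua semantics are just simulated here - `lua_get` is a made-up helper, not a real API:)

```python
# Hypothetical sketch: simulating Lua-style 1-based indexing in Python,
# to show the classic off-by-one trap when porting between conventions.

def lua_get(seq, i):
    """Lua-style access: valid indices run from 1 to len(seq)."""
    if not 1 <= i <= len(seq):
        raise IndexError(f"index {i} out of range for 1-based sequence")
    return seq[i - 1]  # shift to Python's 0-based storage

xs = ["a", "b", "c"]
lua_get(xs, 1)  # "a" -- the FIRST element, not the second
lua_get(xs, 3)  # "c" -- len(xs) itself is a valid index in 1-based land
```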


Scientific computing tends to be 1-based. Thus R, Julia, Fortran, Matlab.


Python isn't hated, AFAICT. People will profess to hating building large projects in it (myself included), but many of those same people also love it for shorter programs and scripts.


Everything is hated.

Python has always gotten hate for being super, super slow and for its ugly syntax (subjective of course, but I happen to agree).


Hey, thanks for actually engaging with the blog's points instead of "Google kills everything it touches" :)

1. I'm well aware of the PyTorch stack, but this point:

> PyTorch is building towards a multi-backend future isn't really where things are going

> PyTorch supports extensibility of backends (including XLA)

Is my problem. Those backends just never integrate well, as I mentioned in the blogpost. I'm not sure if you've ever gone into the weeds, but there are so many (often undocumented) sharp edges when using different backends that they never really work well - for example, how bad Torch/XLA is, and the nightmare-inducing bugs and errors that come with it.

> torch.compile is 2 years old, XLA is 7 years old. Compilers take a few years to mature

That was one of my major points - I don't think leaning on torch.compile is the best idea. A compiler would inherently place restrictions that you have to work around.

This is neither dynamic nor flexible - and it flies in the face of torch's core philosophies, just so they can offer more performance to the big labs using PyTorch. For various reasons, I dislike pandering to the rich guy instead of acting as an independent, open-source entity.

2. Torch/XLA is indeed primarily meant for TPUs - as in the quoted announcement, where they declare they're ditching TF:XLA in favour of OpenXLA. But there's still a very real effort to get it working on GPUs - in fact, a lab on Twitter declared that they're using Torch/XLA on GPUs and will soon™ release details.

XLA's GPU support is great: it's compatible across different hardware, and it's optimized and mature. In short, it's a great alternative to the often-buggy torch.compile stack - if you fix the torch integration.

So I won't be surprised if, in the long term, they lean on XLA. Whether that's a good direction or not is up to the devs to decide, unfortunately - not the community.

3. Thank you for pointing that out. I'm not sure about the history of JAX (it might make for a good blogpost for the JAX devs to write someday), but it seems that it was indeed developed at Google Research, though also heavily supported and maintained by DeepMind.

Appreciate you giving the time to comment here though :)


If you're the author, unfortunately I have to say that the blog is not well-written -- it's misinformed about some of its claims and has a repugnant, click-baity title. You're getting the attention and clicks, but probably losing a lot of trust among people. I didn't engage out of choice, but out of a duty to respond to FUD.

> > torch.compile is 2 years old, XLA is 7 years old. Compilers take a few years to mature

> That was one of my major points - I don't think leaning on torch.compile is the best idea. A compiler would inherently place restrictions that you have to work around.

There are plenty of compilers that place restrictions that you barely notice. gcc, clang, nvcc -- they're fairly flexible, and "dynamic". Adding constraints doesn't mean you have to give up on important flexibility.

> This is neither dynamic nor flexible - and it flies in the face of torch's core philosophies, just so they can offer more performance to the big labs using PyTorch. For various reasons, I dislike pandering to the rich guy instead of acting as an independent, open-source entity.

I think this is an assumption you've made largely without evidence. I'm not entirely sure what your point is. The way torch.compile's success is measured publicly (even in the announcement blogpost and conference keynote: https://pytorch.org/get-started/pytorch-2.0/ ) is by benchmarking a bunch of popular PyTorch-based GitHub repos in the wild, plus popular HuggingFace models and the TIMM vision benchmark. They're curated here: https://github.com/pytorch/benchmark . Your claim that it's mainly there to favor large labs is pretty puzzling.

torch.compile is both dynamic and flexible because: 1. it supports dynamic shapes, and 2. it allows incremental compilation (you don't need to compile the parts that you wish to keep in uncompilable Python -- parts using random arbitrary Python packages, etc.). There is a trade-off between dynamic/flexible and performance, i.e. more dynamic and flexible means we don't have enough information to extract better performance, but that's an acceptable trade-off when you need the flexibility to express your ideas more than you need the speed.
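As a rough mental model of the guard machinery - my own toy sketch, not torch.compile's actual implementation - a tracing JIT specializes per input shape and recompiles only on a guard miss:

```python
# Illustrative sketch only (NOT torch.compile internals): a tracing JIT
# typically specializes a function per input "shape" and installs a guard;
# a new shape triggers a recompile, while repeat shapes hit a cache.

compiled_cache = {}  # shape -> "compiled" artifact
compile_count = 0

def fake_compile(fn, shape):
    """Stand-in for real codegen: just count compilations per shape."""
    global compile_count
    compile_count += 1
    return fn  # a real JIT would return shape-specialized code here

def guarded_call(fn, xs):
    """Dispatch through a shape guard, compiling on cache miss."""
    shape = len(xs)                  # the "guard" key
    if shape not in compiled_cache:  # guard miss -> compile for this shape
        compiled_cache[shape] = fake_compile(fn, shape)
    return compiled_cache[shape](xs)

total = lambda xs: sum(xs)
guarded_call(total, [1, 2, 3])     # compiles once, for shape 3
guarded_call(total, [4, 5, 6])     # same shape: cache hit, no recompile
guarded_call(total, [1, 2, 3, 4])  # new shape 4 -> second compilation
```

Dynamic-shape support is essentially about making that guard (and the generated code) symbolic over the shape instead of specializing per concrete value.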

> XLA's GPU support is great, its compatible across different hardware, its optimized and mature. In short, its a great alternative to the often buggy torch.compile stack - if you fix the torch integration.

If you are an XLA maximalist, that's fine. I am not. There isn't evidence to prove out either opinion. PyTorch will never be nicely compatible with XLA while XLA has significant constraints that are incompatible with PyTorch's user-experience model. The PyTorch devs have given clear, written-down feedback to the XLA project on what it takes for XLA+PyTorch to get better, and it's been a few years and the XLA project still prioritizes other things.


> There are plenty of compilers that place restrictions that you barely notice. gcc, clang, nvcc -- they're fairly flexible, and "dynamic"

In the context of scientific computing, this is completely, blatantly false. We're not just lowering low-level IR to machine code; we want to perform certain mathematical processes, often distributed across a large number of nodes. There's a difference between ensuring optimization (i.e. no I/O bottlenecks, adequate synchronization between processes, overlapping computation with comms) and simply transforming a program into a different representation.

This is a classic [false analogy](https://simple.wikipedia.org/wiki/False_analogy).

Adding constraints does mean giving up flexibility, precisely because you have to work around them. For example, XLA is intentionally constrained against dynamic loops, because you lose a lot of performance and suffer a huge overhead with them. So the API forces you to think statically (though you can work around this with fancier methods, like checkpointing combined with the treeverse algorithm).
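To make the static-loop point concrete, here's a toy sketch in plain Python (no XLA; Newton's method just stands in for a real workload) of how a data-dependent loop gets rewritten into the fixed-trip-count, masked form that static compilers push you towards:

```python
# Illustrative sketch: under a static-shape compiler, a data-dependent
# `while` loop is often rewritten as a FIXED number of iterations plus a
# "done" mask, so the compiled program has a single known trip count.

def dynamic_sqrt_newton(x, tol=1e-10):
    """What you'd naturally write: loop until converged."""
    guess = x
    while abs(guess * guess - x) > tol:
        guess = 0.5 * (guess + x / guess)
    return guess

def static_sqrt_newton(x, max_iters=50, tol=1e-10):
    """Statically-shaped version: always run max_iters steps, but freeze
    the value once the convergence mask flips."""
    guess = x
    for _ in range(max_iters):                # trip count fixed at "compile" time
        done = abs(guess * guess - x) <= tol  # the mask
        update = 0.5 * (guess + x / guess)
        guess = guess if done else update     # a select(), not a branch, in a real compiler
    return guess
```

Note the cost: the static version always pays for `max_iters` iterations, and picking `max_iters` is now your problem, which is exactly the kind of work-around I mean.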

I'll need more clarification regarding this point, because I don't know what dev in which universe would not regard "constraints" as flying in the face of flexibility.

> popular HuggingFace models + the TIMM vision benchmark

Ah yes, benchmark it on models that are entirely static LLMs or convnet hybrids. Clearly a high requirement on dynamism and flexibility there.

(I'm sorry but that statement alone has lost you any credibility for me.)

> Your claim that its to mainly favor large labs is pretty puzzling.

Because large labs often play with the safest models, which usually means scaling them up (OAI, FAIR, GDM, etc.), and those tend to be self-attention/transformer-like workloads. The devs have been pretty transparent about this - you can DM them if you want - but their entire stack is optimized for these use cases.

And of course, that doesn't account for research workloads, which tend to be highly non-standard, dynamic, rather complex, and much, much harder to optimize for.

This is where the "favouring big labs" comes from.

> 1. it supports dynamic shapes

I agree that in the specifically narrow respect of dynamic shapes, it's better than XLA.

But then it also misses a lot of the optimization features XLA has, such as its new cost model and the Latency Hiding Scheduler (LHS), which is far better at asynchronously overlapping comms, computation, and even I/O (as it's lazy).
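For intuition on what latency hiding buys you - a toy sketch only, nothing like XLA's actual scheduler - here's the double-buffering pattern of overlapping the next transfer with the current computation:

```python
# Illustrative sketch of latency hiding: overlap the "communication" for
# chunk i+1 with the "compute" on chunk i. A single-worker thread pool
# stands in for an async DMA/collective engine.

import time
from concurrent.futures import ThreadPoolExecutor

def fetch(chunk):
    """Pretend network/IO transfer."""
    time.sleep(0.02)
    return chunk

def compute(data):
    """Pretend math on the fetched chunk."""
    time.sleep(0.02)
    return sum(data)

def pipelined(chunks):
    """Prefetch chunk i+1 while computing on chunk i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch, chunks[0])
        for nxt in chunks[1:] + [None]:
            data = future.result()               # wait for the in-flight fetch
            if nxt is not None:
                future = pool.submit(fetch, nxt) # launch the next transfer...
            results.append(compute(data))        # ...while we compute on this one
    return results
```

With N chunks, the sequential version costs N*(fetch + compute), while the pipelined one costs roughly fetch + N*max(fetch, compute) - that gap is what a latency-hiding scheduler finds for you automatically.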

> there is a trade-off between dynamic, flexible and performance

Exactly. Similarly, there's a difference in the features offered by each particular compiler. Torch's compiler's strengths may be XLA's weaknesses, and vice versa.

But it's not perfect - no software can be, and compilers certainly aren't exceptions. My issue is that the compiler is being leaned on at all in torch.

There are use cases where the torch.compile stack fails completely (I'm not sure how much you hang around more research-oriented forums), wherein some features simply do not work with torch.compile. I cited FSDP as the most egregious one, because it's so common in everyone's workflow.

That's the problem: torch is optimizing its compiler stack for certain workloads, with a lot of new features relying on it (look at the newly proposed DTensor API, for example).

If I'm a researcher with a non-standard workload, I should be able to enjoy those new features without relying on the compiler - because otherwise, it'd be painful to fix and restrict my code for that stack.

In short, I'm bottlenecked by the compiler's capabilities, which prevent me from fully utilizing all the features. This is what I don't like, and why torch shouldn't be leaning on a compiler at all.

It 'looks' like a mere tradeoff, but reality is just not as simple as that.

> XLA:GPU

I don't particularly care which compiler stack the torch devs choose - that's beside the point. Really, I just don't like the compiler-integrated approach at all. The choice of the specific stack doesn't matter.


Curious to know how the OP is biased when they have no conflict of interest, and they explicitly mention the `torch.compile` stack like... a few dozen times in the blog?

