> Principal component analysis of 200 GPT2, 500 Vision Transformers, 50 LLaMA-8B, and 8 Flan-T5 models reveals consistent sharp spectral decay - strong evidence that a small number of weight directions capture dominant variance despite vast differences in training data, objectives, and initialization.
Intuitively it makes sense that within each individual model a small number of weights/parameters dominate, but it's still super interesting that these can be swapped between all the models without loss of performance.
It isn’t obvious that these parameters are universal across all models.
This general idea shows up all over the place though. If you do 3D scans of thousands of mammal skulls, you'll find that a few PCs account for the vast majority of the variance. If you do frequency-domain analysis of various physiological signals... same thing. Ditto for many, many other natural phenomena in the world. Interesting (maybe not surprising?) to see it in artificial phenomena as well.
It's almost an artifact of PCA. You'll find "important" principal components everywhere you look. It takes real effort to construct a dataset where you don't. That doesn't mean though, for instance, that throwing away the less important principal components of an image is the best way to compress an image.
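A minimal numpy sketch of that point, with made-up sizes and a planted five-direction structure (not from the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 500, 100

    # Isotropic Gaussian data: no true dominant directions exist...
    X = rng.standard_normal((n, d))
    evals = np.linalg.eigvalsh(np.cov(X.T))[::-1]  # descending
    # ...yet sampling noise alone spreads the spectrum, so the top
    # components still look "important" next to the bottom ones.
    print("isotropic, top 3 variance ratios:", evals[:3] / evals.sum())

    # Data with 5 planted directions: here the cliff is real.
    W = rng.standard_normal((d, 5))
    Y = rng.standard_normal((n, 5)) @ W.T + 0.1 * rng.standard_normal((n, d))
    evals_y = np.linalg.eigvalsh(np.cov(Y.T))[::-1]
    print("planted, top 6 variance ratios:", evals_y[:6] / evals_y.sum())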
Not really. If the models are trained on different datasets - like one ViT trained on satellite images and another on medical X-rays - one would expect their parameters, which were randomly initialized, to be completely different, or even orthogonal.
Now I wonder how much this "Universal Subspace" corresponds to the same set of scraped Reddit posts and pirated books that apparently all the bigcorps used for model training. Is it 'universal' because it's universal, or because the same book-pirating torrents got reused all over?
Every vision task needs edge/contrast/color detectors and these should be mostly the same across ViTs, needing only a rotation and scaling in the subspace. Likewise with language tasks and encoding the basic rules of language which are the same regardless of application. So it is no surprise to see intra-modality shared variation.
The surprising thing is inter-modality shared variation. I wouldn't have bet against it but I also wouldn't have guessed it.
I would like to see interpretability work on whether these subspace vectors can be read as low-level or high-level abstractions. Are they picking up low-level "edge detectors" that are somehow invariant to modality (and if so, why?), or are they picking up higher-level concepts like distance vs. closeness?
It hints there may be common higher-level abstraction and compression processes in human consciousness.
The "human" part of that matters. This is all human-made data, collected from human technology, which was created to assist human thinking and experience.
So I wonder if this isn't so much about universals or Platonic ideals. More that we're starting to see the outlines of the shapes that define - perhaps constrict - our own minds.
> Datasets. We construct a diverse and high-quality collection of video datasets to train STARFlow-V. Specifically, we leverage the high-quality subset of Panda (Chen et al., 2024b) mixed with an in-house stock video dataset, with a total number of 70M text-video pairs.
That has nothing to do with it, and Apple wouldn't train on user content; they're not Google. If they ever did, there would be an opt-in at best. There's a reason they're walking and observing, not running and trying to be the forefront cloud-AI leader like some others.
I have them and like them. I don't wear them constantly, but on days when I'm doing something interesting, they help me document much more than I otherwise would.
Python has about 40 keywords; I'd say I regularly use about 30 and irregularly use about another 5. Hardly seems like a "junkyard".
Further, this lack of first-class support for lazy importing has spawned multiple CPython forks that implement their own lazy importing, or a modified version of the previously rejected PEP 690. Reducing the real-world need for forks seems worth the price of one keyword.
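For context, the closest stdlib approximation today is the `importlib.util.LazyLoader` recipe from the importlib docs, which is considerably clunkier than a keyword (the `lazy_import` helper name here is just for illustration):

    import importlib.util
    import sys

    def lazy_import(name):
        # The module object is created now, but its code only
        # runs on first attribute access.
        spec = importlib.util.find_spec(name)
        loader = importlib.util.LazyLoader(spec.loader)
        spec.loader = loader
        module = importlib.util.module_from_spec(spec)
        sys.modules[name] = module
        loader.exec_module(module)
        return module

    json = lazy_import("json")   # cheap: the module body hasn't run yet
    print(json.dumps({"a": 1}))  # it actually loads here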
Hard keywords:
False await else import pass
None break except in raise
True class finally is return
and continue for lambda try
as def from nonlocal while
assert del global not with
async elif if or yield
Soft keywords:
match case _ type
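Both lists come straight from the stdlib, so the count is easy to check:

    import keyword
    print(len(keyword.kwlist), keyword.kwlist)          # 35 hard keywords
    print(len(keyword.softkwlist), keyword.softkwlist)  # 4 soft keywords on 3.12+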
I think nonlocal/global are the only hard keywords I now barely use; of the soft ones, I rarely use pattern matching, so 5 seems like a good estimate.
> The choice to introduce a new `lazy` keyword reflects the need for explicit syntax. Lazy imports have different semantics from normal imports: errors and side effects occur at first use rather than at the import statement. This semantic difference makes it critical that laziness is visible at the import site itself, not hidden in global configuration or distant module-level declarations. The lazy keyword provides local reasoning about import behavior, avoiding the need to search elsewhere in the code to understand whether an import is deferred. The rest of the import semantics remain unchanged: the same import machinery, module finding, and loading mechanisms are used.
This functionality is highly desired, and it does appear to actually need a new (soft) keyword. Sorry you don't like it.
The PEP didn't mention considering reusing `async` instead of `lazy`. That would have conveyed the same thing to me without a new keyword, and would have been similar to HTML's usage of `async`.
It is a 'soft keyword' as the PEP explains. I would not think that this has any major impact on anyone who just chooses to ignore this feature. Assuming that you want this behavior, I wonder how this could have been done in a better fashion without now having 'lazy' in the specific context of an import statement.
soft keyword for anyone not familiar like I was ...
"A new soft keyword lazy is added. A soft keyword is a context-sensitive keyword that only has special meaning in specific grammatical contexts; elsewhere it can be used as a regular identifier (e.g., as a variable name). The lazy keyword only has special meaning when it appears before import statements..."
> Python is quickly turning into a crowded keyword junkyard
* Javascript (ECMAScript) has 63 keywords.
* Rust has 50 keywords.
* Java has 51 keywords + 17 contextually reserved words, for a total of 68.
* Python has now 35 keywords + 4 'soft' keywords, for a total of 39.
* Go has 25 keywords.
Speed matters everywhere. How much compute is spent on things that could easily be 100x faster than they are? Compare running a battery of unit tests with VMware plus pip versus Firecracker plus uv: it's orders of magnitude quicker, and avoids a whole suite of issues related to persistent state on the machine.
Possibly for some workflows, though personally I find the emphasis on speed baffling, and it's a big part of the reason I don't find most of these uv testimonials credible. I'm a regular Python user across multiple environments, and I've never considered waiting for pip to be a material part of my time; it's trivial to the point of being irrelevant. The fact that so many people come out of the woodwork to talk about how fast it is means one of two things. Either there's some big group somewhere with a niche use case that gets them bogged down in pip's dependency resolving or whatever else gets sped up (obviously the actual downloading can't be faster), or it's just a talking point that (presumably) Rust zealots who don't actually use Python arrive with en masse. Either way, it's an extremely ineffective way of promoting the product to most Python users, who don't have speed of package installation as anything close to a pain point.
It's fast enough that sometimes dependencies can be checked, resolved, and installed at program runtime rather than needing to be a separate step.
You can go from no virtual environment to just "uv run myfile.py", and it does everything that's needed, nearly instantly.
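For instance, a script can carry its own dependencies via PEP 723 inline metadata, which uv understands; `requests` here is just an example dependency:

    # /// script
    # dependencies = ["requests"]
    # ///
    import requests

    print(requests.get("https://example.com").status_code)

Then `uv run myfile.py` resolves and installs `requests` into an ephemeral environment and runs the script, with no manual venv step.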
lol who is using pip so much that 0.36s of startup time matters to them? If all uv does here is do nothing slightly faster, that's an absolutely meaningless benefit.
In general, whenever you introduce a cache to make software faster (along any dimension), you have to think about cache invalidation and eviction. If your software is fast enough to not need caching, this problem goes away.
It's funny because superior caching is also highly relevant to uv's outperformance. (But invalidation/eviction isn't generally a real problem for a cache of installed packages; the cache can be cleaned up whenever and just rebuilt, and the cache has a separate entry per version of a library, where each version is immutable.)
Isn't it obvious?