More

rytill · 2025-11-30T18:33:29 1764527609

It is not a "narrative", "philosophical paradigm", or him "getting high on his own supply". It is simply him sharing his thoughts about something.

BanditDefender · 2025-11-30T20:31:21 1764534681

He is in fact getting high on his own supply of narratives and philosophical paradigms. There are no facts in that entire blog post. It's a useless fart in the wind.

dang · 2025-12-01T02:26:02 1764555962

Could you please stop posting shallow and curmudgeonly dismissals? It's not what this site is for, and destroys what it is for.

If you want to make your substantive points without putdowns, that's fine, but please don't use this place to superciliously posture over others.

If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.

aeve890 · 2025-11-30T18:37:43 1764527863

Alright that's a valid answer. Thank you.

rytill · 2025-11-26T19:30:02 1764185402

It really doesn’t, at all. Every sentence has a clear, non-equivocative meaning and it doesn’t use any LLM tropes. Your LLM sensor is seriously faulty.

rytill · 2025-11-17T19:28:30 1763407710

What is the goal of doing that vs using L2 loss?

counters · 2025-11-17T23:10:25 1763421025

To add to the existing answers - L2 losses induce a "blurring" effect when you autoregressively roll out these models. That means you not only lose import spatial features, you also truncate the extrema of the predictions - in other terms, you can't forecast high-impact extreme weather with these models at moderate lead times.

lysecret · 2025-11-18T13:29:15 1763472555

Yes very good point this to me is one of the most magical elements of this loss how it suddenly makes the model "collapse" on one output and the predictions become sharp.

counters · 2025-11-18T17:20:31 1763486431

Yeah, it's underplayed in the the writeup here but the context here is important. The "sharpness" issue was a major impediment to improving the skill and utility of these models. When GDM published GenCast two years ago, there was a lot of excitement because the generative approach seemed to completely eliminate this issue. But, there was a trade-off - GenCast was significantly more expensive to train and run inference with, and there wasn't an obvious way to make improvements there. Still faster than an NWP model, but the edge starts to dull.

FGN (and NVIDIA's FourCastNet-v3) show a new path forward that balances inference/training cost without sacrificing the sharpness of the outputs. And you get well-calibrated ensembles if you run them with random seeds to their noise vectors, too!

This is a much bigger deal than people realize.

lysecret · 2025-11-17T19:33:44 1763408024

To encourage diversity between the different members in an ensemble. I think people are doing very similar things for MOE networks but im not that deep into that topic.

sunshinesnacks · 2025-11-17T21:41:22 1763415682

The goal of using CRPS is to produce an ensemble that is a good probabilistic forecast without needing calibration/post processing.

[edit: "without", not "with"]

rytill · 2025-09-23T19:55:32 1758657332

So, I have heard a number of people say this, and I feel like I'm the person in your conversations saying it's a coarse description and downplays the details. What I don't understand is, what specifically do we gain from thinking of it as a Markov chain.

Like, what is one insight beyond that LLMs are Markov chains that you've derived from thinking of LLMs as Markov chains? I'm genuinely very curious.

jltsiren · 2025-09-23T20:58:42 1758661122

It depends on if you already had experience in using large Markov models for practical purposes.

Around 2009, I had developed an algorithm for building the Burrows–Wheeler transform on (what was back then) very large scale. If you have the BWT of a text corpus, you can use it to simulate a Markov model with any context length. It tried that with a Wikipedia dump, which was amusing for a while but not interesting enough to develop further.

Then, around 2019, I was working in genomics. We were using pangenomes based on thousands of (human) haplotypes as reference genomes. The problem was that adding more haplotypes also added rare variants and rare combinations of variants, which could be misleading and eventually started decreasing accuracy in the tasks we were interested in. The standard practice was dropping variants that were too rare (e.g. <1%) in the population. I got better results with synthetic haplotypes generated by downsampling the true haplotypes with a Markov model (using the BWT-based approach). The distribution of local haplotypes within each context window was similar to the full set of haplotypes, but the noise from rare combinations of variants was mostly gone.

Other people were doing haplotype inference with Markov models based on similarly large sets of haplotypes. If you knew, for a suitably large subset of variants, whether each variant was likely absent, heterozygous, or homozygous in the sequenced genome, you could use the model to get a good approximation of the genome.

When ChatGPT appeared, the application was surprising (even though I knew some people who had been experimenting with GPT-2 and GPT-3). But it was less surprising on a technical level, as it was close enough to what I had intuitively considered possible with large Markov models.

rytill · 2025-08-06T17:01:59 1754499719

> Boglehead

> 140% gain on your holdings this year

Choose one.

cosmicgadget · 2025-08-06T17:22:58 1754500978

Generally true but nvda and pltr are normie stocks and can account for these returns from this year.

lerchmo · 2025-08-06T19:29:58 1754508598

Boggle head is basically pick 2-3 vanguard etfs and check back in 25 years.

Breza · 2025-08-15T13:58:59 1755266339

That's my approach. I got my quarterly statement in the mail yesterday. Looks like the market must have gone up over the past three months. Not sure what to do with this information since it's not like I'm going to change anything.

andrepd · 2025-08-06T17:57:02 1754503022

But then it's not a Boglehead lol

fyrabanks · 2025-08-06T22:05:16 1754517916

https://www.bogleheads.org/wiki/Passively_managing_individua...

I understand where you’re coming from, but there isn’t a incongruity. Individual stock investments are a relatively small part of my overall portfolio.

andrepd · 2025-08-06T22:45:30 1754520330

> The discussion here assumes that you are not trying to beat the market, but instead passively managing individual stocks to create your own "DIY index fund."

rytill · 2025-08-03T04:09:52 1754194192

Why would one have motivation to not use activation functions?

To my knowledge they’re a negligible portion of the total compute during training or inference and work well to provide non-linearity.

Very open to learning more.

russfink · 2025-08-03T04:16:37 1754194597

One reason might be expressing the constructs in a different domain, eg homomorphic encrypted evaluators.

mlnomadpy · 2025-08-10T19:42:58 1754854978

they are one of the reasons neural networks are blackbox, we lose information about the data manifold the deeper we go in the network, making it impossible to trace back the output

this preprint is not coming from a standpoint of optimizing the inference/compute, but from trying to create models that we can interpret in the future and control

julius · 2025-08-03T12:23:01 1754223781

Less information loss -> Less params? Please correct me if I got this wrong. The Intro claims:

"The dot product itself is a geometrically impoverished measure, primarily capturing alignment while conflating magnitude with direction and often obscuring more complex structural and spatial relationships [10, 11, 4, 61, 17]. Furthermore, the way current activation functions achieve non-linearity can exacerbate this issue. For instance, ReLU (f (x) = max(0, x)) maps all negative pre-activations, which can signify a spectrum of relationships from weak dissimilarity to strong anti-alignment, to a single zero output. This thresholding, while promoting sparsity, means the network treats diverse inputs as uniformly orthogonal or linearly independent for onward signal propagation. Such a coarse-graining of geometric relationships leads to a tangible loss of information regarding the degree and nature of anti-alignment or other neg- ative linear dependencies. This information loss, coupled with the inherent limitations of the dot product, highlights a fundamental challenge."

mlnomadpy · 2025-08-10T19:45:33 1754855133

yes, since you can learn to represent the same problem with less amount of params, however most of the architectures are optimized for the linear product, so we gotta figure out a new architecture for it

rytill · 2025-07-21T20:39:41 1753130381

Can you please explain the insight about reduced workweeks you are deriving from what you've linked? It is not obvious to me.

mykarakus · 2025-07-28T19:23:58 1753730638

I was trying to point out that the productivity has been steadily increasing without an obvious benefit to the workers (such as pay increase), so basically workers have been producing more without getting a larger share of that increase. Therefore, keeping all the other things constant, reduced workdays might be a benefit for that increased productivity.

rytill · 2025-07-28T21:09:55 1753736995

Okay, that makes perfect sense. Thanks for the explanation!

rytill · 2025-05-22T19:27:35 1747942055

> we’d just have to do it

Highly economically disincentivized collective actions like “pulling the plug on AI” are among the most non-trivial of problems.

Using the word “just” here hand waves the crux.

rytill · 2025-05-10T19:54:38 1746906878

Can you list some long textbooks on a single subject that are amazing?

ayrtondesozzla · 2025-05-11T00:30:44 1746923444

PAIP

RE4B

rytill · 2025-05-08T00:47:17 1746665237

Don’t forget the training data!

caseyy · 2025-05-08T01:08:14 1746666494

We are far from open training data... training data might even be incriminating.

echelon · 2025-05-08T01:51:42 1746669102

100%, though I still feel as though open training data will eventually become a thing. It'll have to be mostly new data, synthetic data, or meticulously curated from public domain / open data.

Synthetic training data sets, even robotically-acquired real world "synthetic" data, can rapidly create training sets. It's just a matter of coordinating these efforts and building high quality data.

I've made a few data sets using Unreal Engine, and I've been wanting to put various objects on turn tables and go out on backpack 3D scan adventures.

Someone will have to pay for it, though.