I'm confused about their definition of RL.

> ... SFT is a subset of RL.

> The first thing to note about traditional SFT is that the responses in the examples are typically human written. ... But it is also possible to build the dataset using responses from the model we’re about to train. ... This is called Rejection Sampling.

I can see why someone might say there's overlap between RL and SFT (or semi-supervised FT), but how is "traditional" SFT considered RL? What is not RL then? Are they saying all supervised learning is a subset of RL, or only if it's fine tuning?
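
(For context, the rejection-sampling recipe quoted above seems to amount to something like the sketch below -- the helper names are my own placeholders, not the article's:)

    # Rough sketch of the rejection-sampling recipe described above;
    # generate_fn and reward_fn are hypothetical placeholders, not the article's code.
    def rejection_sample_dataset(prompts, generate_fn, reward_fn, k=8):
        dataset = []
        for prompt in prompts:
            candidates = [generate_fn(prompt) for _ in range(k)]        # responses from the model itself
            best = max(candidates, key=lambda r: reward_fn(prompt, r))  # keep the best-scoring one
            dataset.append((prompt, best))                              # fine-tune on these pairs as ordinary SFT data
        return dataset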


> I can see why someone might say there's overlap between RL and SFT (or semi-supervised FT), but how is "traditional" SFT considered RL? What is not RL then? Are they saying all supervised learning is a subset of RL, or only if it's fine tuning?

Sutton and Barto define reinforcement learning as "learning what to do -- how to map situations to actions -- so as to maximize a numerical reward signal". This is from their textbook on the topic.

That's a pretty broad definition. But the general formulation of RL involves a state of the world and the ability to take different actions given that state. In the context of an LLM, the state could be what has been said so far, and the action could be what token to produce next.

But as you noted, if you take such a broad definition of RL, tons of machine learning is also RL. When people talk about RL they usually mean the more specific thing of letting a model go try things and then be corrected based on the observations of how that turned out.

Supervised learning defines success by matching the labels. Unsupervised learning is about optimizing a known mathematical objective (for example, predicting the likelihood that words appear near each other). Reinforcement learning maximizes a reward function that may not be directly known to the model; the model learns by trying things, observing how they turn out, and receiving a reward or penalty.
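
To make that distinction concrete, here is a minimal sketch (my own pseudocode, not from any textbook or the linked article) of the two updates side by side: SFT maximizes the likelihood of a fixed reference response, while a simple policy-gradient step samples from the model and weights the same likelihood term by an observed reward.

    # Minimal sketch; logprob_fn(prompt, response) -> log-probability the model
    # assigns to response, sample_fn(prompt) -> a response sampled from the model.
    # Both helpers are hypothetical placeholders.

    def sft_loss(logprob_fn, prompt, reference_response):
        # Supervised fine-tuning: push up the likelihood of a fixed target response.
        return -logprob_fn(prompt, reference_response)

    def reinforce_loss(logprob_fn, sample_fn, reward_fn, prompt):
        # RL (REINFORCE-style): sample from the model itself, then weight the same
        # log-likelihood term by whatever reward that sample earned.
        response = sample_fn(prompt)
        return -reward_fn(prompt, response) * logprob_fn(prompt, response)

Framed this way, "traditional" SFT looks like the degenerate case where the samples come from humans and every sample implicitly gets reward 1, which is presumably the sense in which the article calls it a subset of RL.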


A couple of things I've seen go by that make the connection. I haven't looked at them closely enough to have an opinion.

> https://arxiv.org/abs/2507.12856

> https://justinchiu.netlify.app/blog/sftrl/


You seem to have a very narrow view of what is a relevant or a valid comment. Just because a counterargument doesn't completely refute the original comment, or "introduces" new concepts, doesn't make it irrelevant or "misdirection".

Someone compared treatment of X 20 years ago to treatment of Y today -- seems pretty natural to bring up treatment of X more recently. You can't just say "the original comment didn't mention it so you can't mention it either".

I don't see how your accusations of bad faith are warranted.


If your doubts are true, why did you have to introduce an analogy?

I feel I have a pretty open view.

And, there's no definition of a 'valid' comment. I'll address points, raise points or whatever. Sorry, I'm not in the realm of 'valid' comments, never was.


> why did you have to introduce an analogy?

GP made a comparison! How do you refute a comparison without criticising the comparison?

> I'm not in the realm of 'valid' comments, never was

Claiming misdirection is an argument about validity.


Indentation of comments also means something. GP was not responding to the neuroscience comment.


It's still pretty meaningless unless you have some idea what the previous profit was (which the average headline reader probably doesn't). 95% of last quarter's profit could be a trivial amount or a huge amount. If last quarter's profit was $100 and this quarter's was $5, describing it as a "plunge" is misleading, because in absolute terms nothing meaningful has changed.


Maybe they don't believe there are no elephants in Germany.

If someone came up to you on the street and said "You are an elephant. Are you an elephant?" you wouldn't say "yes".


The humanities offer 'genre' as a solution to this problem. A test, we would say, might be expected to seek formal proof based on self-contained assertions. As such, it is different to a conversation in the street.

This leaves the test author with the challenge of establishing genre as the test's rules of engagement.


As an AI language model, I cannot be an elephant since I don’t have a physical body or consciousness.

Well, it was worth a try.


Well...I could be the elephant in the room...


I agree that treating production vs non-production as a dichotomy can be problematic, but that doesn't mean some systems aren't more sensitive than others.

Also, security is not one-dimensional. A system's required level of confidentiality might be very different from its required level of availability. Being explicit about this might be better than trying to lump different requirements into a "production" label.


> is best avoided, phobia or no.

Strongly agree. I briefly developed a morbid fascination a few years ago, which then developed into a rather unpleasant sort of phobia/aversion.


> It's really not a lot to grok, at least by most other language's standards.

Yes, but people have been attracted to Python largely because it's not like a lot of other languages. It is/was concise, simple, dynamic, and fairly easy to learn. I think some of the new features, even if they don't make it a worse language, make it less "Pythonic", and so tend to undermine its comparative advantage. For experienced programmers the new features might not seem complicated, but Python is used by a lot of people who are not in that category, including people for whom software development isn't their primary job.


I agree for the most part; I have not seen much use of the walrus operator or of many other new features from the last two major Python releases.

Nonetheless, as a user of the language I just don't see people trying to contort their code to use these things. The community has less attraction to flashy features than other language communities do, so I don't see people feeling compelled to use things they don't care for.
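
(For anyone who hasn't run into it, the walrus operator just binds a name inside an expression -- a toy example, nothing more:)

    # := assigns and returns the value in one expression (Python 3.8+).
    import re

    if (m := re.search(r"\d+", "order 42")):
        print(m.group())  # prints "42"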


Indeed. Any suggestion that economists (neo-classical or otherwise) don't care enough about relative price changes is utterly ridiculous. They just don't call them inflation.


I can't see where they criticize the term "black box". So, ironically, this is just a complaint that they used some terminology you don't approve of.

