Hacker News
swordsmith 4 months ago | on: The inefficiency of RL, and implications for RLVR ...
Seems like he thinks RLVR == learning from a single binary reward for the whole chain, completely discounting techniques that provide denser rewards, like process reward supervision?
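For anyone unfamiliar with the distinction, here is a rough toy sketch of the two reward shapes. The function names and the marker-based `toy_step_scorer` are made up for illustration; a real process reward model would be a learned scorer, and none of this is taken from the linked article.

```python
from typing import Callable, List


def outcome_rewards(steps: List[str], is_correct: bool) -> List[float]:
    """Sparse signal: every step gets 0, and the final step carries one binary reward."""
    if not steps:
        return []
    return [0.0] * (len(steps) - 1) + [1.0 if is_correct else 0.0]


def process_rewards(steps: List[str], step_scorer: Callable[[str], float]) -> List[float]:
    """Dense signal: each step is scored on its own by a step-level scorer."""
    return [step_scorer(step) for step in steps]


def toy_step_scorer(step: str) -> float:
    """Hypothetical stand-in for a process reward model: flags steps marked as wrong."""
    return 0.0 if "ERROR" in step else 1.0


steps = ["2 + 3 = 5", "5 * 4 = 21 ERROR", "answer: 21"]

sparse = outcome_rewards(steps, is_correct=False)
dense = process_rewards(steps, toy_step_scorer)

print(sparse)
print(dense)
```

With the outcome-only reward, the whole chain shares a single bit of feedback at the end. With per-step scores, the signal points at the step that went wrong, which is the "denser" part.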