
Hey OP, I found some issues with your code:

During SFT, it uses the full training dataset[1]:

df = pd.read_csv('data/extraction_training_data.csv')

And during evaluation, it uses a slice from the middle of that same dataset[2]:

df = pd.read_csv('data/extraction_training_data.csv')

df = df[100000:100000+NUM_TEST_SAMPLES]
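To make the overlap concrete, here is a quick check (a sketch: it assumes both scripts read the same CSV, and NUM_TEST_SAMPLES below is just a stand-in for whatever value the eval script actually uses):

  import pandas as pd

  NUM_TEST_SAMPLES = 1000  # placeholder; use the eval script's value

  df = pd.read_csv('data/extraction_training_data.csv')
  train_df = df                                    # everything goes into SFT
  eval_df = df[100000:100000 + NUM_TEST_SAMPLES]   # the slice the eval uses

  # By construction, every evaluation row is also a training row.
  overlap = train_df.index.intersection(eval_df.index)
  print(f"{len(overlap)} of {len(eval_df)} eval rows are in the SFT training data")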

Also, you split train/test/val by chunk rather than by document[3]. As a result, the model has already "seen" the documents you're evaluating it on, even if it wasn't trained on those exact chunks.
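A document-level split would avoid this. Rough sketch (I'm assuming the CSV has some per-document id column; 'document_id' here is hypothetical):

  import pandas as pd
  from sklearn.model_selection import GroupShuffleSplit

  df = pd.read_csv('data/extraction_training_data.csv')

  # Hold out 20% of documents (not chunks): every chunk of a given document
  # lands entirely in train or entirely in test.
  splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
  train_idx, test_idx = next(splitter.split(df, groups=df['document_id']))
  train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]

  # Sanity check: no document contributes chunks to both sides.
  assert set(train_df['document_id']).isdisjoint(set(test_df['document_id']))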

[1]: https://github.com/herniqeu/extract0/blob/0f8696a6fb1b620658...

[2]: https://github.com/herniqeu/extract0/blob/0f8696a6fb1b620658...

[3]: https://github.com/herniqeu/extract0/blob/0f8696a6fb1b620658...



Yes, this is the main concern I have with this result as well.

In other words, rather than plucking different leaves (augmentations) from the same branch or tree (source dataset), you should be evaluating on an entirely different tree.

In essence, this paper has no validation dataset: it only has a training dataset, and it evaluates on a subpopulation of that dataset (even though that subpopulation itself was never trained on).
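At a minimum you'd want an actual train/val/test split over documents, ideally with the test documents drawn from a different corpus entirely. A sketch, reusing the hypothetical 'document_id' column from the comment above:

  import numpy as np
  import pandas as pd

  df = pd.read_csv('data/extraction_training_data.csv')

  # Carve out train/val/test at the document level, 80/10/10.
  rng = np.random.default_rng(42)
  doc_ids = df['document_id'].unique()
  rng.shuffle(doc_ids)

  n = len(doc_ids)
  train_docs = set(doc_ids[:int(0.8 * n)])
  val_docs = set(doc_ids[int(0.8 * n):int(0.9 * n)])
  test_docs = set(doc_ids[int(0.9 * n):])

  train_df = df[df['document_id'].isin(train_docs)]
  val_df = df[df['document_id'].isin(val_docs)]
  test_df = df[df['document_id'].isin(test_docs)]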



