
Hey OP, I found some issues with your code:

During SFT, it uses the full training dataset[1]:

df = pd.read_csv('data/extraction_training_data.csv')

And during evaluation, it uses a slice from the middle of that same dataset[2]:

df = pd.read_csv('data/extraction_training_data.csv')

df = df[100000:100000+NUM_TEST_SAMPLES]
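To make the overlap concrete, here is a quick check (a sketch: it assumes both scripts read the same CSV, and NUM_TEST_SAMPLES below is just a stand-in for whatever value the eval script actually uses):

  import pandas as pd

  NUM_TEST_SAMPLES = 1000  # placeholder; use the eval script's value

  df = pd.read_csv('data/extraction_training_data.csv')
  train_df = df                                    # everything goes into SFT
  eval_df = df[100000:100000 + NUM_TEST_SAMPLES]   # the slice the eval uses

  # By construction, every evaluation row is also a training row.
  overlap = train_df.index.intersection(eval_df.index)
  print(f"{len(overlap)} of {len(eval_df)} eval rows are in the SFT training data")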

Also, you split train/test/val by chunk rather than by document[3]. As a result, the model has already "seen" the documents you're evaluating it on, even if it wasn't trained on those exact chunks.
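A document-level split would avoid this. Rough sketch (I'm assuming the CSV has some per-document id column; 'document_id' here is hypothetical):

  import pandas as pd
  from sklearn.model_selection import GroupShuffleSplit

  df = pd.read_csv('data/extraction_training_data.csv')

  # Hold out 20% of documents (not chunks): every chunk of a given document
  # lands entirely in train or entirely in test.
  splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
  train_idx, test_idx = next(splitter.split(df, groups=df['document_id']))
  train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]

  # Sanity check: no document contributes chunks to both sides.
  assert set(train_df['document_id']).isdisjoint(set(test_df['document_id']))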

[1]: https://github.com/herniqeu/extract0/blob/0f8696a6fb1b620658...

[2]: https://github.com/herniqeu/extract0/blob/0f8696a6fb1b620658...

[3]: https://github.com/herniqeu/extract0/blob/0f8696a6fb1b620658...



Yes, this is the main concern I have with this result as well.

In other words, rather than plucking different leaves (augmentations) from the same branch or tree (source dataset), you should be evaluating on an entirely different tree.

In essence, this paper has no validation dataset: it only has a training dataset, and it evaluates on a subpopulation of that dataset (even though that subpopulation itself was never trained on).
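At a minimum you'd want an actual train/val/test split over documents, ideally with the test documents drawn from a different corpus entirely. A sketch, reusing the hypothetical 'document_id' column from the comment above:

  import numpy as np
  import pandas as pd

  df = pd.read_csv('data/extraction_training_data.csv')

  # Carve out train/val/test at the document level, 80/10/10.
  rng = np.random.default_rng(42)
  doc_ids = df['document_id'].unique()
  rng.shuffle(doc_ids)

  n = len(doc_ids)
  train_docs = set(doc_ids[:int(0.8 * n)])
  val_docs = set(doc_ids[int(0.8 * n):int(0.9 * n)])
  test_docs = set(doc_ids[int(0.9 * n):])

  train_df = df[df['document_id'].isin(train_docs)]
  val_df = df[df['document_id'].isin(val_docs)]
  test_df = df[df['document_id'].isin(test_docs)]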



