
Is there not yet a source where the web has already been scraped and boiled down to just the text? It would seem someone would have created such a thing to save LLM training from having to reinvent the wheel.

I understand the web is a dynamic thing, but it would still seem useful on some level.



Common Crawl, maybe?



