
There's been a large push toward server-side rendering for web pages, which means many companies no longer expose a publicly facing API for the data they display on their websites.

Parsing the rendered HTML is the only way to extract the data you need.
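For anyone who hasn't done this, the basic pattern is just fetch-and-parse; a minimal sketch (the URL and CSS selectors here are made up for illustration):

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical server-rendered page with no public JSON API behind it.
    resp = requests.get("https://example.com/products", timeout=10)
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    # Selector is illustrative; real sites need site-specific selectors,
    # which tend to break whenever the markup changes.
    for row in soup.select("table.products tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:
            print(cells)

The fragility of those selectors is a big part of why screenshot-based approaches are appealing.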



I've had good success running Playwright screenshots through EasyOCR, so parsing the DOM isn't the only way to do it. Granted, tables end up pretty messy...
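If anyone wants to try it, the pipeline is just two steps; a rough sketch assuming Playwright and EasyOCR are installed (pip install playwright easyocr, then playwright install chromium):

    from playwright.sync_api import sync_playwright
    import easyocr

    # Step 1: render the page in a headless browser and screenshot it.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com")  # placeholder URL
        page.screenshot(path="page.png", full_page=True)
        browser.close()

    # Step 2: OCR the screenshot. readtext returns (bbox, text, confidence)
    # tuples; EasyOCR downloads its models on first run.
    reader = easyocr.Reader(["en"])
    for bbox, text, conf in reader.readtext("page.png"):
        print(f"{conf:.2f}  {text}")

The bounding boxes are what you'd use to reconstruct table layout, which is where it gets messy.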


We've been doing something similar for VLM Run [1]. A lot of websites with obfuscated HTML/JS or rendered charts/tables are hard to parse through the DOM. Taking screenshots is definitely more reliable and future-proof, since these webpages are built for humans to interact with.

That said, the costs can be high, as the OP says, but we're building cheaper and more specialized models for web screenshot -> JSON parsing.
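To make that concrete, here's roughly what the screenshot -> JSON pattern looks like with a general-purpose vision model standing in (this uses the OpenAI API purely as an illustration, not VLM Run's actual API, and the prompt/model names are just examples):

    import base64
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    with open("page.png", "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    # Ask the model to transcribe the screenshot into structured JSON.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any vision-capable model works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract every product name and price visible "
                         "in this screenshot as a JSON array."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)

A specialized model mostly buys you lower cost per page and more reliable adherence to a JSON schema than a general chat model.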

Also, it turns out you can do a lot more than just web-scraping [2].

[1] https://vlm.run

[2] https://docs.vlm.run/introduction



