
If you just want the text, there are other ways to get it. You could dump document.body.innerText, for example. Here's how to do that with shot-scraper: https://shot-scraper.datasette.io/en/stable/javascript.html

    shot-scraper javascript youtube.com 'document.body.innerText' -r
Output: https://gist.github.com/simonw/f497c90ca717006d0ee286ab086fb...

Or access the accessibility tree of the page using https://shot-scraper.datasette.io/en/stable/accessibility.ht...

    shot-scraper accessibility youtube.com
Output here: https://gist.github.com/simonw/5174380dcd8c979af02e3dd74051a...


Of course, if the document is using the outline in unexpected ways, you'll run into trouble. Consider Facebook infamously splitting "Advertisement" into multiple spans to avoid tripping ad blockers.
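To make that failure mode concrete, here is a minimal sketch using only the Python standard library. The markup in SAMPLE_HTML is a made-up illustration of the span-splitting trick, not Facebook's actual HTML, and collect_text_nodes is a hypothetical helper name. It shows that matching against individual text nodes misses the word, while matching against the concatenated text (roughly what an innerText dump gives you in this simple case) still finds it.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects each non-empty text node as a separate list entry."""

    def __init__(self):
        super().__init__()
        self.text_nodes = []

    def handle_data(self, data):
        # Called once per run of text between tags.
        if data.strip():
            self.text_nodes.append(data)


def collect_text_nodes(html):
    """Hypothetical helper: return the text nodes of an HTML fragment."""
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.text_nodes


# Assumed markup, for illustration only: one word split across three spans.
SAMPLE_HTML = "<div><span>Adver</span><span>tise</span><span>ment</span></div>"

nodes = collect_text_nodes(SAMPLE_HTML)

# Checking element by element never sees the whole word.
per_node_match = any("Advertisement" in node for node in nodes)

# Joining the text nodes first recovers it.
joined_match = "Advertisement" in "".join(nodes)

print(nodes, per_node_match, joined_match)
```

A real page can be harder than this: the sketch ignores CSS, so it would also pick up text from elements a browser hides, which innerText would skip.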



