
hey intev,

- Crawlee has out-of-the-box support for headless browser crawling (Playwright). You don't have to install any plugin or set up middleware.
- Crawlee has a minimalistic & elegant interface: you can set up your scraper in fewer than 10 lines of code (see the sketch below). You don't have to figure out which middleware or settings need to be changed, and on top of that we also have templates, which make the learning curve much smaller.
- Complete type hint coverage, which is something Scrapy hasn't finished yet.
- Based on standard asyncio. Integrating Scrapy into a classic asyncio app requires bridging Twisted and asyncio, which is possible, but not easy, and can cause trouble.
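
As a rough illustration of the "fewer than 10 lines" claim, here is a minimal sketch of a Playwright-based Crawlee scraper. The import path and keyword arguments are assumptions and may differ between Crawlee versions:

```python
import asyncio

# Assumed import path; some Crawlee releases expose these classes elsewhere.
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(max_requests_per_crawl=50)

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        # Log the page being processed and follow the links found on it.
        context.log.info(f'Crawling {context.request.url}')
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```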



> You don't have to install any plugin or set up the middleware.

That cuts both ways, in true 80/20 fashion: it also means that anyone who isn't on the happy path Crawlee was designed for is going to have to edit your Python files (`pip install -e` type business) to achieve their goals.


I've been working on a crawler recently, and honestly you need the flexibility middleware gives you. You can only get so far with reasonable defaults; crawling isn't a one-size-fits-all kind of thing.
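
For a concrete sense of what that flexibility looks like in Scrapy, here is a hedged sketch of a downloader middleware that retries apparently blocked responses through a different proxy. The class name, status codes, and proxy-pool behavior are illustrative assumptions, not a drop-in solution:

```python
# Hypothetical downloader middleware; enable it in settings.py with something like:
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RetryBlockedProxyMiddleware": 543}
from scrapy import Request, Spider
from scrapy.http import Response


class RetryBlockedProxyMiddleware:
    """Retry requests whose responses look like proxy block pages."""

    def process_response(self, request: Request, response: Response, spider: Spider):
        if response.status in (403, 429):  # assumed "blocked" indicators
            spider.logger.info("Proxy looks blocked for %s, retrying", request.url)
            retry = request.replace(dont_filter=True)  # skip the dupefilter on retry
            retry.meta.pop("proxy", None)  # let the proxy layer assign a fresh proxy
            return retry  # Scrapy reschedules returned Requests
        return response
```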


Crawlee isn’t any less configurable than Scrapy. It just uses different, and in my personal opinion more approachable, patterns. That makes it easier to start with, but you can still tweak whatever you want. Btw, you can add middleware-style logic via the Crawlee Router.
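
To make the Router point concrete, here is a hedged sketch of label-based handlers in Crawlee, which is roughly where the per-request logic Scrapy users put in middleware ends up. The 'DETAIL' label, selector, and extracted fields are assumptions for illustration:

```python
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler()


@crawler.router.default_handler
async def listing_handler(context: PlaywrightCrawlingContext) -> None:
    # Send matching links to the DETAIL handler below (selector is assumed).
    await context.enqueue_links(selector='a.product', label='DETAIL')


@crawler.router.handler('DETAIL')
async def detail_handler(context: PlaywrightCrawlingContext) -> None:
    # Per-request logic lives in handlers rather than middleware classes.
    title = await context.page.title()
    await context.push_data({'url': context.request.url, 'title': title})
```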


> Crawlee isn’t any less configurable than Scrapy.

Oh, then I have obviously overlooked how one would determine that a proxy has been blocked and evict it from the pool <https://github.com/rejoiceinhope/scrapy-proxy-pool/blob/b833...>. Or how to use an HTTP cache independent of the "browser" cache, e.g. to short-circuit the actual request when I can prove it isn't stale for my needs, which enables recrawls to fix logic bugs, or even to download the actual request-response payloads for making better tests: https://docs.scrapy.org/en/2.11/topics/downloader-middleware...
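
For reference, the Scrapy HTTP cache that link points at is configured purely through settings. A minimal sketch, with values that are assumptions tuned for the "replay responses on recrawl" use case described above:

```python
# settings.py sketch: Scrapy's HttpCacheMiddleware, independent of any browser cache.
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"    # stored under the project's .scrapy directory
HTTPCACHE_EXPIRATION_SECS = 0  # 0 = cached responses never expire
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"  # ignore Cache-Control headers
```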

Unless you meant what I said about "pip install -e && echo glhf", in which case, yes, it's a "simple matter of programming" into a framework that was not designed to be extended.


Cache management is also what I had in mind. I've been using golang+colly, and the default caching behavior is just different enough from what I need. I haven't written a custom cache middleware yet, but I'm getting to that point.




