Ask HN: Do you scrape data? What do you use it for?
7 points by jmaccabee on April 17, 2018 | hide | past | favorite | 8 comments
I've got a theory that in 2018, most people or companies can use web scraping tools to make some aspect of their lives easier.

For example, I wrote a script a few months back to scrape transit data and send me a text when the subway is delayed for my morning commute. Maybe you're helping your company scrape competitor prices to know when they change?

So - are you scraping data today? As part of a hobby project or for professional purposes? What do you use it for? And if you don't, why not?



I do scrape. We are making a sort of meta-search engine for long-term car rental, so to bootstrap it we are scraping offers from various sites and directing users to the original websites to actually go through with the offer.

If anyone is interested in more detail, I can explain further.


Go on


OK, here goes.

The scrapers are written in Python 2, because that's what the guys who started the project were familiar with.

Most of the scraping is done by hand with XPath queries on the pages we fetch, so no BeautifulSoup or anything like that. Again, I think that's mostly because those who started the project weren't familiar with the usual tooling. It's not even that bad when I need to modify something (because a page changes, etc.), as the code is very well written.
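The hand-written XPath approach might look something like this. A real scraper would probably use lxml against a fetched page; here the stdlib ElementTree's XPath subset and an inline, made-up snippet stand in, and the selectors are hypothetical:

```python
import xml.etree.ElementTree as ET

# In the real scraper the page body would come from an HTTP fetch;
# this inline snippet (and its markup) is invented for the sketch.
page = ET.fromstring("""
<div class="offer">
  <span class="price">199</span>
  <span class="model">Fiat Panda</span>
</div>
""")

# One hand-written XPath query per field.
prices = [e.text for e in page.findall(".//span[@class='price']")]
models = [e.text for e in page.findall(".//span[@class='model']")]
offers = [{"model": m, "price": int(p)} for m, p in zip(models, prices)]
# offers == [{"model": "Fiat Panda", "price": 199}]
```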

The problems started when the CEO and CTO proposed (mandated?) that we use something built by a guy who is supposed to be a web scraping expert in the same domain we're working in (I don't doubt that, but still...). The software he gave us is written in Ruby (which no one here had ever seen a line of) and Rails, and works with recipes instead of the imperative code + XPath we used at the beginning.

It works flawlessly until it doesn't, mainly because there's a big logical error: if an offer disappears from the original website we should mark it as deleted in our DB, but the scraper tells us it's still there. I don't have time to learn Ruby, Rails, and the whole system to fix this, and the original dev isn't available anymore. So we're phasing that out and going back to our nice land of Python :)
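The mark-as-deleted logic described above boils down to a set difference between the offer IDs already in the database and the IDs in the latest scrape (function and variable names here are hypothetical, not from the actual codebase):

```python
def offers_to_mark_deleted(db_offer_ids, scraped_offer_ids):
    """Return IDs present in our DB but absent from the latest scrape."""
    return set(db_offer_ids) - set(scraped_offer_ids)

# Offer "b" vanished from the site, so it should be marked deleted.
gone = offers_to_mark_deleted({"a", "b", "c"}, {"a", "c"})
# gone == {"b"}
```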

Anyway the process goes like this:

1. Scraper fetches the page, scrapes the data, and generates a JSON file with the info of all the offers it finds

2. Those JSON files are uploaded on S3

3. A trigger on S3 calls the "writer" on an EC2 instance, which downloads the JSON file, unpacks the content, and writes the data to a Postgres database.
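A stripped-down sketch of step 3's writer, with SQLite standing in for Postgres so it runs self-contained, and a hypothetical `offers` schema (the real schema isn't shown in the thread):

```python
import json
import sqlite3

def write_offers(conn, json_payload):
    """Unpack a scraper's JSON payload and upsert each offer."""
    offers = json.loads(json_payload)
    with conn:  # one transaction per payload
        for o in offers:
            conn.execute(
                "INSERT OR REPLACE INTO offers (id, model, price) VALUES (?, ?, ?)",
                (o["id"], o["model"], o["price"]),
            )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE offers (id TEXT PRIMARY KEY, model TEXT, price INTEGER)")
# In production this payload would be downloaded from S3.
payload = json.dumps([{"id": "a1", "model": "Fiat Panda", "price": 199}])
write_offers(conn, payload)
```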

Current problem: scraping arbitrary strings representing car options and categorizing them. We have something like 15,000 strings that need to be put inside a category. Manually.
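A first pass at that categorization can often be automated with keyword rules, leaving a manual queue only for the leftovers (the keywords and categories below are invented for illustration):

```python
# Hypothetical keyword -> category rules; anything unmatched goes to a
# manual-review bucket instead of being hand-sorted up front.
RULES = {
    "air conditioning": "comfort",
    "climate": "comfort",
    "bluetooth": "infotainment",
    "navigation": "infotainment",
    "abs": "safety",
    "airbag": "safety",
}

def categorize(option):
    text = option.lower()
    for keyword, category in RULES.items():
        if keyword in text:
            return category
    return "needs-manual-review"
```

Even a modest rule set tends to cover the bulk of 15,000 strings, shrinking the truly manual work considerably.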


Ad scraping can be very profitable.

See: https://adplexity.com/


I don't understand something. How do they know how many impressions ads get? Are they just scraping, or do they have access to ad networks' stats?


It's determined based on guesswork.

If you scrape a website, you'll know what % of the time each ad shows up.

Now, if you know the distribution for the top N ads, you can skip everything, go straight to the ad network, and check the total impression volume for a particular website/GEO. You can pretend to be an advertiser and easily get access to this info.

Now that you know the distribution and the total, you can work out the impression count for each ad.
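The arithmetic being described: multiply each ad's observed share of appearances by the site's total impression volume. The shares and total below are made-up numbers:

```python
# Observed share of scrapes in which each ad appeared (hypothetical).
ad_share = {"ad_a": 0.50, "ad_b": 0.30, "ad_c": 0.20}

# Total impressions for the website/GEO, as reported by the ad network.
total_impressions = 1_000_000

# Estimated impressions per ad = share * total.
estimated = {ad: round(share * total_impressions) for ad, share in ad_share.items()}
# estimated == {"ad_a": 500000, "ad_b": 300000, "ad_c": 200000}
```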


ty


How profitable is AdPlexity? It's cited here as an example.



