Hacker News | TheEspion's comments

Yes!


That's interesting. You should advertise that.

For anyone else wondering, it is actually spelled Incapsula. I would edit my original comment if I still could.


The entire anonymous IP infrastructure is indeed built in. Yes, performance is lower than with non-JS-enabled scraping. To counter this, we offer a variety of options for turning performance trade-offs on and off. They can be set per job and per page.


That's a feature we never thought of. I can't promise it will be available at release time, but quite possibly later next year.


We approach this the same way e-commerce or gaming companies approach fraud: we'll actively monitor the infrastructure and set up safeguards to prevent spam and other illegitimate activities.


You gain access to the page's own DOM and JavaScript, which lets you call the site's functions to fetch data if it helps you. You can also code your scraper with the same techniques you use when building a page: jQuery, CSS selectors, etc. Basically, all the good interfaces that have been developed over the past 20 years are available to your scraping code.
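
To illustrate the idea of selector-style extraction, here is a minimal sketch. It is not Espion's API and it does not run inside a browser: it is a stand-in written with Python's standard-library `html.parser`, and the class name `ClassTextExtractor` and the sample markup are made up for the example. It collects the text of every element that carries a given CSS class, roughly what a `.price` selector would give you in page code.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given CSS class (hypothetical helper)."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.results = []
        self._depth = 0    # nesting depth inside the current matching element
        self._buffer = []  # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def texts_by_class(html, css_class):
    """Return the text content of all elements with the given class."""
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    return parser.results
```

For example, `texts_by_class('<li class="price sale"><b>$8</b></li>', "price")` returns `["$8"]`. Inside a real page you would simply use the browser's own selector engine instead of a hand-rolled parser like this one.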


Espion is similar to PhantomJS in that they're both headless browsers that you can inject JavaScript into.

Espion comes with a lot more. First, there is the infrastructure: processing power, storage, connectivity and IP addresses that you don't have to provision, set up or manage. Second, Espion includes the features that surround extracting data from a site, such as job scheduling, data quality monitoring, online debugging and problem resolution, and data delivery.

PhantomJS is perfectly viable, but if you need the features I highlighted, you'll have to build a lot yourself on top of it to get the job done.


So, is the Espion headless browser based on an existing one? If this is the case, which one?


It's just PhantomJS wrapped up behind a cloud.


CAPTCHAs and IP blacklistings are things you encounter routinely in perfectly legitimate and legal web scraping projects. A typical example is a business that wants to monitor its competitors' prices or the second-hand market for its products.

We have plans to actively prevent the use of our platform for illegitimate purposes (fraud, spam, etc.).


> CAPTCHAs and IP blacklistings are things you encounter routinely in perfectly legitimate and legal web scraping projects

It's only legal if the site's TOS says it is. You don't get to decide whether you can scrape websites legally or not. People have been sued for scraping, trust me. And it's not even about fraud or spam.


It's more complex than that. Whether the site's TOS are enforceable is a matter of jurisdiction and may depend on the intent of the web scraper.

Ryanair, for example, has lost several cases where they tried to forbid scraping of their website, on the grounds that data scraping promoted free competition and served consumers in general. See the latest decision here: https://uk.finance.yahoo.com/news/ryanair-suffers-setback-ge....


The legality may also depend on the type of data being collected. For example, it is likely safer to scrape Yelp for public facts like business locations and phone numbers than for "copyrightable" content like customer reviews. Both, however, would violate Yelp's TOS. See: http://streetfightmag.com/2013/03/04/legal-battles-erupt-ove...


Does the location of Espion in Mauritius have anything to do with this? Are the servers located there? Good design of the site and good work.


Actually no, and I don't think our location would protect us from legal liability. Our servers are outside Mauritius; hosting locally would be very expensive and add latency.


Will it support Java, Flash and HTML5 audio and video?


Very interesting question. HTML5 audio and video are definitely a possibility, if there's a strong use case. If you have a specific idea, I'd be interested to hear it.

I don't expect Flash or Java support. Flash and Java apps that load their data from standard HTTP resources can be scraped regardless of support; the others need reverse engineering anyway and wouldn't fit with HTML scraping.


I use PhantomJS as a headless audio player by loading YouTube/SoundCloud playlists and letting it play, but there is no official Flash support in PhantomJS anymore. Sorry if you were expecting some novel idea, but it's just me trying to use a measuring tape to drive a screw.


Thanks, we'll definitely do that.


The packages are for a set number of pages, with no timeframe. I don't have prices yet for a load in the millions of pages per hour. What kind of system do you feed data into?


A very large array of MySQL databases. Basically, each month we fire up a fresh database and start streaming data into it. We're currently pulling around 700Gb a month. Our reporting tools/systems run queries across this array. It's actually not that bad speed-wise (reports on over 9000 keywords over a 1-week period for the top 100 positions on a per-hour basis).
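
For readers who want to picture the one-database-per-month layout, here is a minimal sketch. It is not the commenter's actual system: in-memory SQLite stands in for MySQL so the example is self-contained, and the `MonthlyShards` class, the `rankings` table and its columns are all assumptions made up for illustration. Writes go to the database for the record's month, and a report queries only the months its date range touches, then merges the results.

```python
import sqlite3
from datetime import datetime


class MonthlyShards:
    """One database per calendar month; reports fan out across the months they cover."""

    def __init__(self):
        self._dbs = {}  # "YYYY-MM" -> connection (hypothetical stand-in for a MySQL server)

    def _db_for(self, month_key):
        # Lazily "fire up" a fresh database the first time a month is seen.
        if month_key not in self._dbs:
            conn = sqlite3.connect(":memory:")
            conn.execute(
                "CREATE TABLE rankings (keyword TEXT, position INTEGER, seen_at TEXT)"
            )
            self._dbs[month_key] = conn
        return self._dbs[month_key]

    def insert(self, keyword, position, seen_at):
        # Route the row to the shard for the month it was observed in.
        conn = self._db_for(seen_at.strftime("%Y-%m"))
        conn.execute(
            "INSERT INTO rankings VALUES (?, ?, ?)",
            (keyword, position, seen_at.isoformat()),
        )

    def best_position(self, keyword, start, end):
        """Best (lowest) position seen for keyword between start and end, inclusive."""
        first, last = start.strftime("%Y-%m"), end.strftime("%Y-%m")
        best = None
        for month_key in sorted(self._dbs):
            if not (first <= month_key <= last):
                continue  # this shard cannot contain rows in the range
            row = self._dbs[month_key].execute(
                "SELECT MIN(position) FROM rankings "
                "WHERE keyword = ? AND seen_at BETWEEN ? AND ?",
                (keyword, start.isoformat(), end.isoformat()),
            ).fetchone()
            if row[0] is not None and (best is None or row[0] < best):
                best = row[0]
        return best
```

The point of the layout is that a one-week report touches at most two shards, however many months of history exist, which may be part of why the query times described above stay reasonable.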


We target businesses that want to fulfil their web data extraction needs in-house rather than hiring a third-party provider, which is what we originally are. Espion was built for our own needs at first and we’re still in the customer discovery phase. I expect we’ll find many use cases in the coming months.


The general sentiment on this thread pretty much sums up the idea of "scraping as a service". To me, there is definitely a legitimate business need to be able to scrape. Whether people realize it or not, companies have in-house teams that build custom scraping tools. The challenge for you is going to be filtering out the bad actors who may use your service to do things that you would not approve of.


    businesses that want to fulfil their web data extraction needs in-house
Well, since you offer a web scraping tool, that's obvious :) But who is the typical customer who wants to do that?


As someone who has provided scraping services for years, I can indeed confirm there are a lot of totally legit businesses that need the ability to compile various online data: from press clipping and social media trend monitoring, through various data-mining and analysis tasks, to price comparison and building dropship inventories. These days everyone talks about big data and data analysis, but you first need to collect that data.

