Ask HN: Has anyone else noticed Stack Overflow clones in Google search results?
71 points by rchrd2 on Aug 22, 2015 | 65 comments
Has anyone else noticed Stack Overflow clones in Google search results? They come up frequently for me. I can't help but wonder who's behind these. It can't be helping Stack Overflow's SEO.

So far I have saved 5 different domains, and it looks like 2 have vanished.

- [dead] http://www.codeitive.com/0izVUjjXVP/selective-foreign-key-usage-in-django-maybe-with-limitchoicesto-argument.html

- [dead] http://www.codedisqus.com/0QmqWVgjgg/hide-label-in-django-admin-fieldset-readonly-field.html

- http://w3facility.org/question/image-servingurl-and-google-storage-blobkey-not-working-on-development-server/

- http://goobbe.com/questions/3109325/how-can-i-disable-a-third-party-api-when-executing-django-unit-tests

- http://www.ciiycode.com/0HyN6eQxgjXP/django-admin-inline-popups

I actually wrote Stack Overflow support about this in April 2015, but so far nothing has changed. Here's the thread:

Me: "Hello, There a lots of spam results on Google. As a web developer, I am frequently googling for how to resolve some programming issue. Often, I get these spoofing sites that link to stack overflow.

I would suggest adding a stricter robots.txt or perhaps blocking some of these bots that are scraping your site.

Here is an example. http://www.ciiycode.com/0HyN6eQxgjXP/django-admin-inline-popups

Thank you."

Response: "Hello,

Thank you for reporting this content. I've passed the information along to the person at our company who handles such issues. It's the diligence of users like you that helps us stay valuable!

Please note, bringing these sites into compliance (or getting them to no longer serve our content) is often a long and arduous process. You may not see immediate results. However, rest assured that we're working on it.

Thank you again, Stack Exchange Team"

Thoughts?



Google doesn't have a good way to establish provenance, and has trouble distinguishing copies from originals. It's a common complaint of blog operators that some bigger blog copied their stuff and got a higher ranking on Google.

Google could check when it saw something, but that won't work against fast scrapers. For that, you need trusted timestamps.

One solution to this would be to have a few time-stamping services. You send in a string, probably a hash, and it adds a timestamp, signs it, and sends back a signed result. Then provide a WordPress plug-in to use this service, hashing and time-stamping each blog entry, and putting the result in the HTML in some standard way. (Perhaps <span signed-provenance-timestamp-hash="xxxxx"> blog entry </span>). A few mutually mistrustful services for that would help; blogs with serious forgery problems could use multiple time-stamping services.
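A minimal sketch of what the client side of such a service might look like, assuming a hypothetical signing endpoint (the hashing is standard library; the URL, field names, and HTML attribute are illustrative only):

    # Minimal sketch of the client side of a hypothetical time-stamping
    # service. The endpoint, field names, and HTML attribute are
    # illustrative, not a real API.
    import hashlib, json, urllib.request

    def timestamp_entry(entry_html):
        # Hash the entry locally so the service never sees the content itself.
        digest = hashlib.sha256(entry_html.encode("utf-8")).hexdigest()
        req = urllib.request.Request(
            "https://timestamp.example.com/sign",   # hypothetical service
            data=json.dumps({"hash": digest}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            signed = json.load(resp)  # e.g. {"hash": ..., "time": ..., "signature": ...}
        # Embed the signed result so crawlers can verify provenance later.
        return '<span signed-provenance-timestamp-hash="%s">%s</span>' % (
            signed["signature"], entry_html)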

Search engines then need to look at timestamps as a rating indicator. If two results are very similar, the earliest one wins.


Isn't part of the core of pagerank that certain domains are more trustworthy than others?

Given that this is supposedly true, wouldn't the same content on Stack Overflow outrank a random new domain with the same content?


Exactly. In the SEO world, the common wisdom is you will get dinged in ranking or even deindexed if you publish duplicate content. Period. There is an entire sub-industry in SEO that's all about content spinning. I'm always kinda amazed that these sites show up and I usually get them for about 50%+ of my tech searches.

What I think ultimately happens is Google is notified and these sites are eventually deindexed. But this still leads me to really question the whole duplicate content theory sometimes. If Google was so tight about this, these sites would never show up in the first place.

It could also be that the tech terms searched for are not as competitive as, say, medical or diet or celeb terms. Maybe Google's just grabbing at any kind of relevancy at this point, duplicate or not.


This is really surprising. Given that they have so much data, there should at least be several thousand domains that they treat with some special regard when content that replicates them shows up.

Then again, I'm sure that there are plenty of top 1000 domains that would use this to their advantage too, oh well.


PageRank works both at the domain and the page level, originally with the emphasis on the page level (hence PageRank, not DomainRank).

While it would be hard for these sites to match the cumulative reach of SO, they will have an easier time getting specific pages to rank highly. This can also be abused with a system where most pages link to the .1% of pages that should be emphasized. In this fashion, smaller sites can throw their weight around and get specific pages to rank more highly than bigger sites.

Trying to solve this issue on a purely technical level turns out to be a lot more complicated than it would seem. It is made much worse by how damaging false-positives are for a search engine (Google censored me!). So the result is that this often only gets resolved by manual actions rather than automatic penalties.


Actually it's pagerank for Larry Page


hence his name is not Larry Domain.


Actually it's not. (In case anyone might think you were being serious.)


Since establishing provenance is such a big problem for Google, perhaps it might be a good idea for Google to offer a time-stamping service itself?


Google offers this service already:

http://www.labnol.org/internet/fat-pings-for-content-scraper...

https://en.wikipedia.org/wiki/PubSubHubbub

https://pubsubhubbub.appspot.com/

Most of the big hosted publishing platforms like WordPress and Blogger already use it, but it's pretty common for sites that built their own codebase not to.
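For a site with its own codebase, the publish ping itself is tiny; a rough sketch against the hub linked above (hub.mode and hub.url are the standard parameters; the feed URL is a placeholder):

    # Rough sketch of a PubSubHubbub publish ping for a self-hosted blog.
    import urllib.parse, urllib.request

    def ping_hub(feed_url, hub="https://pubsubhubbub.appspot.com/"):
        data = urllib.parse.urlencode({
            "hub.mode": "publish",
            "hub.url": feed_url,   # the feed that just got a new entry
        }).encode("utf-8")
        # The hub acknowledges the notification with a 2xx status.
        return urllib.request.urlopen(hub, data=data).status

    # e.g. call ping_hub("https://example.com/feed.xml") right after publishing a post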

(This was one of my interests at Google, and I had both a 20% project [unreleased] and a real project [Google Authorship] that were devoted to algorithmically detecting copied content and providing reliable attributions. Ultimately though, FatPing/PubSubHubbub [done by another team] was a much more robust way of doing this, with the downside that it's on the webmaster to implement it.)


That's not a timestamping service; that's just a way to quickly notify Google when you add a page to your blog.

What if your blog doesn't use Pubsubhubbub, but mine does, and I copy your blog post? Google will see my post first, but that doesn't prove I wrote it, so Google cannot use that to make ranking decisions.


I don't see how pubsubhubbub differs from the originally proposed timestamping service. For any timestamping service, it will still be the case that if you don't use it, and somebody else scrapes you and does use it, their timestamp will be better than yours.


There can be more than one timestamping service.


This is not an advantage.

Imagine a scenario where a rogue timestamping service performs just well enough to get a significant userbase. They then alter their service so that instead of simply returning the timestamp, they put the content up on the Internet themselves, but assign themselves an earlier signed timestamp before returning a timestamp to the original requestor. They would instantly replace all of their customers in Google results, because Google would think that they were the originators of the content. They'd then be free to do all sorts of mischief with that influx of traffic: show ads, install malware, phish addresses, etc.


Thanks for sharing this. I stand corrected (by a Googler, no less!). I wish Google pushed for greater adoption, then, as it does with Page Speed, SEO advice, etc.


Exactly! It seems to me Google could easily provide a proof-of-authorship API, especially for text. (You could publish a hash in some feed in case you don't fully trust Google in turn.) I'm not a big fan of conspiracies and such, but by now I'm pretty sure Google has some conflict of interest given that it hasn't offered such a service already.


The problem, as others have pointed out, is that no matter how you design the system (proof-of-authorship, timestamping service, even with Google as a completely trusted party), it can only work if every original content creator (or platform, like SO) gets on this train.

Say you're a content creator who maybe doesn't want to use this service, simply hasn't heard about it yet, or accidentally publishes a whole archive of original articles without signing them with the proof-of-authorship/timestamping API just before they get scraped by an ill-intentioned content farmer, who quickly uses the API to sign/stamp the articles as their own before the content creator can. The farmer doesn't even need to publish them right away; they can drip them out over years and have the proof-of-authorship signature to "prove" their authorship.

If we actually trusted this system, the real original content creator would be shit out of luck. There's no way they can prove their authorship in a way that distinguishes them from a scraper; that part remains the same with or without this system. What would be new is that the scraping non-author now has some sort of extra claim of authenticity over the actual original content creator.

I see no way around this. Unless you devise a way so that all content written (anywhere, any time, on any medium) is immediately signed with a proof-of-authorship, such a system will be giving power to bad actors to claim authorship of any content that happens to slip through without being signed.


It's not directly a problem for Google — it's primarily a problem for sites that create original content.


Not quite. Google's quality as a search engine could improve dramatically by letting true authors tell Google the content is coming before it's available anywhere else on the web.


Well, I think Google would prefer to send the traffic to Stack Overflow instead of its clones, as it's the better source?


It's a problem that would disappear if content creators were compensated regardless of serving point.

We don't have that. We could.

Universal content syndication + broadband tax.

https://www.reddit.com/r/dredmorbius/comments/1uotb3/a_modes...


The problem is how one prevents the bad-actors from taking advantage of such a service.


Define "bad actor".

That's not a "they aren't", but a request for "what do you consider to be acting badly?"

Conversely, is content scraping-and-publication possible by good actors?

How?


Timestamping sounds like a good idea. We can easily do it now in a decentralized way without depending on any 3rd party service e.g. https://github.com/chainpoint/chainpoint


Then if the original didn't use timestamping, and the clones did, they'd always win over the original.


Stack Exchange makes all the content available as a downloadable file, so they must be expecting clones.

https://archive.org/details/stackexchange

They then have a bunch of licensing terms if you reuse the content, which must be what they're referring to by "bringing these sites into compliance is often a long and arduous process".


They even have a tool/site to query the data: https://data.stackexchange.com/stackoverflow/query/new


They've been around for years. A while back a few were sometimes beating SO in Google results, but Google eventually fixed that. Everything on SO is CC BY-SA licensed, so as long as the source is attributed (and it is on some of the sites), it's legal. The motivation is simple: the clone spam sites have ads on them. They're not actively trying to attack SO; they're just leeching value.


I guess it's a win-win situation. Stack Overflow's ranking goes up, because of all the sites referencing it, and the spammers make money.

But as a consequence, the users have to sift through spam results.


I don't think so. Those sites redisplay SO content, but they're not linking to SO. They're actually trying to out-rank them, up their own traffic, and drive up their ad revenue.


Sometimes they link back, sometimes they don't.


See "A site (or scraper) is copying content from Stack Exchange. What do I do?"

http://meta.stackexchange.com/q/200177/234215


Sort of related, but in the last year it seems to me that Google's results in general have been getting steadily worse. Now it seems that at least two or three spam results are always present on the first page of results. Most of these results also happen to be duplicate content from bigger sites, though. I have reported this to Google on numerous occasions, but I just get the usual "we will investigate" response and never hear another word back.


In the cases where the information you're looking for doesn't need to be recent (in my case exercises for a particular sport) you can change the date range of your search to e.g. only include results from prior to 2005. Obviously this isn't an option for recent information, but I've found it useful on occasion to filter out blogspam.


This is common across numerous contexts. I've found what appear to be bots Tweeting my reddit and HN posts (I don't mind), several Reddit clones of various levels of sniffitude, Google+ content harvesting, and some Diaspora syndication (to be expected), again, of various levels of sniffitude.

To the extent that this simply distributes data around, doesn't claim it for its own, and credits source, I'm OK with this. Better even if it follows site-specific licensing. Among my visions for the Web would be content syndication where such schemes would actually directly benefit authors and creators, regardless of where their content is served.


I don't understand why Google can't figure this out and remove these clones. They can do much harder things. Why couldn't they demote sites based on identical text content or, much better, a huge presence of ads?
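Detecting near-duplicate text is the comparatively easy part; a toy sketch of the textbook shingling approach (nothing Google-specific) is below. The hard part, as the provenance discussion above suggests, is deciding which copy is the original.

    # Toy sketch of near-duplicate detection via word shingles and
    # Jaccard similarity: the textbook approach, nothing Google-specific.
    def shingles(text, k=5):
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a, b):
        sa, sb = shingles(a), shingles(b)
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

    # A score near 1.0 means one page is essentially a copy of the other;
    # deciding which copy came first is the unsolved part.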


I suppose it's mostly a social/legal problem. Google could get rid of 90% of those scrapers by basically hardcoding a preference for Stack Overflow when it has the same content as another site. Same with sites like Wikipedia. But then obviously people would cry foul (probably scrapers and SEO people would be the loudest to complain). So whatever solution Google builds has to be general enough that people won't call it unfair, and that makes the problem much more difficult.


They have made inroads in the past, but lately copycats have been cropping up in results (iswwwup.com is one I've been seeing a lot) again. I imagine there's a ranking algo update in the future that might fine-tune this more. To be sure, it's going to be an arms race, since the only purpose of these sites is adsense siphoning.


Scraper spam exists for virtually every website that is even somewhat popular, there is almost nothing site owners can do about it but rely on Google, Bing, DuckDuckGo, etc, to figure out the originator. Google should be able to figure it out and rank the true origin source appropriately, and usually Google is pretty effective at that. But sometimes they're not.

Perhaps what I find ironic is that many of the user "answers" on StackOverflow are basically clone spam themselves, copy/pasted from other websites by some user of the site, usually without sourcing the origin. I have personally found my own unique solutions and code copied verbatim and pasted to answers on the StackExchange network multiple times, outranking my original work without a reference to the origination of course, and I'm sure others have experienced something similar. Perhaps that's what you get with a user generated site, maybe Wikipedia experiences something similar.

Related, some of you may recall a few years back, that StackOverflow basically complained to Google in a public fashion about not ranking well enough and got a boost from them, whereas obviously the average Joe and an average website has no such option nor recourse. Here was the discussion on HN: https://news.ycombinator.com/item?id=2152286


When I look at w3facility.org, it seems like Google's algorithms do not properly handle the case where a scraper site provides a source link to the original content.

Google recommends such source links to protect against duplicate content penalties when you use external content to enrich your site (for example a short section from Wikipedia, IMDb actor info, etc.).
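For reference, the usual mechanisms are a visible link back to the source and, for pages that really are copies, a cross-domain canonical tag along these lines (the target URL is a placeholder):

    <link rel="canonical" href="https://stackoverflow.com/questions/ORIGINAL-QUESTION">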


I haven't seen this yet, and I do searches pretty frequently that result in StackOverflow.com hits... and I make a point of choosing them first, because they are usually the best results.

However, a few days ago, I did see a single hit for expertsexchange.com for the first time in years, at least that I noticed. Back when Google used to let you blacklist domains from your search results, before Stack Exchange was ever around, or when it was still very new, I used to block them, and eventually I assumed they'd gone away, but maybe not.

I hope Stack Overflow and/or Google are able to do something to put the kibosh on this kind of thing, because despite the complaints SO is still a huge and valuable resource.


> I would suggest adding a stricter robots.txt or perhaps blocking some of these bots that are scraping your site.

The only people who care about robots.txt are some of the big companies. Even Baidu ignores it (as they can; it's purely there as etiquette).
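For reference, robots.txt is only a request; a scraper has to choose to honor a rule like the one below (the bot name is made up):

    User-agent: SomeScraperBot
    Disallow: /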

Blocking bots is hard.


I see this happen a lot also with the MSDN forums. The interesting thing there is that some of the mirror sites are still carrying topics that have been deleted or otherwise disappeared from the real MSDN forums. More than once, in the obscure subset of Microsoft tooling that I work in, the only hits that are still alive are on somewhat suspicious looking .ru sites, so in some sense, I am glad these sites do exist - otherwise, I'd be completely SOL trying to figure out why the badly-documented API I'm relying on is barfing up an opaque HRESULT.


It's not just Stack Exchange. I noticed the other day there's a Twitter account and website called "@explodingAds" that tweets HN user comments and mirrors them on its website. I imagine taking content verbatim is just a quick way to build a corpus of search-indexed pages that generate page views and ad impressions.


Have been noticing this for the last few months too, very weird.


I don't understand why Stack Overflow allows this. Yeah, Creative Commons is cool and all, but it IS NOT COOL for actual users. I get so annoyed every time I search for something on Google and it leads to one of its clone sites. It's not like Stack Overflow has better search than Google (which is ridiculous); I still have to search on Google if I want quality results instead of searching Stack Overflow itself. With more power comes more responsibility, and Stack Overflow's policy feels too irresponsible in this regard.


We use that license because it protects the content from us. No matter who comes along to run Stack Overflow in the future, Stack Overflow can't do something like put up a paywall and lock it up. Someone else'll just be able to host a copy.


Yes, look at what happened to imdb.


Thank you for doing this, by the way.

It'd be nice to figure out the whole Google link ranking thing, but I don't think switching away from Creative Commons is the way.


Well, that was a bad decision with regards to UX because now I have to dig through 20 look-alike sites that don't link back to SO and you're not doing shit about it. Yes, SO comes up first when I search for the exact question but I'm sure you're aware how useless that is. If I knew the exact question to ask, I probably wouldn't be searching around Google.

My time is the most valuable thing to me, and with this decision you've essentially cost me a lot of time, and I already paid my way by adding content to your site. I would have much rather seen you enter a simple legal agreement where the content was placed in escrow.


As an end user, I don't care about that at all. If StackOverflow does something like put up a paywall and lock it up, then it will die off and some other site will arise that will replace it, just like some other sites that came before StackOverflow which put up a paywall and faded away. Also, StackOverflow can always change the policy if they want (which probably won't happen for the reason I mentioned), so the license as an excuse doesn't really make sense to me. Especially when it comes at a cost of horrible user experience. Lastly it doesn't seem like StackOverflow is doing much to improve search on the site itself and that's what makes this even worse. I wouldn't be complaining if I could find more relevant StackOverflow results on StackOverflow than searching on Google. How is it that I can find more relevant results on a generic search engine than the site where the contents came from?


A new site may come to replace it, but what of the 10MM+ questions and scads more answers created by the Stack Overflow community over the last 7 years? Without the CC license, all that hard work would be lost if SO went rogue.


To be fair, the useful life of many of those answers is likely only a few years at best. Though I generally agree: locking up contributed content is exceptionally poor Internet citizenship (Quora, Scribd).


Basically, I'm not willing to supply useful data to a site which then profits off of it without also making it open.

So if they weren't open then they wouldn't be getting my answers (or a bunch of other people's).


Before Stack Overflow, there existed sites like Experts Exchange. They are irrelevant now despite the fact that they had tons of content, whereas Stack Overflow started with zero. People ask and answer questions on Stack Overflow not because that's where the most content is, but because that's where most people are. I think the license is just a manifestation of the company's philosophy, and personally I don't think I would care much even if Stack Overflow changed their license tomorrow BUT kept their philosophy and policy intact. (Which they should, since they know better than to alienate their users, and they've seen what happened to all the sites that did.)


The license means that even if SO changes its policy and puts up a paywall, a clone can come along and use the knowledge already scraped from SO.

This means that even if SO puts up a paywall, the knowledge gathered there is not lost. And this, IMO, is a very, VERY important aspect of Stack Overflow.

As far as search goes, I didn't even know SO had a search function. I see nothing wrong with letting google handle it. Adding a search engine to your website is hard and takes time.


Wow, I'm getting downvoted like crazy. I don't think I said anything that's not factual. At least explain why you think I am wrong if you're going to downvote. To be clear, I love Stack Overflow and I don't know what the world would have been like if it weren't around, but I do think there are things that are broken and I just mentioned them. Am I supposed to keep quiet because that's how it's been?


> If StackOverflow does something like put up a paywall and lock it up, then it will die off and some other site will arise that will replace it

As others have already said, that's possible thanks to their liberal license, not despite it, so you're wrong, hence the downvotes (though I haven't voted either way).


As a thought experiment, do you think if StackOverflow changed their license tomorrow (hypothetically because of the spam problem) but assured all its users that they will never mess with them, all users will leave simply because of the license?


"Please resist commenting about being downvoted. It never does any good, and it makes boring reading."

https://news.ycombinator.com/newsguidelines.html

(NB: I didn't, and tend to agree it's excessive here.)


I once contributed to a localized version of SO whose licence was not open. The startup backing it eventually shut down and the site disappeared. All the time invested in those answers is now irrevocably lost.

Non-open content licence is exactly what made me stop contributing to sites such as Wikimapia and start contributing to OpenStreetMap instead.

Once you get burned a few times, you learn it's better to have to put up with a few clone sites than risk losing all the data, or having to pay to read what you wrote yourself a few months ago.


Thanks for the answer. I just want to ask one more question. Don't you think Stack Overflow is already too large to disappear suddenly like those startups you mentioned, and smart enough to NOT go the way of ExpertsExchange? (I think the founder explicitly mentioned Stack Overflow wanted to be an anti-ExpertsExchange when they launched. Also, I don't think the company would be stupid enough to alienate their users, which would lead to their demise regardless of how much content they already have.)


Just append 'site:stackoverflow.com' to your query text.
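For example: django admin inline popups site:stackoverflow.com will only return results from stackoverflow.com, which keeps the clones out of that particular search.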



There are also a lot of Trello clones.



