Feb 10, 2013
 

This isn’t the first time I’ve written about scrapers, and I’m sure most of you have heard about this happening. I’m going to do my best to help you figure out what to do and how to fix it the next time you see your fellow bloggers tweeting about yet another site that has stolen blogger content.

Usually what happens is that an unscrupulous site will use a “scraper” program which copies the content of your RSS feed word for word, link for link, and automatically posts it on their site. Sometimes they will not do anything at all to your post; it goes up on their site, links, photos and all. The (slight) upside to this is that if anybody is reading the scraper site, they’ll click on the links and eventually get back to your site. But that’s really not enough of an upside. They are using your post as free content to pad their site for SEO purposes, which will in turn net them more advertising.

Sometimes they will put your post up and the post title will link directly back to your page, not to the post on their site. These assholes believe that that is “attribution” and they’re in the clear. No. I recently had to deal with such a jackhole who is still following my blog. Despite my comments on his site to remove my copyrighted shit, and his eloquent email that I quoted on Twitter, and despite me reporting him to HostGator and his posts being removed, he still is trying to add me to places like GooglePlus. FYI: RSS feeds do not at all give someone like him permission to use your content. Their blog/site is not a feed reader; a feed reader is the only thing allowed to publish an RSS feed like he had done. They will also try to call this “re-blogging” and it is not. See: Ethical Blogging Practices

Sometimes they will remove your photos (or if there were none, add their own) and replace them with porn-y pics. Sometimes they’ll take it a step further and replace any links in your post with links that they choose, or they’ll add in extra links for keyword farming. This is what ScandalShack.com did to Mina and many other bloggers back in 2011.

ZOMG But It’s Duplicate Content and Google Will HATE Me!

When they talk about “duplicate content” they’re usually referring to it happening from within your own site. Like you search for a review on the Lelo Mona and it shows up on Google once due to it being a recent post and the title is in your sidebar, a second time under the category “Reviews”, a third time under the tag “vibrators”, a fourth time under the tag “Lelo” and so on. But when it comes to “duplicate content” due to being scraped, 9 times out of 10, Google knows that your post showed up first and is the real post. You won’t be penalized for it.

Says Google:
Before diving in, I’d like to briefly touch on a concern webmasters often voice: in most cases a webmaster has no influence on third parties that scrape and redistribute content without the webmaster’s consent. We realize that this is not the fault of the affected webmaster, which in turn means that identical content showing up on several sites in itself is not inherently regarded as a violation of our webmaster guidelines. This simply leads to further processes with the intent of determining the original source of the content—something Google is quite good at, as in most cases the original content can be correctly identified, resulting in no negative effects for the site that originated the content.

But that’s not why I care – I worked hard on my damn content and I don’t allow others to use it and indirectly profit from it or claim it as their own. I own the copyright. Even if I didn’t have copyright notices out the yingyang here, it’s an unspoken thing, this whole “blog” copyright business. I created the content, I own it. Just like anything on the internet. Creator Owns All.

The Hostess With the Mostest

The entity that will be following the DMCA is the host of the site, not the domain registrar. Sometimes, though, figuring out who is hosting it isn’t that easy if you don’t know what you’re doing. The tried-and-true method is to use a site called who.is. But what happens? I’m going to use the site that most recently scraped me, the one I mentioned above, where I stupidly tried to engage with the site owner (it never, ever works…trust me).

What you'll see when you do a Who.is on a domain

So who.is talks about a lot of stuff there, and what do you see first? GoDaddy. Nope, that’s not the host. That’s the registrar – who they bought the domain from. Many places don’t use the same company for both hosting and domain registration. The word “host” is never used here, but it’s hiding down there in the “nameserver”: HostGator. OK, that’s easy, they’re a major hosting company. Basically, whatever it says in the nameserver field, just type that in as a site address and it will usually take you to a hosting company.
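If you'd rather not squint at the who.is page, here is a minimal Python sketch of the same trick: pull the registrar and the nameservers out of the raw WHOIS text and turn each nameserver into a site address to try. The sample record, the domain in it and the function names are all made up for illustration, and real WHOIS output varies a lot between registrars, so treat this as a rough starting point.

```python
# Hypothetical WHOIS output; real records vary by registry and registrar.
SAMPLE_WHOIS = """\
Domain Name: EXAMPLE-SCRAPER.COM
Registrar: GoDaddy.com, LLC
Name Server: NS1234.HOSTGATOR.COM
Name Server: NS1235.HOSTGATOR.COM
"""


def parse_whois(text):
    """Pull the registrar and the nameservers out of raw WHOIS text."""
    registrar = None
    nameservers = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        key = key.strip().lower()
        value = value.strip()
        if key == "registrar" and registrar is None:
            registrar = value
        elif key in ("name server", "nameserver") and value:
            nameservers.append(value.lower())
    return {"registrar": registrar, "nameservers": nameservers}


def nameserver_site(nameserver):
    """Naive guess: keep the last two labels (ns1234.hostgator.com -> hostgator.com).

    This is an assumption that breaks for suffixes like .co.uk.
    """
    labels = nameserver.rstrip(".").split(".")
    return ".".join(labels[-2:])


record = parse_whois(SAMPLE_WHOIS)
print("Registrar (not the host):", record["registrar"])
print("Sites to try:", sorted({nameserver_site(ns) for ns in record["nameservers"]}))
```

The registrar is printed only so you know who not to bother first; the nameserver-derived address is the one to type into your browser.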

What you'll see when you use Whoishostingthis.com

But in searching for a better way to locate a host, I found another site: Whoishostingthis.com. Supposedly this site will tell you exactly who is hosting the site, in plain English. Except…maybe not. For the site above, it claims WebsiteWelcome is the host. Typing that in as a site comes up with a text-only page that tells you to email abuse@websitewelcome for any copyright complaints. Weird, right? So I did a little Google-fu and found that WebsiteWelcome is indeed related to HostGator. They are a private reseller label or something. But I had already contacted HostGator and they responded appropriately, meaning they are the host. If a company is not the host, they will respond and tell you that they’re not. Half the time they’ll tell you who IS.

Let’s try another. Don’t ask me why but as I sat there trying to think up a random, porn-y site address the first thing that popped into my head was midgetporn. So that’s what I went with. Who.is says that the nameserver is he.net. Typing that in takes me to a site that appears to maybe be a little out of date, Hurricane Electric hosting. They don’t have anything obvious up for copyright claims/DMCA takedowns; it takes a lot of digging. They don’t list a contact for that in their contacts list; I had to go locate their Terms of Service under the Legal page to locate their copyright claims email.

But what if I had gone to Whoishostingthis.com? Hmm. They tell me that the host (likely a reseller) is “V Entertainment”. Just like above with the WebsiteWelcome company, typing in ventertainment.com gives me not much – but it does give a contact form for “issues with any of our member sites”.

Hosting Reseller: The problem with using Whoishostingthis.com is that they’re listing the reseller. Many times the reseller IS the site owner, or is just as shady as the site owner. You need to go to the nameserver for maximum effect.

Private Nameservers: You might come across a private nameserver, which would look like ns1.midgetporn.com. A realistic case: I looked up another popular type of spammy site, the work from home arena. Literally, I who.is’d workfromhome.com. Bingo! Their nameserver? name-server.com. Go there and you’ll see a basic holding page which just contains more spammy advertising links to related things. So what about the who.is on name-server.com? It’s more of a circlejerk, but you’ll see the same registrar as the workfromhome – ENOM. Given all that, I would start with the registrar if workfromhome.com was scraping or stealing my content. I would hope that they could point me in the right direction.

I Have No Fucking Idea Nameservers: Twice I’ve dealt with sites where the nameserver wasn’t easy to pin to a host. Once it was Moniker Services for the registrar but monikerdns.net for the NS and I don’t even know how I found their host. I’m sorry. I’m hoping someone else will be able to shed light in comments.
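The three cases above boil down to a simple rule of thumb, which can be sketched in Python. This is only my heuristic written out, with made-up names (`where_to_start`, `base_domain`) and placeholder domains: if every nameserver lives under the scraper's own domain, it's a private nameserver and the registrar is the place to start; otherwise, go look at whoever runs the nameserver. It won't catch a private nameserver hiding under a different domain, like the name-server.com case.

```python
def base_domain(hostname):
    """Naive: keep the last two labels. An assumption that breaks for .co.uk and friends."""
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def where_to_start(site, nameservers, registrar):
    """Return a rough suggestion of who to contact first."""
    site_base = base_domain(site)
    ns_bases = sorted({base_domain(ns) for ns in nameservers})
    if not ns_bases:
        return f"No nameservers listed; start with the registrar ({registrar})."
    if all(ns == site_base for ns in ns_bases):
        # e.g. ns1.example.com serving example.com
        return f"Private nameserver; start with the registrar ({registrar})."
    outside = [ns for ns in ns_bases if ns != site_base]
    return "Look up the host behind: " + ", ".join(outside)


# Placeholder domains and registrar name, for illustration only.
print(where_to_start("example.com", ["ns1.example.com", "ns2.example.com"], "Example Registrar"))
print(where_to_start("example.com", ["NS1.HE.NET"], "Example Registrar"))
```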

NOW WHAT?

Ironically, you don’t want to push the host to take down their whole site. Why? As a rep from a hosting company once told me, if they take down the site, the site could potentially be back up online in as little as 10 minutes with the person going to an “unscrupulous” “Russian or Chinese” host. And then, apparently, you’re screwed? But if they just take down the page(s) in question, eventually the site owner will stop targeting you, usually fairly quickly.

Also, you can’t report content theft unless you are the owner of the content being stolen. So if you find something of Violet Blue’s, you can’t tell the host to remove it. You don’t own the original, she does. They only want to hear from you.

Many places will have a form online for you to fill out. Some have nothing but an email address. In that case, fill out a standard DMCA form letter and send it to them. With HostGator, I had to fax them. Who faxes in this century?? Apparently HG does. I wasn’t about to trot off to Staples so I found one of those free, online fax services that will send it for free if you agree to embed advertising. You’re not the one receiving the fax so it doesn’t matter. HostGator sent me a canned response within minutes of receiving the fax. When the requisite 48 hours for the site owner to Do The Right Thing have passed and they have not, in fact, done the right thing, HostGator emails you to tell you that they’ve forcibly removed the content and you’re done. If your content is on a Blogspot blog, that’s the easiest DMCA you’ll ever do, since there is a link in the nav bar above all Blogspot blogs that allows you to report content theft/spam/etc.
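If you end up sending more than one of these, a fill-in-the-blanks template saves time. Below is a minimal Python sketch that builds a notice from the pieces a DMCA takedown generally needs: what the original work is, where the infringing copy lives, a good-faith statement, an accuracy statement under penalty of perjury, and your signature and contact details. The wording is my own illustration and not an official form, and every name and URL in it is a placeholder; check what your particular host asks for.

```python
from datetime import date

# Illustrative wording only; hosts may require extra details such as a
# postal address or phone number.
NOTICE_TEMPLATE = """\
Date: {today}

To the designated DMCA agent at {host_name}:

I am the copyright owner of the work listed below, which is being
reproduced without my permission on a site that you host.

Original work (my site): {original_url}
Infringing copy (your customer's site): {infringing_url}

I have a good faith belief that the use of this material is not
authorized by the copyright owner, its agent, or the law.

The information in this notice is accurate and, under penalty of
perjury, I am the owner of the copyright that is being infringed.

Please remove or disable access to the infringing page listed above.

Signed: {full_name}
Email: {email}
"""


def build_notice(host_name, original_url, infringing_url, full_name, email, today=None):
    """Fill in the template; `today` defaults to the current date."""
    return NOTICE_TEMPLATE.format(
        today=today or date.today().isoformat(),
        host_name=host_name,
        original_url=original_url,
        infringing_url=infringing_url,
        full_name=full_name,
        email=email,
    )


# Placeholder values for illustration.
print(build_notice(
    host_name="Example Host",
    original_url="https://example.com/my-post",
    infringing_url="https://scraper.example/stolen-post",
    full_name="Jane Blogger",
    email="jane@example.com",
))
```

List each stolen page as its own original/infringing pair; since the goal is to get specific pages removed rather than the whole site, exact URLs matter.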

Below is a list of some hosting companies and how to contact them, borrowed from PlagiarismToday.com. The post containing links to various sites and hosts is horribly outdated, written years ago, and is missing a few hosts (like HostGator) but there are so many hosting companies that they cannot all be listed. I’ll list whichever ones anybody comments with and update this part.

Blog Networks

Blogger/Blogspot = Google
DeadJournal (see last item) (email)
LiveJournal (email)
Typepad (email)
WordPress.com (email)
Yahoo! 360 (email)

Domain Hosts

BlueHost (See: Abuse department) (email)
DirectNIC (See: 20.s) (email)
Dreamhost (email)
Enom.com (email)
GoDaddy (email)
HostGator
iPowerWeb (email)
MediaTemple (email)
Midphase (email)
Network Solutions (See: Copyright Complaints) (email)
Rackspace (See: Copyright Infringement Notice) (email)
Register.com (email)
Surpass Hosting (mail)
Westhost (email)
WildWestDomains (email)
Verio (email)
XO (email)
Yahoo Web Hosting (email)
YellowFiber (email)

How To Stop a Predator

You can’t prevent RSS scraping. There used to be a WordPress plugin called nomoreframe, but it works no more. The bots found a different way. So basically you just need to add in things to your RSS that mention copyright, link back to your blog, etc. These things, though, will only help you out if they are scraping your RSS feed. If they are taking the long way around which involves copying your text content and replacing links with ads and adding in porn photos, then there likely isn’t a whole lot you can do to prevent it. You can only hope that they leave in at least one link.

Why? If you have enabled ping/trackbacks on your posts then you will get notified by WordPress or Blogger when something links to you. For a while there I was turning off pingbacks because of things like Pleasurists and e[lust]; I don’t like to see those things clogging up the comments section. I suspect some people leave them as a way to show that their post was well-liked, a vanity thing, but as a reader and blog owner I find they just add visual clutter. So I have the trackbacks on again but I don’t ever publish them. If it weren’t for the trackback I wouldn’t have known that the illustrious B T Phillips was stealing my content.

©Feed: “Extends the feed! A report of copyright, a digital fingerprint and the IP of the feed reader can be added. In addition, some search engines are scanned for the digital fingerprint in order to find possible content theft. The feed can be also be supplemented with comments and topic-relevant contributions.” This is the primary plugin that I recommend. You can add links back to your page, a copyright notice, and the digital fingerprint will help you find sources of scraping (but it will also show allowed sources, like feed readers).
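To give a feel for what "extending the feed" means, here is a minimal Python sketch that appends a copyright line, a link back to the original post and a short fingerprint to every item in an RSS feed. It is not how ©Feed itself works (I haven't seen its code); the sample feed, the `add_footers` and `fingerprint` names, the owner name and the secret are all placeholders I made up.

```python
import hashlib
import xml.etree.ElementTree as ET

# Tiny hypothetical RSS 2.0 feed used only for this example.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>My Blog</title>
<link>https://example.com/</link>
<item>
<title>First post</title>
<link>https://example.com/first-post</link>
<description>Post body goes here.</description>
</item>
</channel></rss>"""


def fingerprint(link, secret):
    """Short token that should only ever show up in copies of this feed item."""
    return hashlib.sha256((secret + link).encode("utf-8")).hexdigest()[:16]


def add_footers(feed_xml, owner, secret):
    """Append a copyright notice, a link back and a fingerprint to each item."""
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        link = item.findtext("link", default="")
        desc = item.find("description")
        if desc is None:
            desc = ET.SubElement(item, "description")
        footer = (
            f" Copyright {owner}. Originally published at {link}"
            f" [{fingerprint(link, secret)}]"
        )
        desc.text = (desc.text or "") + footer
    return ET.tostring(root, encoding="unicode")


print(add_footers(SAMPLE_FEED, owner="Jane Blogger", secret="change-me"))
```

Because the fingerprint is an odd-looking string, searching for it later turns up any page that republished the feed item word for word, legitimate feed readers included.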

 

If you have dealt with a host that isn’t listed, please comment and let us know. I’ll add it in. If you use any other methods for prevention, control or hunting people down, tell us your best methods.

  • http://heyepiphora.com Epiphora

    Very good post! I agree that pingbacks must be enabled to ever find out about this kind of shit.

  • http://www.xeromag.com Franklin Veaux

    Interesting and timely post, as I just this evening sent a DMCA takedown notice to GoDaddy, who are hosting a site that lifted several pages from my site verbatim.

    One tool to help find a Web host is spamcop.net — you need to register for a free account, and then just paste the offending URL into the box and it’ll identify the host.

    Spamcop identifies web hosting companies by the IP address of the web site, rather than from name server entries. Sometimes, it will identify the hosting company above the reseller account level. For example, I am hosted by hostgator/websitewelcome, but it identifies my host as theplanet.com because Hostgator leases their ip range from The Planet. Complain to The Planet and it’ll get transferred to Hostgator (who, by the way, respond to complaints quickly).

  • http://www.screaming-violet.com Violet

    I can’t thank you enough for listing the place to go for blogspot hosted sites with stolen content. I just flapped about for an hour googling everything I could think of and failing.

    And just in case my tweet malfunctioned this site is scraping you also http://www.blogkeen.com/view_blog.aspx?id=52130541