|Website categorized as pure spam||Susanna Siebert||10/28/13 1:27 PM|
I've read the FAQs and searched the help center.
My URL is: microbialgenomics.org
We just recently released microbialgenomics.org and I tried to get the website indexed with Google. Today I saw that it has been categorized as pure spam.
We created microbialgenomics.org as a website for our group to highlight our research work. We wanted it to be a comprehensive site where people can see the projects we work on, the publications we have published, and the microbial organisms we have sequenced. We wanted all of this content to be connected so that people can, for example, see what publications we have for a given organism or what organism is studied in a given project.
In regard to the pure spam categorization, I have read the guidelines, searched the internet and talked to a few people. We have some idea of what might have triggered the pure spam designation.
1) We copied some content from Pubmed for the publications. We wanted to list on the website all the publications that people in our group have been involved in. It's a substantial number of publications. For each publication we show the abstract and link to Pubmed for more information. Yes, we're duplicating content (the abstract) but the abstract is the abstract and can't be changed. Plus, we (the authors) originally wrote the abstracts so, technically, Pubmed is copying from us. We are adding additional value to the pages by linking each publication to the biography of the authors on our page, as well as the related projects and organisms.
2) The project landing page (http://microbialgenomics.org/projects/) lists the diseases that we research. A colleague pointed out that this might get interpreted as spam. In addition, the pages that each disease links to could be categorized as shallow content, as each one is basically a list of the projects that fit this disease (e.g., http://microbialgenomics.org/disease/acne/ lists the two projects that research acne).
3) Maybe our site got hacked. The site isn't showing any signs of hacking. I compared the files with a backup we made before it went live and I didn't see anything amiss. The access logs also don't show any suspicious activities. However, I'm not an expert in that matter so it's entirely possible that I'm missing something.
These are our guesses. I would greatly appreciate it if I could get some pointers from people that are more experienced with Google's indexing mechanism as to what might have prompted Google to characterize this page as spam and what I can do to fix this issue. I've updated the robots.txt file and sitemap to exclude the publication pages from being indexed in the hopes of that helping with the first issue. I'm not sure if that would help in the website not being categorized as spam.
This is a legitimate site. We're not trying to sell anything. There are no ads, referral links or similar on these pages. We're simply trying to present our research in a comprehensive manner.
|Re: Website categorized as pure spam||ets||10/28/13 1:36 PM|
Did you buy a domain with a bad history? How long has it been operational? I see it registered on these dates:
However, if I look at the Wayback Machine, I see these previous incarnations of the domain:
If you're certain you've done nothing that would merit a "pure spam" designation, submit a reconsideration request and explain to Google about your new ownership. Explain fully who you are and detail your academic affiliation with "Washington University in St. Louis" so they get the full picture.
"In addition, if you recently purchased a domain that you think may have violated our guidelines before you owned it, you can use the reconsideration request form to let us know that you recently acquired the site and that it now adheres to the guidelines."
|Re: Website categorized as pure spam||Susanna Siebert||10/28/13 1:59 PM|
Thank you so much. I was racking my brain trying to come up with a reason that our website was listed as spam. This must be it. I will put in a reconsideration request.
|Re: Website categorized as pure spam||ets||10/28/13 2:02 PM|
It should take no more than a couple of weeks, maybe less. If they decline it, please come back and we'll take another look for you.
|Re: Website categorized as pure spam||Redleg x3||10/28/13 2:17 PM|
hey Susanna, not a big deal but there are some URLs floating around like
www . microbialgenomics . org /tag/welsh-corgi
www . microbialgenomics . org /tag/satin-bedding
right now if I request any of those old URLs they first 301 redirect to the non-www version of the URL and then return a 404. Generally Google would prefer just to see the 404 without the redirect for files that do not exist. Again not a big issue but something you should look at at some point.
welsh-corgi, decorative-pillows, satin-bedding Must have been an interesting site before you acquired it.
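For anyone wanting to reproduce the redirect check described in this post (a 301 to the non-www host followed by a 404), here is a minimal Python sketch using only the standard library. The function names `fetch_chain` and `describe_chain` are made up for illustration, and this is just one way to look at the hops, not how Google evaluates them.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_chain(url, max_hops=5):
    """Follow redirects by hand and return [(status, url), ...], one per hop."""
    hops = []
    for _ in range(max_hops):
        parts = urlsplit(url)
        conn_cls = (http.client.HTTPSConnection
                    if parts.scheme == "https" else http.client.HTTPConnection)
        conn = conn_cls(parts.netloc, timeout=10)
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        try:
            conn.request("HEAD", path)
            resp = conn.getresponse()
            status = resp.status
            location = resp.getheader("Location")
        finally:
            conn.close()
        hops.append((status, url))
        if status not in REDIRECT_CODES or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return hops


def describe_chain(hops):
    """Summarise a hop list, flagging a redirect that ends in 404/410."""
    statuses = [status for status, _ in hops]
    chain = " -> ".join(str(s) for s in statuses)
    if len(statuses) > 1 and statuses[-1] in (404, 410):
        return "redirect then not-found: " + chain
    return chain


# Usage (makes real network requests, so run it by hand):
#   print(describe_chain(fetch_chain("http://www.microbialgenomics.org/tag/welsh-corgi")))
```

A result such as `redirect then not-found: 301 -> 404` corresponds to the situation described above; a plain `404` is the preferred outcome for pages that no longer exist.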
|Re: Website categorized as pure spam||Susanna Siebert||11/4/13 12:12 PM|
I received a message that our site is still violating Google's quality guidelines. Unfortunately the message is totally unspecific as to what the problem is:
"We've reviewed your site and we believe that http://microbialgenomics.org/ still violates our quality guidelines. In order to preserve the quality of our search engine, pages from http://microbialgenomics.org/ may not appear or may not rank as highly in Google's search results, or may otherwise be considered to be less trustworthy than sites which follow the quality guidelines.
For more specific information about the status of your site, visit the Manual Actions page in Webmaster Tools. From there, you may request reconsideration of your site again when you believe your site no longer violates the quality guidelines.
If you have additional questions about how to resolve this issue, please see our Webmaster Help Forum."
Redleg x3 already mentioned the 301 redirect. I'm actually not quite sure how to fix it, so if anybody has any pointers how to do that in Wordpress, that would be great.
It would be great if you guys could have another look at my site to determine what is wrong with it. I mention a few concerns in my original message so that would be a good place to start.
|Re: Website categorized as pure spam||ets||11/4/13 12:33 PM|
Did you do this:
Do you see anything further there... or just "pure spam"?
|Re: Website categorized as pure spam||Suzanneh||11/4/13 12:34 PM|
>>3) Maybe our site got hacked.
Did you check Security Issues in Webmaster Tools?
|Re: Website categorized as pure spam||Ben Griffiths||11/4/13 12:34 PM|
The fact Redleg had a look probably means you can rule out a hack, or they'd have picked up on it. But yeah new-and-improved WMT has a specific section for that.
I'd say it's autogenerated/scraped content, but that is a guess. Very harsh penalty if so, IMO; even if the site has little merit from Google's POV that's what the algorithm is for (certainly a lot of the site is very thin eg http://microbialgenomics.org/disease/nonalcoholic-fatty-liver-disease-nafld/ )
|Re: Website categorized as pure spam||Susanna Siebert||11/4/13 12:43 PM|
Just "pure spam"
|Re: Website categorized as pure spam||ets||11/4/13 12:43 PM|
That's a fair guess. Let's investigate that:
Susanna.... How is the site assembled? Where does the content come from? Are pages like this...
...pulled in by some automatic process from things like pubmed?:
My guess was maybe some connection between the medical content and, say, a Viagra/cialis/drug type spam hack - but with the site totally deindexed, it's hard to know. The bing cache is pretty empty.
|Re: Website categorized as pure spam||Susanna Siebert||11/4/13 12:46 PM|
Ben, do you think that removing these thin sites would make a difference?
|Re: Website categorized as pure spam||Susanna Siebert||11/4/13 12:46 PM|
Suzanne, I checked that and it doesn't pick up on anything.
|Re: Website categorized as pure spam||Ashley||11/4/13 12:50 PM|
when you get it taken care of - look into latency. That nearly-empty page Ben linked to above took forever to load for me.
And make the titles/images clickable so I can get in
Curious about why you're disallowing Google to crawl your people profiles?
Definitely still an issue, and not just for Google. Check out what Bing has indexed
They do seem to be returning 404s properly now - so you may need some patience. But that doesn't totally explain why none of your content is otherwise indexed other than Google must see something pretty bad.
Why would I go to your site instead of http://genome.wustl.edu/ which seems to be the official site?
I dug into one post - http://microbialgenomics.org/organisms/vibrio-harveyi/
But look here: http://genome.wustl.edu/genomes/detail/vibrio-harveyi/
Same content, and indexed well on the main domain.
Your strategy of another site needs some serious rethinking.
|Re: Website categorized as pure spam||Susanna Siebert||11/4/13 12:52 PM|
ets, yes, Google thinking that we're the victim of a drug hack was our concern too. The publication information is pulled in through a semi-automated process. We pull down the information through the pubmed API, put it in a spreadsheet, and upload it to our website via a csv importer. So the information is not pulled from pubmed in real-time but rather stored separately in our database.
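To make the pipeline above concrete for other readers, here is a rough Python sketch of the two steps (request the records, write a CSV for the importer). It is not the group's actual script: the CSV column names and the helper names `esummary_url` and `records_to_csv` are assumptions for illustration. The only real endpoint used is NCBI's E-utilities ESummary URL.

```python
import csv
import io
from urllib.parse import urlencode

# NCBI E-utilities ESummary endpoint (returns document summaries for given IDs).
EUTILS_ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def esummary_url(pmids):
    """Build an ESummary request URL for a list of PubMed IDs."""
    query = urlencode({
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),
        "retmode": "json",
    })
    return f"{EUTILS_ESUMMARY}?{query}"


def records_to_csv(records, fieldnames=("pmid", "title", "journal")):
    """Serialise already-downloaded records (dicts) to CSV text.

    The column names are hypothetical; a real importer will expect its own.
    Keys not listed in fieldnames are silently dropped.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

Because the CSV is produced once and then imported, the site serves a stored copy rather than querying pubmed on each page view, which matches the description above.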
|Re: Website categorized as pure spam||Ben Griffiths||11/4/13 12:56 PM|
>>Ben, do you think that removing these thin sites would make a difference?
Not likely, possible though. NOINDEX would be my preferred route, but "Pure spam" is very very unlikely to be just that you have a bunch of thin pages.
I looked for ExitJunction code, old spam URLs not returning 404s, all sorts, nothing. It could be that the old version got hit with a Spam penalty and whatever secret tools the Spam Team uses is picking up something that jibes with the original cause, but explicitly telling them you're the new owner should overcome that you'd think.
I'd imagine someone somewhere is going to bump this up the ladder if none of us can put our fingers on it - although without your credentials it would just be A N Other scraped site, so maybe not.
|Re: Website categorized as pure spam||Susanna Siebert||11/4/13 12:59 PM|
I know that latency, unfortunately, seems to be an issue for us but I'm not sure what causes it.
Can you tell me which titles/images are not clickable for you?
Unfortunately, this had to be done because Wordpress's custom post types work in weird ways. We wanted to be able to link our collaborators to our projects/publications etc but we didn't want to have individual posts for them. Unfortunately, you can't have the first without having the latter. So we made posts but we don't want people to actually go to them. It's certainly not ideal.
How can I get rid of those? Disallow?
The main site does not represent our work adequately. We needed our own website that focuses on microbial work only, to fulfill outside requirements.
|Re: Website categorized as pure spam||ets||11/4/13 1:05 PM|
Can you please do a fetch as Google on your home page and paste the entire thing in here, headers and code? (gosh I am going to be popular) I'm just thinking "rule out IP-address based cloaking/hack").
|Re: Website categorized as pure spam||Susanna Siebert||11/4/13 1:10 PM|
Fetch as Google
This is how Googlebot fetched the page.
Date: Monday, November 4, 2013 at 1:09:51 PM PST
Googlebot Type: Web
Download Time (in milliseconds): 1870
|Re: Website categorized as pure spam||Ashley||11/4/13 1:10 PM|
If you insist on having two sites, then you need to fill it with UNIQUE content. What percentage of the new website is totally unique? Because I'm not seeing anything.....
|Re: Website categorized as pure spam||Ashley||11/4/13 1:11 PM|
I'll do a few different header checks for various UAs/referrers and see if I find anything...
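A check like this can be scripted. Below is a minimal Python sketch, assuming the standard library only; the helper names and the exact user-agent strings are illustrative choices, not a record of what was actually run. It can only catch cloaking keyed on request headers; cloaking keyed on Googlebot's IP addresses would not show up here, which is why Fetch as Google was requested earlier in the thread.

```python
import hashlib
import urllib.request

# Example header values; any realistic browser / crawler strings would do.
USER_AGENTS = {
    "browser": "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0",
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}


def fetch_body(url, user_agent, referer=None):
    """Fetch a URL with a chosen User-Agent (and optional Referer) header."""
    headers = {"User-Agent": user_agent}
    if referer:
        headers["Referer"] = referer
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=15) as response:
        return response.read()


def differing_variants(bodies):
    """Given {label: bytes}, return the labels whose body differs from the first."""
    labels = list(bodies)
    if not labels:
        return []
    baseline = hashlib.sha256(bodies[labels[0]]).hexdigest()
    return [label for label in labels[1:]
            if hashlib.sha256(bodies[label]).hexdigest() != baseline]


# Usage (makes real network requests, so run it by hand):
#   bodies = {name: fetch_body("http://microbialgenomics.org/", ua)
#             for name, ua in USER_AGENTS.items()}
#   print(differing_variants(bodies))
```

A non-empty result is only a prompt to diff the responses by hand, since dynamic pages can legitimately differ between requests (timestamps, tokens and so on).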
|Re: Website categorized as pure spam||Ashley||11/4/13 1:12 PM|
You are on Wordpress 3.5.1
Wordpress is on 3.7.1
You NEED to update otherwise you'll keep getting hacked.
|Re: Website categorized as pure spam||Susanna Siebert||11/4/13 1:21 PM|
You're bringing up a valid point. Things that are totally unique are:
- Our Mission
- Resources & Environment and their subpages
- All the projects
- Most of the organisms
- The team biographies
- All the linking between all of these different parts - this is an added value that may not be apparent to the outside observer but on this website we have everything on one page, all linked up so that collaborators, researchers etc can access everything about our work in one place.
Yes, publications and organisms are an issue. For publications especially since there are >1000 that were published by us and were imported but there is just no good way to present them in any other way. All this information is static. So unless we want to scrap all the publications, which would dramatically reduce the informational content of our site (i.e., the linking I talked about) I just don't know how to fix it. What do you guys suggest? I'm also interested in knowing how other sites can reproduce publication abstracts without it being counted as duplicate content. I mean that kinda stuff is everywhere on research websites.
For organisms we can rewrite the content but even then I'm afraid that Google would count it as scraped. Do you guys think rewriting this content would help?
|Re: Website categorized as pure spam||ets||11/4/13 1:23 PM|
Well apart from what Ashley said, that reveals nothing.
Have you tried downloading your htaccess file from the server and looking in there for any suspicious redirect code? It is important to look at the version on the server not the one you upload (if you upload one).
|Re: Website categorized as pure spam||Susanna Siebert||11/4/13 1:23 PM|
Ashley, we plan to update once we can make sure that our site will be stable and that all of our plugins will still work.
Are you saying that we definitely have been hacked?
|Re: Website categorized as pure spam||ets||11/4/13 1:25 PM|
We're trying to find out :)
|Re: Website categorized as pure spam||Ben Griffiths||11/4/13 1:28 PM|
I ran a couple of malware scans fine, but it did pick up on the outdated wordpress.
For what it's worth, the site used to be an Amazon affiliate auto-gen site:
That would tie in with the penalty being for this
Little or No Original Content:
|Re: Website categorized as pure spam||Susanna Siebert||11/4/13 1:31 PM|
I looked at it back when this first became an issue. I didn't find anything suspicious. Unfortunately, I have to leave work for today so I asked one of my colleagues, Michael, to have another look. He'll update this thread with his findings.
|Re: Website categorized as pure spam||Susanna Siebert||11/4/13 1:33 PM|
Ben, that makes sense. How is this still affecting the site today since I made the reconsideration request?
|Re: Website categorized as pure spam||Ben Griffiths||11/4/13 1:35 PM|
It's not, but given, as Ashley points out, the site is a near-duplicate of an authoritative source it could well be that the penalty is still relevant.
|Re: Website categorized as pure spam||ets||11/4/13 1:36 PM|
Yes, but Susanna put in a RR last week saying she'd taken over the site from the cushion and pillow spammer who owned it before.
|Re: Website categorized as pure spam||Lysis||11/4/13 1:44 PM|
Bing doesn't like the site either.
|Re: Website categorized as pure spam||Ben Griffiths||11/4/13 1:45 PM|
Yeah, to the extent that I was using Yandex to try and diagnose it. That only has 7 pages indexed too though.
|Re: Website categorized as pure spam||Michael Kiwala||11/4/13 2:01 PM|
I just looked at the .htaccess file. Nothing suspicious, just the typical WordPress boilerplate.
|Re: Website categorized as pure spam||ets||11/4/13 2:04 PM|
So why are no search engines indexing it when it clearly has plenty of pages?
It's not as if Bing is particularly fussy about "pure spam" - ho ho - yet manages to index only the home page?
If it were Google saying "The pages are thin/duplicate", it would still be abundantly indexed on Bing.
|Re: Website categorized as pure spam||Susanna S||11/4/13 4:40 PM|
Good question. I did submit a sitemap to Bing. I now also submitted the main urls of the landing pages to be indexed. Do you guys think that the past content resulted in the robots not crawling the page?
|Re: Website categorized as pure spam||ets||11/4/13 11:52 PM|
In the absence of any other ideas, as Ben suggested, you could try noindexing: perhaps noindex everything off /publications/ and put in another RR. Does it matter to you whether that subdirectory is indexed or not? The content is all on pubmed anyway. You could even remove the entire subdirectory from Google's index (or selectively noindex it for Googlebot):
Remove a page or site from Google's search results
When NOT to use the URL removal tool
However, IMHO, in cases such as this, the manual webspam team should actually say explicitly what is wrong. They should have some leeway. I appreciate that the idea is to be fair by treating everyone the same - but everyone is not the same. You are a university department researching infectious diseases, not a bunch of Pokemon spammers in Jakarta. I think a steer is appropriate in this case, please, Google?
|Re: Website categorized as pure spam||Susanna Siebert||11/5/13 7:41 AM|
Thanks for the suggestion. I really appreciate it.
Am I understanding this correctly that if I disallow the pages with duplicate content in my robots.txt and block the urls in the webmaster tools, that Google will not take those pages into consideration when making their "pure spam" determination? We don't really care for all the publications to be indexed. The only page we care about is the landing page at microbialgenomics.org/publications/. Similarly for organisms. Do you think it would be a good idea to do the same there? What about the thin pages?
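One way to see what a given Disallow rule would block before changing the live file is to test it locally. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content and the example URLs are hypothetical, not the site's real file. Note that a Disallow rule governs crawling rather than indexing: a page blocked this way cannot be crawled, so a noindex tag placed on it would never be seen.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, not the site's real one.
ROBOTS_TXT = """\
User-agent: *
Disallow: /publications/
"""


def crawlable(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Individual publication pages are blocked by the prefix rule...
blocked_child = crawlable(
    ROBOTS_TXT, "Googlebot", "http://microbialgenomics.org/publications/some-paper/")
# ...but so is the landing page itself, because it shares the same prefix.
blocked_landing = crawlable(
    ROBOTS_TXT, "Googlebot", "http://microbialgenomics.org/publications/")
# Other sections are unaffected.
projects_ok = crawlable(
    ROBOTS_TXT, "Googlebot", "http://microbialgenomics.org/projects/")
```

With this plain prefix rule the /publications/ landing page is blocked along with the individual pages, which is worth keeping in mind if the landing page is the one page that should stay indexable. Python's parser does simple prefix matching only, so it will not reproduce every pattern that Google's own parser understands.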
|Re: Website categorized as pure spam||ets||11/5/13 8:02 AM|
What's the consensus here? Should Susanna "noindex" her thinner stuff?
I would say "yes" on the grounds that it will never help the site to have an exact duplicate of something as authoritative as pubmed.
|Re: Website categorized as pure spam||Ashley||11/5/13 8:15 AM|
I don't know - I'm honestly still confused as to why there should be two separate sites! If it's the same org/school - why are you not creating a single website? That is so much less confusing for bots and users.
Seems to me like there's some serious hangup on the actual domain - but even if that is cleaned up I'm still struggling with the decision to go forward with these two sites and a large amount of duplication and thin content.
|Re: Website categorized as pure spam||Susanna Siebert||11/5/13 8:55 AM|
It wasn't our choice. We have to have our own website. It's a requirement for our work. Unfortunately, I can't go into detail as to why it's a requirement. You will just have to trust me that it is.
Our work is not presented adequately on the existing genome.wustl.edu website. Let me give you an example where the true value of our site comes in. Let's say someone is looking for information on research concerning Daptomycin-resistant Enterococci and they end up on our Daptomycin-resistant Enterococci project site (http://microbialgenomics.org/projects/daptomycin-resistant-enterococci/). They can see a whole bunch of information about the project, who our collaborators are, what organisms are involved in this research, and what publications were written for this project. If they now want to know more about one of the disease agents, let's say E. faecalis, they don't need to google it. They can just click on the link (http://microbialgenomics.org/organisms/enterococcus-faecalis/) and learn all about it. Here they not only see a comprehensive description of this microbe, they see all of our other projects that research this microbe, all of our collaborators, and all of the publications that we have written about it. In addition, if they want to know what exact strains we have sequenced, this information can also be found here. And instead of having to go to the NCBI website to search for those strains, we provide links to all the important information about the strains. You see the added value really is that we have pulled in all the information about our work from all different places to be accessible in one place for our collaborators and other researchers; something that has not been done elsewhere. In addition, our content in regard to microbial research goes well beyond what has been provided at genome.wustl.edu. I'm sorry that you don't agree with us that we provide useful content to our users but I disagree with your assessment -- our site is NOT a duplicate of genome.wustl.edu.
It looks like the biggest hangup is the publications. Would people/Google feel better about our site if we didn't have individual publication pages and just linked to pubmed directly? However, then I wonder why it's ok for genome.wustl.edu to basically replicate pubmed? What's the difference?
Secondly, it seems like the organism pages may be a problem. Although I described above all the additional information we present on these sites, would it be helpful if we rewrote all the descriptions instead of citing other sources?
|Re: Website categorized as pure spam||ets||11/5/13 9:09 AM|
|Re: Website categorized as pure spam||Susanna Siebert||11/5/13 10:16 AM|
Could it also be that this shows up because I put in a request yesterday to have the outdated pages from the previous owner removed?
|Re: Website categorized as pure spam||Free2Write||11/5/13 10:35 AM|
|Re: Website categorized as pure spam||Suzanneh||11/5/13 11:46 AM|
>>Could it also be that this shows up because I put in a request yesterday to have the outdated pages from the previous owner removed?
What do you mean? How did you do that? Did that have to do with Bing or Google?
Strange that Bing is shutting you out, too -- you sure you don't have any spam on the site? Or maybe Bing is checking Google's results and doing a "If not in Google, don't show in Bing." ;-) (I say that last one mostly in jest. But you never know...)
|Re: Website categorized as pure spam||Suzanneh||11/5/13 11:51 AM|
There isn't any code on the site that redirects users if they hit the back button? Something like Exit Junction?
|Re: Website categorized as pure spam||Susanna Siebert||11/5/13 11:53 AM|
As of yesterday Bing still had some spammy urls from the previous owner indexed (e.g. www . microbialgenomics . org /tag/welsh-corgi, www . microbialgenomics . org /tag/satin-bedding). I used the Bing Webmaster Tool's Block URL functionality to block these so that they are excluded from Bing's index.
|Re: Website categorized as pure spam||ets||11/5/13 12:03 PM|
Whatever is troubling Google is also troubling Bing.
Ben checked for ExitJunction Suzanne. :)
|Re: Website categorized as pure spam||ets||11/5/13 12:26 PM|
Ashley's right to emphasize the need to update WordPress immediately.
But in the meantime, have you run any malware scanners against WordPress? I'm not a WP person, but there are things like this:
|Re: Website categorized as pure spam||Eric Kuan||11/5/13 12:33 PM|
It looks like there may have been an error when processing your reconsideration request. We're re-processing it now, and you should see a change in the Manual Actions Viewer in the next couple of days.
|Re: Website categorized as pure spam||Ashley||11/5/13 12:33 PM|
Did you ever use the URL removal tool in WMT?
|Re: Website categorized as pure spam||Robbo||11/5/13 1:04 PM|
Here are a few examples of the sort of link text (anchor text) that was being used in links from other site/s to the domain:
"rio home bedding silky satin duvet cover sets 5 pieces -queen …"
and the target URL was:
http:// microbialgenomics.org/ 37-rio-home-bedding-silky-satin-duvet-cover-sets-5-pieces-queen-light-pink.html
which is of course now 404 Not Found.
And looking at the page on the other site the spammy link was on, I see they are authoritative [ :-) ] for things like:
printed designs on coolers
|Re: Website categorized as pure spam||Susanna Siebert||11/5/13 1:07 PM|
Is there a good way to find URLs like these so that I can block/remove them?
|Re: Website categorized as pure spam||Susanna Siebert||11/5/13 1:09 PM|
Thanks Eric. I appreciate it. If there should still be a problem could I get a more precise list of reasons as to why my site is categorized as pure spam? That would really help in fixing these problems.
|Re: Website categorized as pure spam||Susanna Siebert||11/5/13 1:13 PM|
I removed the spammy URLs from the previous owner. I have not yet removed the publication pages etc.
|Re: Website categorized as pure spam||Suzanneh||11/5/13 1:19 PM|
I don't know about Bing, but with Google if a page is returning a 404 Not Found (which tag/welsh-corgi is), there shouldn't be a problem.
It's not the old site that's the problem; otherwise, Google would have reincluded the site. It would be either a hack on the current site (which doesn't seem to be the case) or, as Ashley is suggesting, thin, duplicate content. I haven't delved into the content, but I'd trust Ashley's opinion of it.
Thanks, ets! Getting to be a long thread. :-)
|Re: Website categorized as pure spam||Suzanneh||11/5/13 1:21 PM|
Undo that. Those pages don't exist anyway. Leave them as a 404; they'll eventually drop out.
You can deindex your site if that tool isn't used properly.
When Not to Use the Remove URL tool: https://support.google.com/webmasters/answer/1269119?hl=en
|Re: Website categorized as pure spam||Susanna Siebert||11/5/13 1:25 PM|
Suzanneh, it looks like there may have been a problem with my RR (See Eric's message below).
Also, I canceled those removal requests.
|Re: Website categorized as pure spam||Susanna Siebert||11/6/13 6:04 AM|
As of this morning the "pure spam" designation has disappeared from the Manual Actions tab in the Webmaster Tools. I haven't received any messages or emails, however. So far none of my pages have been indexed.
Does this mean that Google will start to index my site now? Or was this just removed while the site is being reviewed?
|Re: Website categorized as pure spam||Ashley||11/6/13 9:32 AM|
Hey Susanna -
It generally means you're good. I'd give it a few days to crawl the site a few times and re-evaluate.
|Re: Website categorized as pure spam||tucsonadventuredogranch||11/8/13 3:43 PM|
I have been desperately trying to find an answer to a question regarding two disavow link messages I have received in my Webmaster tool account.
Any response or help you can provide will be greatly appreciated. (See below.)
I hope you can assist me in resolving an issue to which I cannot find an answer. Via my Webmaster tool account, I submitted a disavow link .txt file. I received the following two messages - which seem to contradict each other:
http://www.tucsonadventuredogranch.com/ has been updated. If this is unexpected, it may have been updated by another site owner. For more information, visit the Disavow links https://www.google.com/webmasters/tools/disavow-links?siteUrl=http://www.tucsonadventuredogranch.com/ page in Webmaster Tools. Details: You successfully uploaded a disavow links file () containing 0 domains and 0 URLs.
|Re: Website categorized as pure spam||travler.||11/8/13 3:51 PM|
Your thread is at https://productforums.google.com/forum/#!category-topic/webmasters/crawling-indexing--ranking/b9M6LLbXhHs
It's best to keep your issues separate from other threads with different issues. Hijacking is not a good thing :-)
Someone should be able to take a look at your thread and offer suggestions.