Categories: Crawling, indexing & ranking

Mystery Links - amazonaws

Showing 1-14 of 14 messages
Mystery Links - amazonaws adjusted-data 1/5/14 9:32 AM
I've read the FAQs and searched the help center. 

So my website was purring along like normal: all static pages, served via Amazon CloudFront as a CDN, with 'www' as the preferred domain. I decided that I wanted to add some interactive features and integrate some social media. 

So I picked my technology stack, wrote a big check to Amazon AWS to prepay for some instances, and started to code. In mid-December I put up the testing site. There were no published links to that domain, but I made the mistake of running it through pagespeed; then googlebot started to crawl it, and then boom . . . my search queries vaporized:

I guessed that I was being penalized for "duplicate" content. So I populated all the testing pages with noindex, nofollow meta tags, I put up a robots.txt with Disallow: /, etc. But it made no difference. So I killed the testing site . . . it still made no difference.

I was in the penalty box, so I decided to make the best of a bad situation and make the big technology changes while site traffic was low. So I changed my preferred name to '', I put nginx up front and 301ed the 'www' requests to my naked domain, I put in a section to protect from spam referrals, etc. In Webmaster Tools, I changed my preferred domain, I updated my DNS records, and I deleted the sitemaps for the 'www' site and added a sitemap to my naked domain entry. Googlebot started to crawl the new domain and slowly started to add pages. At this point it has indexed about 80% of my pages, it just started to recognize structured data yesterday, and it has added about 50% of the external links from the "www" site to my new naked domain.
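The nginx front-end described above would look roughly like this minimal sketch, with example.com standing in as a placeholder for the real naked domain:

```nginx
# Minimal sketch: answer every 'www' request with a 301 to the naked domain.
# example.com is a placeholder for the real domain.
server {
    listen 80;
    server_name www.example.com;
    return 301 http://example.com$request_uri;
}
```

`return 301` is generally preferred over a `rewrite` rule for this case because it is explicit about the status code and avoids regex processing on every request.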

I looked through my "Links to Your Site" report and saw nothing unusual in the days leading up to the penalty being levied. So I have not been able to figure out why I am being penalized. But today I used Webmaster Tools to compare the www domain to my naked domain and I noticed this entry (which only applies to the old www domain):

My guess is that googlebot somehow found its way to my CDN or S3 bucket and is again penalizing me for "duplicate" content. I downloaded all the links, expecting to find 18,203 links originating from amazonaws, but when I search through the CSV file, I find no such thing. How do I find the source of those ~18k links? BTW, I don't think those amazonaws links were listed in mid-December (but I could be wrong). So it's possible that first I was penalized for the testing site, and now I am being penalized for these links (that I can't even find). 
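One quick way to hunt for those links in the export, sketched here with a made-up three-row sample standing in for the real "Links to Your Site" CSV:

```shell
# Create a tiny stand-in for the downloaded CSV (the real export has more
# columns and thousands of rows; these URLs are invented for illustration).
cat > links.csv <<'EOF'
http://example.edu/reading-list
http://ec2-107-22-0-1.compute-1.amazonaws.com/page
http://blog.example.org/post
EOF

# Count rows mentioning amazonaws, case-insensitively.
grep -ic 'amazonaws' links.csv   # prints 1
```

If that count comes back zero against the real export, the 18,203 figure is coming from somewhere the CSV doesn't cover, which matches what you're describing.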

In general, does anyone have any ideas or insights that might help me solve this problem? My site serves a lot of teachers and students that will be returning from Christmas break tomorrow, and my search queries from google have been reduced by 90%.


Mystery Links - amazonaws OwnerEditor 1/5/14 9:59 AM
Did you go a step further and click on the links to pages on your own site underneath the warning you mentioned? I did that (I have a notification of 3757 links from amazonaws) and found that everything is coming from two pages that are in fact on a subdomain, e.g. ec2-107-22[snip]te.1.amazonaws . com. Those pages no longer exist, but I know that Webmaster Tools updates slowly, so I'm not surprised that the information still appears there.
(unknown) 1/5/14 10:10 AM <This message has been deleted.>
Re: Mystery Links - amazonaws adjusted-data 1/5/14 10:13 AM
That's another odd thing. It says there are 18,203 total links. Then it only lists two pages. When I click on those links there is only one link under each one (and they are external domains). Where are the other supposed 18,201 links? I find no trace of them.
Re: Mystery Links - amazonaws adjusted-data 1/5/14 10:16 AM
They were done serially. First I did the nofollow, noindex, and then when that had no effect after 48+ hours I blocked with robots.txt. Then when that had no effect I pulled the site. That last move seemed to be the worst of all. But by then there were so many moving parts it's hard to be sure.
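For reference: step 1 was a `<meta name="robots" content="noindex, nofollow">` tag in the head of each testing page, and step 2 was a site-wide crawl block at the testing domain's root, roughly:

```
User-agent: *
Disallow: /
```

One caveat worth noting (a general crawling behavior, not specific to this site): once robots.txt disallows everything, Googlebot can no longer fetch the pages at all, so it never sees the noindex tags. Blocking crawling and requesting deindexing via meta tags work against each other when applied at the same time, which can leave already-indexed URLs in the index longer.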
Re: Mystery Links - amazonaws adjusted-data 1/5/14 10:21 AM
@Free2Write, "Also, there seem to be sites linking to https versions of your sites. "

Where do you see that? Those are probably links to the old static site that was hosted via Amazon S3 & CloudFront. 
(unknown) 1/5/14 10:39 AM <This message has been deleted.>
(unknown) 1/5/14 10:56 AM <This message has been deleted.>
Re: Mystery Links - amazonaws ets 1/5/14 11:44 AM
>My guess is that googlebot somehow found its way to my CDN or S3 bucket and is again penalizing me for "duplicate" content.

I can see various issues with this site that need fixing, but for solving indexing problems (the possibility of the same page being indexed on multiple URLs), you could implement rel=canonical on all your pages. (So, for example, if the same page is accessible, i.e. gives a 200 response code, both via your domain and via a dummy AWS or CloudFront address behind the scenes, Google needs to know which one you want indexed. You can do that with a canonical URL.) I don't see canonicals at the moment. This helps with all kinds of issues (including www/non-www) and things you cannot predict (such as people appending query strings to URLs).
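A canonical tag, as suggested above, is one line in each page's `<head>`; a sketch, with example.com as a placeholder for the preferred domain:

```html
<!-- Sketch: declare the one URL you want indexed, regardless of which
     hostname (www, naked domain, S3/CloudFront endpoint) served the page.
     example.com and the path are placeholders. -->
<link rel="canonical" href="http://example.com/some-page.html">
```

The href should be the absolute URL on the preferred domain, and each page points at its own canonical URL, not at the homepage.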

Re: Mystery Links - amazonaws Gary Illyes 1/5/14 1:35 PM
Hello Ernest and welcome to the forum!

You want to keep in mind that users (and also our algorithms) prefer unique and compelling content. See the following related Help Center article:
If having 100% unique content is not possible, the webmaster should make sure the pages have elements that are unique and valuable for the users, giving them a good reason to visit the site. With websites like that, my general advice is to take note of the Webmaster Guidelines ( ) and change the content and/or site if needed.

Good luck with your sites!
Re: Mystery Links - amazonaws adjusted-data 1/5/14 5:30 PM
Thanks for the welcome, and I appreciate the links. But you just told me to do what I was already doing when the problem occurred. I really don't think that doing it harder is the answer. I think the real problem here is that google somehow sees 18,000 links of "duplicate" content, and the real task here should be killing off those links. If I improve the content, I go from having a site that is not getting any traffic due to a duplicate content issue to a site with more compelling content that still isn't getting query traffic because it still has a duplicate content problem. There must be a pretty specific reason that 95% of my query traffic was killed in about 36 hours.

My most immediate goal seems like it should be finding the source of those 18,000 links that Webmaster Tools says I have -- yet doesn't include in any of the reports that I can find -- and killing it. The good news is that I do think that I found the source of the problem. My static site was being stored in an S3 bucket behind the CDN, but the bucket also has a website endpoint. I had forgotten about that. I think that somehow Google discovered that S3 endpoint and indexed it. That would explain a lot about the amazonaws domain that it sees.

So what is the best remedy now? Should I kill that endpoint or should I redirect it to my domain? Is it better to feed it HTTP/1.1 301 Moved Permanently or should the endpoint URL just disappear? I did a redirect but will kill it if that is preferred.
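If the redirect is kept, S3's static website hosting can issue the 301 itself through a bucket website configuration; a sketch, with example.com standing in for the real domain:

```xml
<!-- Sketch of an S3 website configuration: every request to the bucket's
     website endpoint is answered with a redirect to the real domain.
     example.com is a placeholder. -->
<WebsiteConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <RedirectAllRequestsTo>
    <HostName>example.com</HostName>
  </RedirectAllRequestsTo>
</WebsiteConfiguration>
```

A 301 is generally the safer choice over making the endpoint disappear, since it forwards visitors (and any accumulated link signals) to the URL you actually want indexed instead of returning errors.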

(unknown) 1/5/14 5:46 PM <This message has been deleted.>
Re: Mystery Links - amazonaws ets 1/6/14 8:45 AM
No, you need to redefine the question. Forget the mechanics of how the site is built for a moment and ask simply "why did I suddenly lose most of my traffic?" Then re-read Gary's answer.

Where does the content come from? If all the site does is republish material available on sites such as Project Gutenberg, Google won't see it as a useful site. It doesn't matter how the site is built or what server/CDN you use; that is a fundamental flaw that you must address first.

Maybe you need to get people commenting on the texts you serve or develop more of a community angle? You need to add some unique value.

Re: Mystery Links - amazonaws StevieD_Web 1/6/14 12:01 AM
Re-read Gary's post 

(BTW, Gary works for Google and his answer was marked correct/best by John Mu, another Google employee)

Gary basically said you are chasing the wrong issue/problem and pointed you in the direction of the Google guidelines: LITTLE OR NO UNIQUE CONTENT

ets responded:

>Where does the content come from? If all the site does is republish material available on sites such as Project Gutenberg, Google won't see it as a useful site.

The problem isn't with your presentation of the information or how the information is linked; your fundamental problem is NO UNIQUE CONTENT.