|The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/22/12 9:39 PM|
Once upon a time, there lived a webmaster who built a website for all the engineers in the world so that they could share interesting ideas and offer technical help to each other. He created a forum and a blog to keep people updated about the latest in engineering and technology. The webmaster focused on content and never engaged in any link-building or black-hat SEO. His website received a lot of visitors who created content, but many of them were not native English speakers and struggled with English. The webmaster and the other members of the website always went out of their way to help people and fix problems.
The website became so popular that it got recommendations from Steve Wozniak (Apple Co-Founder), Dr. Stroustrup (creator of C++), Dr. Moshe Kam (IEEE CEO), Paul Buchheit (Creator of Gmail) and many renowned people. The community of engineers continued to grow, adding more engineers to the community and creating great discussions.
But the good times didn't last long. In the first week of September 2012, out of nowhere, the website's traffic dropped by almost 50%! The webmaster and his admin team began investigating the cause. Was it 'The Panda', was it 'The Penguin', or was there a blunder by someone in the admin team? The team checked Google Webmaster Tools and found that Google had reported several thousand 404 errors! They even found that Google had dropped the crawl rate to about 30% of its initial level! They quickly checked the site and everything seemed to be working fine. They thought the errors would be temporary and decided to check again after a while. They checked after a few days and found that the error count was going up and up! The traffic to their website was down to about 40-50% of what the site used to get. Something was definitely wrong!
The team decided to remove Disqus from their website and also decided to revert the sidebar they had added to the forum pages. They made sure that the site loaded fast and there were no errors on the site. Even after fixing all the errors, removing all the possible sources of issues and waiting for over 1.5 months, Google still crawls at 20-30% of the rate it used to. Every day, they mark 1000 errors as 'Fixed' and continue to hope that Google will crawl their website more. They continue to contribute content, but haven't had any success so far.
The entire team has abandoned their growth plans and is now working on getting their traffic back - which they lost through no fault of their own. The webmaster summarised findings on the 404 error graph :-
The webmaster and his team would appreciate any feedback on how to get Google to crawl their website more and acknowledge that they've already fixed all the errors.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||I know nothing||11/22/12 10:48 PM|
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/22/12 11:16 PM|
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||Justin Aldridge||11/23/12 3:28 AM|
Great site by the way. I'm actually a Mechanical Engineer myself and I didn't know of the site but will check it out frequently now.
Have you tried the "Fetch as Googlebot" option in Webmaster Tools and then the "Submit to index" option (choose the option to crawl the page and all of the pages it links to)?
It could help to speed up the crawling of the website and the 404 pages.
I've not personally seen it happen before, but I suppose the massive increase in 404s could have caused a problem.
Is there a particular section of the website which has seen a drop in traffic or is it across the board?
It is also very sad when Google gets it so wrong that they ruin some great websites and businesses.
Hope you get it to recover soon.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/23/12 3:59 AM|
Thank you for your appreciation, Justin. We hope to recover soon so that we can focus on bringing new, innovative things for all the engineers in the world.
The drop in Googlebot crawl rate and the surge in error count are almost in sync. It's logical that Googlebot wants to crawl a website with 'so many errors' less frequently. We've already downloaded all the errors through the API and can confirm that we've fixed all of them (by setting proper redirects). There are only a few (~1000) that should return 404 because those pages have actually been removed!
We've already tried -
1. Fetch as Googlebot -> Submit to index option.
2. Tried increasing the crawl rate, but later chose 'Let Google Decide' upon advice from experts here.
3. Have ensured that the site loads faster than ever before. At least it's not slow enough to make Google visit it less frequently. Each page loads in < 2 seconds for any regular user.
I'd really appreciate a relook at our GWT account from a Google employee. I know it's very difficult to attract attention from them, but my team and I are hopeful.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||Justin Aldridge||11/23/12 4:31 AM|
Out of interest, do you have an abnormal level of "not selected" pages?
If you go into webmaster tools, "Health", "Index Status" and click the Advanced tab, we've seen a trend with websites losing rankings when the number of "not selected" is greater than the number of indexed pages.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||Suzanneh||11/23/12 4:37 AM|
I had over a million 404 pages listed (because of installed software gone wonky) and my site was not affected (I think I mentioned this in a thread of yours but I'm not sure). You seem to be convinced it's the 404s, so I doubt I can change your mind but I thought I'd try. Google has said 404s do not affect your site: http://googlewebmastercentral.blogspot.ca/2011/05/do-404s-hurt-my-site.html There could be other algorithmic issues affecting your site.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||Stephen Sherman||11/23/12 6:27 AM|
I agree with Suzanneh; I do not think it's the 404s.
To try to help you, I clicked on the 4th story, about the Toshiba robots. It does include a couple of (apparently) original paragraphs, a photo (looks like Toshiba's photo) and a YouTube video. So I think that the photo and the video are non-original content, while the 2 paragraphs of text ARE original content. IF that is representative of your site, then Google's algorithm MIGHT determine that to be a relatively light mix of original vs. non-original. (Of course, since the algo is secret, that's why we are all guessing here.)
Google's algo is constantly being tweaked and what might have been a "good balance" (of orig. vs. non-orig content) yesterday, might come out poorly ranked today.
I mean no criticism of your site, but wanted to try to offer some explanation, other than the 404's (which I really do not think are an issue.)
All the best.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/23/12 6:47 AM|
Looks like we're on to something here. Yes, we do have an abnormal level of 'Not Selected' pages. How do I go about fixing them? Is it because of the internal linking of tags that we have? I think we've got something to work on here!
I'd really appreciate your inputs and feedback in fixing this. We never knew this'd be a problem!
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/23/12 6:52 AM|
I'm aware that 404s do not affect website ranking (as Google confirms it). Since the traffic drop, crawling rate drop and the growth of 404s are almost in sync, I concluded that 404s are causing Google to think that it's visiting a website with errors and that's why it's not crawling us frequently - which might be affecting our ranking. Of course, I can be totally wrong.
I'd really appreciate feedback from the experts here. Can someone take a critical look at our 'Community Forums', which seem to be affected the most because of some algorithmic change? The URL is - http://www.crazyengineers.com/community/
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||zihara||11/23/12 6:54 AM|
I, too, see an ongoing issue with the 404s. Webmaster Tools this morning even offered me a message about an increase in 404s on one of my sites. Investigation showed the 404s are being generated by a bad bot read of a bunch of crap on ask .reference .com - the actual links there are properly implemented, properly closed and address the full, true url, but in the next line of text is a mangled version of my actual page url... There's absolutely nothing I can do about it. In looking further into the ask .reference .com junk I found that every url listed there (and there's millions of them) is included with the same correct link structure but the similarly mangled url mention in the next line. We've seen hints of an unholy relationship between Google and Ask before, is this another sign of that? Whatever it is, it also hints at yet another serious problem in the basic functioning of the Googlebot: it can't differentiate between a real link and a simple textual (partial) mention...
I also noticed that not a one of the millions of links at the ask.reference abomination are nofollowed... Also, I'd never have known the ask.reference abomination even exists if I hadn't investigated those 404s newly listed in Webmaster Tools... Helluva method of promoting a new crap site.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/23/12 6:55 AM|
I really appreciate your response. Our articles are edited and most often based on press releases, tips shared by members and our own findings from the social media. If we are not mistaken, that's how most of the leading blogs create their content.
I agree that the images that we have on our blog posts aren't ours. However, we give full credit to the sources from where we've taken them (the source is mentioned at the bottom of the article). We tend to take the images from the 'Media' section of the company websites most of the time.
I admit, I don't know whether this is allowed or not. But most of the leading blogs do the same and do just fine. I'm not sure if embedding a video is really a problem.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/23/12 7:01 AM|
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||Justin Aldridge||11/23/12 7:08 AM|
Fascinating! That's the exact same trend we've seen on quite a lot of sites that have lost rankings recently. I'll take a quick look at your site now and see if the same fixes we applied to other sites would work to recover the rankings for yours.
I'll post back in a bit, just a little tied up with client stuff at the moment!
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/23/12 8:35 AM|
Finally, it looks like after 2 months of panic, I've got something to look at and fix. Would really appreciate your suggestions. Two things that might be useful -
1. We switched from the vBulletin forum system to the Xenforo system, and a large number of URLs were 301 redirected from domain.com/forum/<url> TO domain.com/community/<url> on December 01, 2011.
2. We had installed a WordPress tag auto-link plugin, which would find tags in the post content and auto-link them to the relevant tag pages internally. This might have caused that jump in the 'Not Selected' URLs. I've just disabled the plugin.
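For anyone following along, the redirect in point 1 can be sketched roughly as below. This is only an illustration, assuming the part of the URL after /forum/ is carried over unchanged to /community/ (as the domain.com/forum/<url> pattern above suggests); the function name `redirect_for` is made up for this example, and a real vBulletin to Xenforo migration may well need a per-thread mapping instead.

```python
from typing import Optional, Tuple

OLD_PREFIX = "/forum/"      # old vBulletin location mentioned in the post
NEW_PREFIX = "/community/"  # new Xenforo location mentioned in the post


def redirect_for(path: str) -> Optional[Tuple[int, str]]:
    """Return (301, new_path) for old forum URLs, or None if no redirect applies.

    Assumption: the tail of the URL is kept as-is under the new prefix.
    """
    if path == "/forum":
        # Bare directory without the trailing slash.
        return 301, NEW_PREFIX
    if path.startswith(OLD_PREFIX):
        return 301, NEW_PREFIX + path[len(OLD_PREFIX):]
    return None


if __name__ == "__main__":
    print(redirect_for("/forum/some-thread.html"))
    print(redirect_for("/community/some-thread.html"))
```

A permanent (301) status is what tells a crawler that the old URL has moved for good, but the crawler has to be able to request the old URL to see it.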
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||Justin Aldridge||11/23/12 8:50 AM|
Yep, I can already see some issues that need sorting in the forums. I have to go out now to a concert (need it) but I'm in the office tomorrow so will have a longer look and post suggestions based on what's worked for other sites that I've worked on.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/23/12 11:56 PM|
I'd really appreciate your review and feedback on our website. I've begun collecting information about the 'Not Selected' URLs, but I'm not sure how to go about fixing them. I've removed the inter-linking of tags from WordPress posts, which might be causing problems, but I'm not sure if that is the real cause of the problem. I'll be really thankful to you if you could suggest a few fixes on the forum section of our website.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/24/12 7:09 AM|
Hello, did you get a chance to go through our website? I'm really looking forward to some clues on how to fix the 'Not Selected' URLs issue.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||Justin Aldridge||11/25/12 3:02 AM|
Hi mate, sorry, was snowed under yesterday and had to leave the office early.
Anyway, I've just had a deeper look and I think your problem is quite different to others I've worked on.
Every time there's a drop in rankings it's usually due to something someone has done. I think that's the case here.
The site itself is perfectly healthy, it is not penalised in search results in any way.
The problem comes in how you have switched from /forum/ to /community/
Looking at an old version of the website I can see that up to at least July of this year you were using /forum/ and later on switched this over to /community/
However, you have made a big error in blocking /forum/ in your robots.txt. You are essentially stopping Google from following the redirects you have put in place to the new /community/. The redirects work fine for a user but not for a search engine.
So the "not selected" is because you now have thousands of pages which Google can no longer crawl or determine if they have been moved. Then all of a sudden it has found the content that was on those pages before but under the new /community/ directory. The association between the old and the new is lost and this is the real reason those pages have lost rankings. They are now essentially just thousands of brand new pages.
All you need to do is remove the block on /forum/ from the robots.txt file and wait patiently for Google to recrawl all of those urls (although try asking Google in GWT to recrawl the /forum/ directory only). I'm not sure if after all of this the rankings will return to what they were but they will certainly make a big improvement.
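A quick way to check this yourself is a minimal sketch like the one below, using Python's standard `urllib.robotparser`. The two robots.txt bodies and the thread URL are made-up examples (not the site's real file), and the helper name `can_crawl` is just for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt that blocks the old forum directory.
BLOCKING = """User-agent: *
Disallow: /forum/
"""

# Hypothetical robots.txt with the block removed (empty Disallow allows everything).
FIXED = """User-agent: *
Disallow:
"""

# Hypothetical old thread URL that now 301-redirects to /community/.
OLD_URL = "http://www.crazyengineers.com/forum/some-thread.html"

if __name__ == "__main__":
    print("blocked file allows crawl:", can_crawl(BLOCKING, OLD_URL))
    print("fixed file allows crawl:", can_crawl(FIXED, OLD_URL))
```

If the old URL is disallowed, a well-behaved crawler never requests it, so it never gets to see the 301 that points to the new location.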
With the current community section you also now need to block or remove all links which really should not be indexed, for example,
When it says "7 people liked this", that goes to a "/likes" page....block it
Each of the numbers of each post, eg, #1, should be nofollowed.
Members whose profiles are not visible to the public, you should block these, eg, http://www.crazyengineers.com/community/members/sada.45192/, as they serve up the same content.
The "Trophy points", "Followers", etc, links in members profiles should all be nofollowed too.
Anyway, you get the gist.
The main ranking problem is from blocking Google from following the redirects. Noindexing the unnecessary pages of the forum will help to bring down the number of "not selected" pages.
Best of luck with it and let us know how it goes.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/25/12 3:37 AM|
First of all, thanks a lot for your time in analysing our website and providing feedback. I really appreciate it. Allow me to briefly post the history of the website and have your opinion -
1. We switched our forum from vBulletin to Xenforo on December 1, 2011. This was done with proper 301 redirects in place. I had ensured that the URLs were correctly being pointed to their new location.
2. January - August end traffic was about 40% of the initial traffic, because we were told that Google would need that time-frame to acknowledge the new location of URLs.
3. I performed a site:crazyengineers.com/forum/ search to ensure that all the /forum/ URLs had been dropped from Google's index before blocking Google from crawling /forum/ URLs. I thought I was taking the right step by telling Google that it no longer needs to crawl the old location because there's nothing useful there anymore. Even /forum/ redirects to /community/
4. So even after the change to /community/ we had the robots.txt in place without blocking /forum/ for a long time (about 7-8 months, I guess).
5. I've already blocked member profile pages from Google about 1.5 months ago.
6. I've been aggressive in blocking thin content from Google through Robots.txt.
Would really appreciate your opinion.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/25/12 3:44 AM|
Forgot to mention - The site did 'pretty ok' with about 60% of its initial traffic (post conversion from /forum/ to /community/ ) until September and overnight saw a fall to about 50% between 4-5 September. Of course there was no Panda or Penguin refresh at that time.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/25/12 3:59 AM|
Here's one more thing I think I must mention -
On 9 September, Google notified us that it had found an extremely large number of URLs and provided the following sample list -
Since then, I've blocked all those locations via ROBOTS.TXT. It however looks like Google's identified the problem and dropped them from their index. The small step in the blue line indicates it (in the image attached above).
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||Justin Aldridge||11/25/12 4:15 AM|
There's definitely a problem with the forum though. If you do site:www.crazyengineers.com/forum/ and click the similar results link, you'll see over 21,000 forum links that it's tried to index but can't.
I don't know why you would have blocked the /forum/ directory at all. People on other sites around the web may have linked directly to particular forum discussions. Those links will now not be working for you, and Google will continue to follow them. You'll be able to see links to forum discussions in GWT.
I still think if you unblock that you will see the difference.
Did the drop in traffic come shortly after blocking /forum/ in the robots file?
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/25/12 4:19 AM|
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||Justin Aldridge||11/25/12 4:27 AM|
When you do the site search you do get two results but then it says:
In order to show you the most relevant results, we have omitted some entries very similar to the 2 already displayed.
If you like, you can repeat the search with the omitted results included.
If you click the link in that you will see all the 21,000 results. The reason it condenses them into two is because they are all the same as it can't crawl the pages.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/25/12 4:32 AM|
Got it! Yes, I do see that problem now. I've removed /forum/ from Robots.txt. Do you have any suggestions on the 'extremely large number of URLs' found message reported by GWT (as I mentioned earlier) ?
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||Lysis||11/25/12 8:35 AM|
Ugh..the SEOs that come here....
No, 404s do not affect a site. Not in the way you are describing.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/25/12 8:43 AM|
@Lysis : I think it's pretty clear now that 404s don't hurt the website rankings. But 404s were all I had that matched with the traffic drop. Anyway!
I'd really appreciate your inputs on fixing the "not selected" URL issue. If you see the graph I've shared above, you can see that the number of "not selected" URLs has gone up tremendously. Our website doesn't have that many URLs (~5 million!) and I'd really appreciate your inputs on how to go about finding the issue. I'm totally stuck!
I really appreciate everyone's responses here and help.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||Lysis||11/25/12 8:54 AM|
"Not selected" can be a multitude of reasons. Poor quality, similar or duplicate content..thin stuff that doesn't need to be in the SERPs (ie a list of parts that people wouldn't want in the SERPs..stuff like that).
No one has all pages indexed. That's just the way it is. Most people don't know of every page they have. If you have 5 million unselected pages that you don't think you have, then it's likely some kind of querystring value being passed that still resolves to a blank or thin page. That's usually what happens.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/25/12 9:34 AM|
For the past several weeks, I've been aggressive about blocking 'thin content' (ex. member profiles, tag pages, feed content and so on). As I said, on 9 September Google reported that it had found an 'extremely large number of URLs' on the site (and the sample list shown had all the paginated content) and by the end of September, it dropped them [you can see a step in the blue line in the graph I uploaded earlier].
I've been analyzing my website for the past few weeks for technical errors but haven't found anything that's wrong. The URLs are all canonical and shouldn't cause a problem.
I'm continuing to investigate the problems. I'd however appreciate anyone providing a clue or direction in which I should proceed. It's been ~3 frustrating months for me. I really want to end the issues and focus on creating high quality content and experience for my members.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/25/12 9:35 AM|
By the way - is it possible to see what kind of content is being considered as 'thin' by Google? Is there any way to find it out?
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||Lysis||11/25/12 10:30 AM|
Not processing bad querystring values to a 404 isn't really something "wrong" per se. But, if the code doesn't handle a querystring value and returns a 200 status with thin content, then yes, that is bad for Google but not really something I would say is "wrong" with the site. I'd say more sub-optimal user experience. I have found crazy rogue pages on large sites, and the site owner doesn't care. You can remove them or redirect or whatever you think is best for users, but these pages won't likely get indexed.
No, there isn't a way to see what is considered thin. You basically have to put yourself in the user's shoes, but finding these rogue pages is something that is best done by the site owner/programmer, because you know what querystrings should be active and which ones should not.
5 million unknown pages just sounds like you aren't responding with the proper 404 for bad pages or you have querystring values that return a 200 with error messages and/or blank pages or pages with thin content.
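To make that concrete, here is a minimal sketch of the idea, not anything specific to this site's software. The whitelist `KNOWN_PARAMS`, the function name `status_for` and the example URLs are all assumptions made up for illustration: unknown querystring parameters, or pages with nothing to show, get a 404 instead of a 200 with a thin or blank page.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical whitelist: the only query parameters this imaginary forum understands.
KNOWN_PARAMS = {"page", "order"}


def status_for(url: str, page_has_content: bool) -> int:
    """Pick the HTTP status for a request.

    Returns 404 when the URL carries a parameter the site does not know,
    or when the resolved page would be empty; otherwise 200.
    """
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    unknown = set(query) - KNOWN_PARAMS
    if unknown or not page_has_content:
        return 404
    return 200


if __name__ == "__main__":
    print(status_for("/community/threads/example.1/?page=2", True))
    print(status_for("/community/threads/example.1/?foo=bar", True))
    print(status_for("/community/threads/example.1/?page=9999", False))
```

Whether to answer with a 404 or a redirect to the clean URL is a judgment call for the site owner; the point is only that junk URLs should not come back as a normal 200 page.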
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/25/12 8:51 PM|
Hello Lysis, our website has editorial content on our blog and it definitely isn't duplicate. Our forums do have a few thin-content pages like any forum, but most of the discussions are high quality and solve technical problems or discuss ideas. We've been aggressive in blocking / noindexing the pages that don't really add value from a search engine's perspective: Ex. Profile Pages, tag pages etc. I know that the goal has to be a 100% content-rich website, but that's a little difficult to achieve when the site gets user generated content.
We can of course go through over 50k discussions on our website and go on deleting the thin content pages. During our internal survey, people didn't want their post count to be reduced, because we've traditionally been using it as one of the signals for our members' promotions.
I however think that Google's smart enough to identify a forum and drop the thin pages on its own. I don't think it'd really push us down in rankings for some of the low value content contributed by our members.
I'm a bit more positive today. Google's given me a Birthday Gift and dropped ~4500 '404' errors from their list. At least, Google's recognizing that the site doesn't have 'errors' anymore and dropping them at a faster rate. Not that this will directly improve our rankings - but I'm positive that a cleaner, error-free site would get more Google love.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||JohnMu||11/26/12 6:35 AM|
Just to confirm what others have said, having a large number of 404s is absolutely fine and would not cause negative effects in crawling, indexing, or ranking (apart from those pages that return 404 not being indexed -- which would be normal in most cases). The number of "not selected" URLs is based on URLs that are either substantially similar or redirecting -- if you have changed your site's URL structure and have redirected those URLs, then that would be a good explanation for that. That curve would also be fine and not a signal of a problem.
One thing I noticed in the graph was the significant rise in URLs blocked by robots.txt -- can you elaborate on what you changed there? In most cases, it's not necessary to block URLs via robots.txt unless crawling them themselves is causing problems. For example, if we were trying to crawl URLs serving duplicate content, it would be better to work with redirects or rel=canonical (or just make sure we don't find those URLs while crawling) -- blocking the URLs via robots.txt would make it impossible for us to recognize that these are URLs that we don't need to focus on.
Looking at your site overall, one of the things I noticed was that it used to be ranking highly for queries like "facebook login" & "firefox", which -- while parts of your site do look good -- I imagine your site isn't the best choice for. When our algorithms are adjusted, we do try to help improve the relevance of our search results, which in the end could result in a reduction in the absolute number of visitors to a site like yours, while at the same time, working to improve the relevance of those visitors.
Finally, given that your site has a very strong focus on UGC, I'd recommend working to try to find ways to make sure that the content which you're providing is of the highest quality possible, unique, and compelling to your users. Users may be generating this content, but you as the owner of the site are publishing it, so when our algorithms review a site overall, they will also be reviewing content like that. A good way to think about this is to take a step back, find someone who is not directly associated with your website, give them some tasks to complete on your website as well as on others, and then go through the questions listed on http://googlewebmastercentral.blogspot.ch/2011/05/more-guidance-on-building-high-quality.html . For example, should you find low-quality content, you could think about adding a "noindex" robots meta tag, to allow users to continue to find the content when they're on your website, but to prevent it from being indexed and used for search. Finding such content manually can be very time-consuming, but given all of the crazy engineers you have on your site, I'm sure there's a way to help automate some of that :).
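As a starting point for that kind of automation, here is a rough sketch. It assumes total word count is an acceptable first-pass proxy for "thin" (it is a crude heuristic, not how Google's algorithms judge quality), and the `Thread` class, the `MIN_WORDS` threshold and the function names are all hypothetical. Threads under the threshold get a "noindex" robots meta tag for their page head; everything else gets nothing.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical threshold: threads with fewer total words are treated as thin.
MIN_WORDS = 50


@dataclass
class Thread:
    url: str
    posts: List[str]


def word_count(thread: Thread) -> int:
    """Total number of whitespace-separated words across all posts."""
    return sum(len(post.split()) for post in thread.posts)


def robots_meta(thread: Thread) -> str:
    """Return a noindex meta tag for thin threads, or an empty string."""
    if word_count(thread) < MIN_WORDS:
        return '<meta name="robots" content="noindex">'
    return ""


if __name__ == "__main__":
    short = Thread("/community/threads/hi.1/", ["hi", "hello there"])
    longer = Thread("/community/threads/pid.2/", ["word " * 60])
    for thread in (short, longer):
        print(thread.url, word_count(thread), repr(robots_meta(thread)))
```

A flagged list like this is better treated as a queue for human review than as a final verdict, since a short thread can still hold the one answer somebody needs.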
Hope it helps!
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/26/12 7:50 AM|
Thanks a lot for your detailed comment and review of our website. The robots.txt blocks all the pages that actually offer user generated 'thin content' (example: profile pages, extensions of URLs, forum/blog software files etc.). While we used to have them 'open to Google', I thought I'd rather block them and save some of the precious time Google spends on our website. So, I'm a bit confused whether I should allow Google to crawl those pages and determine whether they should appear in Google or just keep them blocked. I'm quite sure that the technical errors aren't there anymore - and those reported by Google were quickly removed (about 1.5 - 2 months ago).
We never 'optimised' the site for rankings; with just the basic on-page SEO best practises, we only focused on content. The keywords 'facebook login' and 'firefox' used to rank because (I remember) we had pretty interesting & unique discussions. But most of the traffic was from engineering studies related stuff, programming questions, mechanical & civil engineering related questions. However, even the most relevant discussions on the site don't rank anymore. It clearly seems to be an algorithmic penalty for the domain (I'm pretty sure it's non-Panda, non-Penguin). Traffic dropped on 4-5 September and that was the exact time when the errors went up high.
I'd appreciate your suggestions on what content to block in robots.txt & if there are any steps we can take to recover our traffic? The irrelevant keywords that used to send traffic to our site are just a fraction of the overall traffic. I believe that the 'Not Selected' URLs are a part of the problem - because in any case, we can't have that many duplicates on the site.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||Lysis||11/26/12 7:58 AM|
Something I always see messed up is that crawling and indexing are not the same. So, you would want to do what JohnMu said, which is "noindex" those thin, poor quality pages and let Google crawl them, so then the bot will remove them. After they get removed, then you can block the crawler, but for now you want to noindex them and open them up for Google to crawl and see your changes.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||JohnMu||11/27/12 6:40 AM|
The robots.txt file isn't a way to block indexing, it just blocks crawling, so I'd really recommend allowing crawling of URLs like that, and just placing a noindex robots meta tag on them instead. That helps to make it clear that there's really nothing there which is worth indexing. The robots.txt disallow leaves that open - we don't really know (because we can't check) if there's just duplicate content or something really important on that page. I wouldn't worry about the "not selected" counts here, it's normal to see them like that. Instead, I'd really try to take a step back, and to see if you can work out a "bigger picture" plan to make sure that all of the indexable content on your site is of the highest quality possible, and that overall and individually it's something that users would expect - demand even - to find in Google's search results.
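The difference between the two mechanisms can be sketched as follows. This is only an illustration using Python's standard `html.parser`; the class and function names and the returned strings are made up for the example and do not describe any Google internals. The point it shows is that a noindex tag can only be acted on if the page is allowed to be crawled.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Records whether a page carries <meta name="robots" content="...noindex...">."""

    def __init__(self) -> None:
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            directives = {d.strip().lower() for d in attr.get("content", "").split(",")}
            if "noindex" in directives:
                self.noindex = True


def has_noindex(html: str) -> bool:
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex


def what_a_crawler_learns(blocked_by_robots: bool, html: str) -> str:
    """Illustrative summary of crawl-blocking versus noindex."""
    if blocked_by_robots:
        # The page is never fetched, so its contents (and any noindex tag) stay unseen.
        return "unknown"
    if has_noindex(html):
        return "noindex"
    return "indexable"


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(what_a_crawler_learns(True, page))
    print(what_a_crawler_learns(False, page))
```

In other words, blocking a URL in robots.txt and putting noindex on it at the same time defeats the noindex, because the tag is never read.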
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||Justin Aldridge||11/27/12 7:35 AM|
John, I have to disagree and I think this is where Google is going wrong. On a community forum such as TheBigK's, how is he supposed to decide what engineers all over the world will find of interest? I'm a programmer and often I go online in search of bits of advice to help me. Sometimes it can be an apparently insignificant comment in a short thread which sorts it for me.
If seemingly "unuseful" threads such as this are removed from the index then what's the point? I'll have to use Bing instead to find out this stuff.
The owner of the forum shouldn't be deciding what is good or not so good. That's up to the searcher to decide, and we want to find that information, no matter how pointless it may seem to someone else.
What happened to an open web of free speech?
I'm sorry John but in trying to combat spam Google is taking things too far and punishing far too many excellent websites in the process.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||Lysis||11/27/12 7:39 AM|
You can do whatever you want with your site just like Google can do whatever they want with their site. It isn't all about Google. If you want to keep crap pages on your site, then do it. Google chooses not to index them. That's their prerogative.
You really think Google hasn't mined the data to figure out user habits on pages like that?
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||Justin Aldridge||11/27/12 7:49 AM|
But that's my point, you say "Google can choose to index them"....exactly. It's up to the search engine to index them and decide; it shouldn't be up to the owner of the website to decide if the thousands of UGC forum discussions are valuable or not.
Even if a thread only ever gets found twice for specific search terms, it still makes it valuable to the searchers those two times.
Why restrict access to content? It doesn't make sense.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||Lysis||11/27/12 7:58 AM|
You are not looking at it from a user's perspective. There is a lot of crap that pisses people off when they get it in their searches.
He has every right to not worry about Google and do his thing regardless of Google.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/27/12 8:34 AM|
I've cleaned up my robots.txt file and now let Google crawl those URLs. I might put a 'noindex' tag on profile pages, but after talking to several webmasters of the top forums, it turns out that 'noindex' is completely optional - and Google does a good job of omitting such pages. I have several members who've created their profiles on our site, filled them up with unique information about themselves, and want them to appear in searches. So I'm going to wait a while before putting a 'noindex' tag on them.
I asked in another thread whether 'short content' is thin content - and John said it's not. If you look at our forums, they've got a lot of useful content, but not many engineers are native English speakers. They have a ton of knowledge that they can't express in correct English, so they often leave just a comment or two - which is spot on, useful and effective. I've observed that such threads don't rank well - because Google too has limitations.
I request Google -
A domain-level algorithmic push-down is a bit too much!
I've been learning that low quality content on our website can push the entire domain down in search results. That seems to be what we're experiencing right now. A few of the highly informative threads that used to be at the top and offered quick & correct information to users are nowhere on the front page now. They have been replaced by low-value websites that have basically picked up content from our discussions OR just have some information collected from various websites.
It'd be really great if the domains aren't pushed down.
The past ~3 months have been really frustrating for me and our team; the questions on our community aren't getting answered quickly because all of us spend our time learning SEO best practices and trying to work out what really went wrong. That's really not what we want to do. We just want to have a basic website setup (in our case, it's just WordPress + community software) and focus on what we're good at.
Regarding our recovery - I'm not sure if robots.txt and focusing on quality content 'from now onwards' will help us recover our rankings. I still believe that there's something we've missed out on that'd bring us back to our earlier glory.
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/27/12 12:20 PM|
John, I think an answer to one question can help me proceed in the right direction. We all know that 'user experience' is a ranking factor and that a bad user experience can bring down Google's trust. Now, it's logical that when there are 'external' links that point to 404 errors, they won't harm the website. But what if there are several thousand 'internal' links that result in 404s?
Allow me to explain the situation: In our case, the Disqus plugin (which powers WordPress comments) internally created about 4-6 '404, Page Not Found' errors for each correct URL. For example, each domain.com/happy-url/ got clones such as domain.com/happy-url/12345/ , /happy-url/45678/ and so on.
It's logical that when Google visited our website, it suddenly saw a large number of 'Internal' URLs that were pointing to non-existent pages. As a result, Google concluded that our website has a TON of crappy URLs and marked us 'Low On User Experience', and we lost 'Trust'. Therefore we lost rankings.
The drop in Google's crawl rate (which happened around the same time as the internal 404s surged) also fits this theory perfectly. Your inputs would be valuable.
PS: I've fixed those internal bad URLs by 301 redirecting them to the correct versions, and even disabled Disqus.
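The redirect mapping described above can be sketched roughly as follows. This is a minimal Python illustration of the idea, not the rule actually deployed on the site: the function name `redirect_target` and the "trailing all-digit segment after a non-numeric slug" heuristic are assumptions made for this example. A site that legitimately uses numeric trailing segments (multi-page posts, for instance) would need to exclude those.

```python
def redirect_target(path):
    """Return the canonical path to 301-redirect to, or None if no redirect applies.

    Assumption: a bad clone is a normal permalink followed by one extra
    all-digit segment, e.g. /happy-url/12345/ -> /happy-url/.
    """
    segments = [s for s in path.split("/") if s]
    # Require a non-numeric slug before the digits so that date-style
    # paths such as /2012/11/ are left alone.
    if len(segments) >= 2 and segments[-1].isdigit() and not segments[-2].isdigit():
        return "/" + "/".join(segments[:-1]) + "/"
    return None


for path in ("/happy-url/12345/", "/happy-url/45678/", "/happy-url/", "/2012/11/"):
    target = redirect_target(path)
    if target is None:
        print(path, "-> serve as-is")
    else:
        print(path, "-> 301 to", target)
```

In a real setup the same mapping would normally live in the web server or CMS rewrite rules rather than in application code.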
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||Lysis||11/27/12 12:31 PM|
>> But what if there are several thousand 'internal' links that result in 404s?
/cry...no it doesn't affect rank
|Re: The never-ending, interesting story of 'Non-Recovery', Errors, Crawl Rates & Frustration||TheBigK||11/27/12 12:41 PM|
That's a bit disturbing. :-(