Categories: Crawling, indexing & ranking

Crawling Errors for pages that Don't Exist // Google crawling fake pages // 404 Not Found Errors ::: Auto-Response :::

Crawling Errors for pages that Don't Exist // Google crawling fake pages // 404 Not Found Errors ::: Auto-Response ::: Autocrat 12/4/10 10:26 AM
Please - do Not post your questions in this topic!
------------------------------------------------------------

This is an Auto-Response for the (Very!) common question/issue about Bad/Wrong/Non-Existent URLs being Requested/Crawled by Google.
This would include the following;

* Google requesting URLs that don't exist
* Google requesting pages that don't exist
* GoogleBot requesting URLs that don't exist
* GoogleBot requesting pages that don't exist
* Google reporting 404 Errors for URLs I don't have
* Google reporting 404 Errors for pages I don't have
* Why is Google crawling pages that don't exist
* Why is Google requesting pages that don't exist
* Why is Google crawling URLs that don't exist
* Why is Google requesting URLs that don't exist
* Why is GoogleBot crawling pages that don't exist
* Why is GoogleBot requesting pages that don't exist
* Why is GoogleBot crawling URLs that don't exist
* Why is GoogleBot requesting URLs that don't exist
* Where is Google getting these bad URLs from
* Where is Google getting these bad Pages from
* I don't have these pages - why is Google trying to crawl them
* I don't have these pages - why is GoogleBot trying to crawl them
* Page not found Errors - for URLs that are wrong
* Page not found Errors - for URLs that don't exist
* Page not found Errors - for pages that don't exist
* Crawling Error reported - URL doesn't exist - why is Google crawling it

etc. etc. etc.

---------------------------------------------------------------------


NOTE:

This is a "general auto-response" post.

This is Not a Topic for discussion;

It is a point of reference to save having to type the same answer repeatedly, due to the sheer number of times this question is asked, and is meant as an aid for people who don't seem to search/read the various other posts regarding this topic.
Thank you for taking the time to read this Auto-Response.

Please - do Not post your questions in this topic!


Appendment  (09/12/2010):  No 404s - but the URLs are wrong/don't exist!
Re: Crawling Errors for pages that Don't Exist // Google crawling fake pages // 404 Not Found Errors ::: Auto-Response ::: Autocrat 12/4/10 10:28 AM
Please - do Not post your questions in this topic!
-----------------------------------------------------------


=============   Google Crawls   =============

Well - that's the basic principle of a web-crawler ... it crawls things.
This means that G has obtained a URL (somehow) and noted it down ... then tried to crawl it at some point.


=============   How does it get URLs?   =============

Primarily - from Links ... those on your site and on other sites.

Additionally, there are Sitemaps, and it may try to "extract" a URL from code such as JavaScript, Frames etc.

Further still - it may attempt to "guess" at URLs ... if G sees you have 20 pages in sequence (page1, page2, page3) it may go looking to see if there is a page21 and page22 etc.
It may also be looking at any Forms on your site ... as in some cases, Googlebot may use a form to explore your site.


Something else to keep in mind - Google Remembers things!
So if there "used to be" something at that URL - G may well remember it, and be trying to revisit it.


=============   But 404's are BAD!   =============

Actually - directly, no they are Not!
Google is not going to penalise/punish you/your site because it sees a 404.

Indirectly - it may have a knock-on effect, or several.
It means that G is wasting requests to your site/domain/server on URLs that don't exist, rather than crawling those that do,
which may slow down how often other URLs are crawled (this is guessed - Not confirmed!).

Further - if a URL used to have content, and is now gone, then it may mean the value that URL gave to the rest of your site is now being lost/wasted.
For more info, please see this topic;
   Changed File Name/Path/Extension/Type // Changed DomainName // Renamed/Moved Files ::: Auto-Response :::
   http://www.google.com/support/forum/p/Webmasters/thread?tid=10a2c9f4b92fa76d&hl=en


=============   So what about "these" Bad URLs???   =============

Well - you can start looking for info to checkout the bad URLs, and where they are coming from.

--- Google WebMaster Tools (GWMT/GWT) ---
You are seeing Crawling Errors.
In the Crawling Error section, it gives a little table showing the bad URLs.
On the right - it may show you a link (linked from).
Click it!
Congratulations - that will show you where the links originate.

--- Xenu Link Sleuth ---
You can run this tool on your site (http://home.snafu.de/tilman/xenulink.html).
It will crawl your URLs, look at your pages, and if you are linking to the bad URLs - it should show them as being 404s.
You can then right-mouse-click on the bad listings, go to properties.
Voila - it will show you what URLs are linking to it.

--- Server Access Logs ---
You can pore over the Server's Raw Access Logs (should be available via your Hosting control panel - if not, then ask your host - if you have no access logs ... Get a New Host Now!).
Look for bad response codes (do a search for "404").
Look for the Referrer URL.
You may see it as your own site, or someone else's.
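The log check above can be scripted. Here's a minimal sketch, assuming your server writes the common Apache/nginx "combined" log format (the log lines below are made-up examples, not from any real site):

```python
import re

# Combined Log Format: IP - - [date] "METHOD /path HTTP/1.1" status size "referrer" "agent"
LOG_RE = re.compile(r'"[A-Z]+ (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)"')

def find_404s(log_lines):
    """Return (requested_url, referrer) pairs for every 404 in the log."""
    hits = []
    for line in log_lines:
        m = LOG_RE.search(line)
        if m and m.group("status") == "404":
            hits.append((m.group("url"), m.group("referrer")))
    return hits

# Hypothetical sample lines - in practice, read your real access log file.
sample = [
    '1.2.3.4 - - [04/Dec/2010:10:26:00 +0000] "GET /page21 HTTP/1.1" 404 512 "http://other-site.example/links.html" "Mozilla/5.0"',
    '1.2.3.4 - - [04/Dec/2010:10:26:05 +0000] "GET /page1 HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]
for url, ref in find_404s(sample):
    print(url, "<-", ref)   # the referrer tells you where the bad link lives
```

The referrer column is the important part - it tells you which page (yours or someone else's) is handing out the bad URL.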


=============   But what can I do about it?   =============

Well, that depends - entirely on how the bad URLs are occurring.

--- External Sources ---
If the Bad URLs are originating elsewhere (on some other site) you have 3 main options;
1) You can contact the owners and ask them to correct it (or if you submitted it, go and edit it, and be more careful next time!)
2) You can set up server/scripted 301 Redirects (go read the "Changed URL" topic from above)
3) You can simply live with it
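Option 2 boils down to: known bad URL in, 301 + correct URL out, proper 404 for everything else. Here's a minimal sketch of that logic (the URL map is hypothetical; in practice you'd usually do this in server config such as Apache mod_rewrite rather than application code):

```python
# Hypothetical map of known bad/old URLs to their correct replacements.
REDIRECTS = {
    "/old-page.html": "/new-page.html",
    "/page21": "/page20",
}

def respond(path):
    """Return (status, location) for a requested path: a 301 if we know the
    correct URL, otherwise a proper 404 so bots get a clear 'gone' signal."""
    if path in REDIRECTS:
        return 301, REDIRECTS[path]
    return 404, None

print(respond("/old-page.html"))   # known bad URL -> permanent redirect
print(respond("/no-such-page"))    # unknown URL -> honest 404
```

The 301 (permanent) matters: it tells G to transfer to the new URL and stop requesting the old one, which a 302 does not.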

--- Internal Sources ---
If the Bad URLs originate from your own site, then you have the following options;
1) Go and fix the flaming code! (Sorry, but kind of obvious don't you think?)
2) Revise how you use/supply URLs (use Absolute instead of Relative, less room for screwups!)
3) Test each link on your site when you make them!
4) Go to the page that seems to be the source of the error/bad link ... and use the TAB key ... keep an eye on the Status (bottom right of the browser) to see the URL that shows up ... you may tab through and find the dodgy link.  You may want to do it with and without JS enabled (as sometimes it's the JS that causes the issue).
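If tabbing through links by hand gets tedious, you can dump every URL a page hands out and eyeball the list for dodgy ones. A rough sketch using Python's stdlib HTML parser (the sample page is made-up for illustration):

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect every href/src on a page so malformed URLs
    (doubled paths, stray fragments etc.) stand out in one list."""
    def __init__(self):
        super().__init__()
        self.urls = []
    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(value)

# Hypothetical page source - in practice, feed it your real page's HTML.
page = '<a href="/page1.html">ok</a> <a href="page2.html/page3.html">dodgy?</a>'
collector = HrefCollector()
collector.feed(page)
print(collector.urls)   # -> ['/page1.html', 'page2.html/page3.html']
```

Anything that doesn't look like a URL you'd type yourself is a candidate for the source of the crawl errors.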


=============   I've looked - but cannot find the fault/error!  =============

Well, not being funny - go look again, but harder/properly :D

Common causes?
* Invalid code - if you don't close up your tags correctly, you may generate incorrect URLs.
* Relative URLs - these are a common problem, and may result in attaching part of the path to a previous URL
* Incorrect Base Href element - this would result in attaching the relative path to the incorrect root
* Incorrect Redirects - if you have ReWrites or Redirects, make sure you have them working to correct URLs
* Incorrect Canonical Link Element - if you have the wrong URL in the href of the CLE, then G is gonna get it wrong as well
* JavaScript - you should either make it External, or wrap it in CDATA (inside the opening/closing script tag, use  //<![CDATA[  your JS  //]]>)
* Check your Form Action URL - make sure you aren't letting G get the wrong idea

Common fix?
* Start using Absolute/Full URLs (use  'href="http://blah"'  rather than  'href="/blahblah"' )
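To see why relative URLs are the usual culprit: browsers (and Googlebot) resolve a relative href against the URL of the page it sits on, so the same href means different things on different pages. A quick demonstration with Python's stdlib resolver (example.com URLs are placeholders):

```python
from urllib.parse import urljoin

# A relative href resolves against the page it appears on:
print(urljoin("http://example.com/dir/page1.html", "page2.html"))
# -> http://example.com/dir/page2.html

# The SAME href on a page at a different depth resolves differently -
# this is how doubled/wrong paths creep in when pages share templates:
print(urljoin("http://example.com/dir/sub/page1.html", "page2.html"))
# -> http://example.com/dir/sub/page2.html

# An absolute URL means the same thing wherever it appears:
print(urljoin("http://example.com/dir/sub/page1.html", "http://example.com/dir/page2.html"))
# -> http://example.com/dir/page2.html
```

The same resolution rule is why a wrong Base Href element breaks every relative link on the page at once.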


=============   It's caused by an External site, with strange parameters?  =============

Well, if it's some other site linking to you, with additional bits attached,
try the following topic;
   Google indexing Strange URLs - URLs with strange Parameters being indexed // ?referer= // ?ref= ::: Auto-Response :::
   http://www.google.com/support/forum/p/Webmasters/thread?tid=4788b951500733bf&hl=en


=============   I corrected it/them - Why is Google still trying it/them!?!?!?  =============

I did say that G remembers URLs!
Unless you have set up Redirects - then G is gonna try it for some time until it gets the hint (and GoogleBot ain't that quick to catch on at times).


=============   I corrected it/them - Why is Google still showing the Error(s) !?!?!?  =============

Because the Errors in GWMT hang around for some time (up to 4 weeks).
Only after that time period (so long as it hasn't re-encountered it/them), will the error disappear.

If the error(s) occurs again - then the date/time will update, and you have to wait another 4 weeks (approx).

---------------------------------------------------------------------


[ AGAIN - I have to post this reminder due to the sheer number of people that are hard of reading ]



NOTE:

This is a "general auto-response" post.

This is Not a Topic for discussion;

It is a point of reference to save having to type the same answer repeatedly, due to the sheer number of times this question is asked, and is meant as an aid for people who don't seem to search/read the various other posts regarding this topic.
Thank you for taking the time to read this Auto-Response.

Please - do Not post your questions in this topic!
Re: Crawling Errors for pages that Don't Exist // Google crawling fake pages // 404 Not Found Errors ::: Auto-Response ::: Autocrat 12/8/10 7:16 PM
Please - do Not post your questions in this topic!
---------------------------------------------------------------------

[ Appendment ]

=============   No 404s - but the URLs are wrong/don't exist!  =============

This happens in some cases.
Basically - rather than giving a correct 404 response for a URL that doesn't exist, people (and bots) are presented with;
a) a page with content and a 200 response.
b) a page with content and a 200 response ... but often with broken CSS/images etc.
c) a temporary redirect response to some other page (such as the homepage).

This can happen for several reasons.

In some cases, the servers are set up so as to not give a proper 404 ... that should be changed.
Serving a 302 to some other page (such as the homepage) isn't overly smart ... and can cause issues (duplication; and in the case of robots.txt files, if the file is not found and HTML is served instead, Googlebot may not crawl!)

In other cases it's due to Scripts ... with the script accepting any old parameters (correct or fake), and loading up the same content (very common!).
In such cases, coding a 404 may be hard work (esp. if you don't do programming) ... but a Canonical Link element would really help!
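A quick way to reason about cases a) to c): request a deliberately made-up URL and look at the status code plus where you ended up. Here's a rough classification helper (the function and its labels are mine, not a standard; you'd feed it the status and final path from a real request):

```python
def classify_response(requested_path, status, final_path):
    """Rough heuristic: classify what a server returned for a URL that
    should NOT exist. final_path is where any redirect chain ended up."""
    if status == 404:
        return "correct 404"
    if status == 200 and final_path == requested_path:
        return "soft 404 (200 with content)"          # cases a) and b)
    if status in (301, 302) or final_path != requested_path:
        return "redirected away (e.g. to homepage)"   # case c)
    return "unknown"

print(classify_response("/made-up-garbage", 200, "/made-up-garbage"))
# -> soft 404 (200 with content)
```

If a bogus URL doesn't come back as "correct 404", that's the setup this section is describing, and a fix (proper 404, or at minimum a Canonical Link element) is worth the effort.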

For others - it could be bad/dodgy ReWrites/Redirects, taking partial URLs and tacking them on incorrectly.
Double check things when you create Redirects/ReWrites.


Similar effects may be achieved with incorrect relative URLs/incorrect base href elements being used.

Same ways to check as posted before: check the server logs for the bad requests and look at the referrers, and/or run Xenu Link Sleuth and see if you can see what links to those URLs.


[ /Appendment ]