Re: Crawling Errors for pages that Don't Exist // Google crawling fake pages // 404 Not Found Errors ::: Auto-Response :::
Dec 4, 2010 10:28 AM
Posted in group:
Crawling, indexing & ranking
Please - do Not post your questions in this topic! -----------------------------------------------------------
============= Google Crawls =============
Well - that's the basic principle of a web-crawler ... it crawls things. This means that G has obtained a URL (somehow) and noted it down ... then tried to crawl it at some point.
============= How does it get URLs? =============
Primarily - from Links ... those on your site and on other sites.
Still - it may attempt to "guess" at URLs ... if G sees you have 20 pages in sequence (page1, page2, page3) it may go looking to see if there is a page21 and page22 etc. It may also be looking at any Forms on your site ... as in some cases, Googlebot may use a form to explore your site.
Something else to keep in mind - Google Remembers things! So if there "used to be" something at that URL - G may well remember it, and be trying to revisit it.
============= But 404's are BAD! =============
Actually - directly, no they are Not! Google is not going to penalise/punish you/your site because it sees a 404.
Indirectly - it may have a knock-on effect, or several. It means that G is using up requests to your site/domain/server on URLs that don't exist, rather than on those that do, which may slow down how often other URLs are crawled (this is guessed - Not confirmed!).
============= So what about "these" Bad URLs??? =============
Well - you can start looking for info to check out the bad URLs, and where they are coming from.
--- Google WebMaster Tools (GWMT/GWT) --- You are seeing Crawling Errors. In the Crawling Error section, it gives a little table showing the bad URLs. On the right - it may show you a link (linked from). Click it! Congratulations - that will show you where the links originate.
--- Xenu's Link Sleuth --- You can run this tool on your site (http://home.snafu.de/tilman/xenulink.html). It will crawl your URLs, look at your pages, and if you are linking to the bad URLs - it should show them as being 404s. You can then right-mouse-click on the bad listings, go to properties. Voila - it will show you what URLs are linking to it.
--- Server Access Logs --- You can pore over the Server's Raw Access Logs (should be available via your Hosting control panel - if not, then ask your host - if you have no access logs ... Get a New Host Now!). Look for bad response codes (do a search for "404"). Look for the Referrer URL. You may see it as your own site, or someone else's.
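If the logs are too big to eyeball, a short script can do the "search for 404, then look at the Referrer" step for you. This is only a rough sketch: it assumes your host writes the common "combined" log format (as Apache and Nginx often do by default), and the sample lines and the find_404s name are made up for illustration. Adjust the pattern if your logs look different.

```python
import re
from collections import Counter

# Assumes the "combined" access log format; other formats need a different pattern.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)"'
)

def find_404s(lines):
    """Count (requested path, referrer) pairs for every 404 response."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("status") == "404":
            hits[(match.group("path"), match.group("referrer"))] += 1
    return hits

# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '66.249.66.1 - - [04/Dec/2010:10:28:00 +0000] "GET /page21 HTTP/1.1" 404 512 "http://www.example.com/page20" "Googlebot/2.1"',
    '66.249.66.1 - - [04/Dec/2010:10:28:05 +0000] "GET /page1 HTTP/1.1" 200 2048 "-" "Googlebot/2.1"',
    '10.0.0.7 - - [04/Dec/2010:10:29:00 +0000] "GET /page21 HTTP/1.1" 404 512 "http://www.example.com/page20" "Mozilla/5.0"',
]

# Most frequent bad URL / referrer pairs first.
for (path, referrer), count in find_404s(SAMPLE_LOG).most_common():
    print(count, path, "<-", referrer)
```

To run it on a real log, pass an open file instead of SAMPLE_LOG (a file object yields lines, so find_404s(open(...)) works as-is). A referrer of "-" means none was sent.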
============= But what can I do about it? =============
Well, that depends - entirely on how the bad URLs are occurring.
--- External Sources --- If the Bad URLs are originating elsewhere (on some other site) you have 3 main options;
1) You can contact the owners and ask them to correct it (or if you submitted it, go and edit it, and be more careful next time!)
2) You can set up server/scripted 301 Redirects (go read the "Changed URL" topic from above)
3) You can simply live with it
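For option 2, the logic of a "scripted" 301 is simply a lookup from the bad URL to the right one. Here is a minimal sketch as a plain WSGI callable, assuming a Python-served site; the REDIRECTS table, its paths and the redirect_app name are hypothetical. On most hosts you would do the same thing in the server config instead - the idea is identical.

```python
# Hypothetical map of known-bad incoming paths to the pages they should reach.
REDIRECTS = {
    "/old-page.html": "/new-page.html",
    "/page21": "/page20",
}

def redirect_app(environ, start_response):
    """Answer known-bad paths with a permanent redirect, everything else with a 404."""
    path = environ.get("PATH_INFO", "")
    target = REDIRECTS.get(path)
    if target is not None:
        # 301 tells crawlers the move is permanent.
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not Found"]

def show(status, headers):
    """Stand-in for the server's start_response, just prints what was sent."""
    print(status, dict(headers))

redirect_app({"PATH_INFO": "/page21"}, show)
redirect_app({"PATH_INFO": "/no-such-page"}, show)
```

Only redirect bad URLs that have a sensible destination; anything else is better left as an honest 404.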
--- Internal Sources --- If the Bad URLs originate from your own site, then you have the following options;
1) Go and fix the flaming code! (Sorry, but kind of obvious don't you think?)
2) Revise how you use/supply URLs (use Absolute instead of Relative, less room for screwups!)
3) Test each link on your site when you make them!
4) Go to the page that seems to be the source of the error/bad link ... and use the TAB key ... keep an eye on the Status (bottom right of the browser) to see the URL that shows up ... you may tab through and find the dodgy link. You may want to do it with and without JS enabled (as sometimes it's the JS that causes the issue).
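If tabbing through a long page is too slow, you can list every link on it in one go. The sketch below uses only the Python standard library; the page URL and HTML are made-up examples, and LinkLister / list_links are just names picked for illustration. Note that it does Not run JavaScript, so it only covers the "without JS" half of the check.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkLister(HTMLParser):
    """Collect every <a href> on a page, resolved against the page URL."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Keep the raw href and the URL a browser/crawler would request.
                self.links.append((href, urljoin(self.page_url, href)))

def list_links(page_url, html):
    parser = LinkLister(page_url)
    parser.feed(html)
    return parser.links

# Made-up page, for illustration only.
SAMPLE_HTML = '<p><a href="/contact">Contact</a> <a href="page2.html">Next</a></p>'
for raw, resolved in list_links("http://www.example.com/blog/page1.html", SAMPLE_HTML):
    print(raw, "->", resolved)
```

Compare the resolved URLs against the bad URLs from your Crawl Errors; a match tells you which href is the dodgy one.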
============= I've looked - but cannot find the fault/error! =============
Well, not being funny - go look again, but harder/properly :D
Common causes?
* Invalid code - if you don't close up your tags correctly, you may generate incorrect URLs.
* Relative URLs - these are a common problem, and may result in attaching part of the path to a previous URL
* Incorrect Base Href element - this would result in attaching the relative path to the incorrect root
* Incorrect Redirects - if you have ReWrites or Redirects, make sure they point to the correct URLs
* Incorrect Canonical Link Element - if you have the wrong URL in the href of the CLE, then G is gonna get it wrong as well
* Inline JavaScript (inside the opening/closing script tag, use //<![CDATA[ your JS //]]>)
* Check your Form Action URL - make sure you aren't letting G get the wrong idea
Common fix?
* Start using Absolute/Full URLs (use 'href="http://blah"' rather than 'href="/blahblah"')
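To see why Relative URLs and a wrong Base Href cause so much trouble, it helps to watch how they get resolved. A small sketch using Python's urljoin, which follows the standard URL resolution rules; the example.com URLs are made up, and I am assuming crawlers resolve links the same standard way.

```python
from urllib.parse import urljoin

# Made-up page address, for illustration only.
page = "http://www.example.com/shop/widgets/index.html"

# Relative href with no leading slash: appended to the current directory.
doubled = urljoin(page, "shop/widgets/blue.html")
print(doubled)   # http://www.example.com/shop/widgets/shop/widgets/blue.html

# Root-relative href: resolved from the site root.
rooted = urljoin(page, "/shop/widgets/blue.html")
print(rooted)    # http://www.example.com/shop/widgets/blue.html

# Absolute href: the same wherever the page lives.
absolute = urljoin(page, "http://www.example.com/shop/widgets/blue.html")
print(absolute)  # http://www.example.com/shop/widgets/blue.html

# A wrong base href (hypothetical) drags every relative link along with it.
wrong_base = "http://www.example.com/old-shop/"
misplaced = urljoin(wrong_base, "widgets/blue.html")
print(misplaced) # http://www.example.com/old-shop/widgets/blue.html
```

The first case is the classic "part of the path attached to a previous URL" error; the absolute form is the only one that cannot be bent by the page location or the base.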
============= It's caused by an External site, with strange parameters? =============
[ AGAIN - I have to post this reminder due to the sheer number of people that are hard of reading ]
This is a "general auto-response" post.
This is Not a Topic for discussion;
It is a point of reference to avoid having to type the same answer repeatedly due to the sheer number of times this question is asked and is meant as an aid for people that don't seem to search/read the various other posts regarding this topic.
Thank you for taking the time to read this
Please - do Not post your questions in this topic!