Categories: Crawling, indexing & ranking

robots.txt file isn't working! need to stop the bot from crawling certain directories

robots.txt file isn't working! need to stop the bot from crawling certain directories Heather Jacobsen 3/23/13 2:39 PM

I've read the FAQs and searched the help center. 

I tried blocking specific directories (/store and /gluten-free-store), and I have also asked the crawler not to crawl these directories in my robots.txt file (www.stuffed-pepper.com/robots.txt). 

Why isn't it working?

Thanks so much for your help.

Heather (newbie)

Re: robots.txt file isn't working! need to stop the bot from crawling certain directories Ashley 3/23/13 3:02 PM
What the heck is this when I try to access your page?!?!

Checking your browser before accessing www.stuffed-pepper.com.

This process is automatic. Your browser will redirect to your requested content shortly.

Please allow up to 5 seconds...


That's not good...

Plus, you seem to have blocked a URL on the main site, but you're allowing your development subdomain to be indexed (bad news).

You'll see that you are not blocking that URL in the dev subdomain's robots.txt file:
http://dev.stuffed-pepper.com/robots.txt
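
If you don't want the dev subdomain crawled at all, the usual fix is a robots.txt on that subdomain that blocks everything, something like:

User-agent: *
Disallow: /

(That asks all well-behaved crawlers to stay out of the entire subdomain.)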

Re: robots.txt file isn't working! need to stop the bot from crawling certain directories Heather Jacobsen 3/23/13 3:16 PM
Hi Ashley,

It's from cloudflare.com, which helps to stop spammers and "bad" bots.

That isn't the issue; the crawlers are getting through. The subdomain may be allowed to be indexed, but that doesn't seem to be where most of the traffic is going. I have no problem disallowing it. Do I just stop it the same way, in the robots.txt file?

Even so, the robots.txt file does not seem to be working correctly. Can you help? I would be so grateful!

Thanks so much.

Heather
Re: robots.txt file isn't working! need to stop the bot from crawling certain directories fedesasso 3/23/13 4:17 PM
Your robots.txt syntax is incorrect: the "User-agent: Googlebot" line appears twice, each time starting its own group.

The robots.txt rules state that "Only one group of records is valid for a particular crawler", so only one of your two Googlebot groups will actually be applied.

Hope this helps

P.S.: Honestly, given what Ashley pointed out, I don't know how crawlers can get through at all; your site is returning 403 or 503 or 302 or 200 for your robots.txt. But I assumed they do get through, as you said, and answered accordingly.
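
P.P.S.: If you want to see the status code yourself, any HTTP client will show it; with curl, for example (assuming you have it installed), a quick check would be:

curl -I http://www.stuffed-pepper.com/robots.txt

The first line of the response (e.g. "HTTP/1.1 403 Forbidden") is the status code, though Cloudflare may answer differently depending on who is asking.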
Re: robots.txt file isn't working! need to stop the bot from crawling certain directories Heather Jacobsen 3/23/13 4:35 PM
Thank you so much, fedesasso.

So are you saying that instead of:

User-Agent: Googlebot
Disallow: /gluten-free-store/

User-Agent: Googlebot
Disallow: /store/

I should just have:

User-Agent: Googlebot
Disallow: /gluten-free-store/
/store/

Or do I have to put "Disallow:" twice?

Please excuse my ignorance!

Heather

P.S. Cloudflare is able to determine what type of entity is trying to get through to the site, I think based on its IP address. It can identify known spammers, zombie bots, and crawlers. I don't know how it actually works, but it's kind of cool, if you want to check it out.

I don't know what you mean by: "your site is returning 403 or 503 or 302 or 200 for your robots.txt"


Re: robots.txt file isn't working! need to stop the bot from crawling certain directories Ashley 3/23/13 8:08 PM
I would remove this service from Cloudflare - not only is it troublesome technically, it's horrific for the user experience. 
Re: robots.txt file isn't working! need to stop the bot from crawling certain directories fedesasso 3/24/13 2:14 AM
Dear Heather Jacobsen,
By now you have probably already googled the answer, but yes, you have to put "Disallow" twice in your case (once for every specified path).
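
In other words, a single Googlebot group with one Disallow line per path, something like:

User-Agent: Googlebot
Disallow: /gluten-free-store/
Disallow: /store/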

I strongly suggest you test your robots.txt with Google Webmaster Tools, using
- "Fetch as Google" first, to see if Cloudflare permits Googlebot to retrieve the file correctly
- "Blocked URLs" afterward, to see if the results are what you intended
I would also consider disabling the service, as Ashley suggested. By the way, your pages are already slow enough without this added burden.
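
For a rough command-line check as well, you can fetch the file while identifying as Googlebot (only an approximation, since Cloudflare may verify the real Googlebot by its IP address):

curl -I -A "Googlebot/2.1 (+http://www.google.com/bot.html)" http://www.stuffed-pepper.com/robots.txt

If that returns anything other than 200 OK, there's a good chance Googlebot can't read your rules either.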

Regards
Re: robots.txt file isn't working! need to stop the bot from crawling certain directories Heather Jacobsen 3/24/13 8:48 AM
Hi fedesasso,

So I have now put in:

User-Agent: Googlebot
Disallow: /gluten-free-store/
Disallow:
/store/

And the crawler cannot index /gluten-free-store/, but for some reason it still accesses /store. Isn't that strange?

I did the robots.txt test on the webmaster page (that's how I know). I also did Fetch as Google and there are no issues there.  

Re: robots.txt file isn't working! need to stop the bot from crawling certain directories fedesasso 3/24/13 12:23 PM
Hi,

> And the crawler cannot index /gluten-free-store/, but for some reason it still accesses /store. Isn't that strange?

You mean you tested a /store/... URL with GWT "Blocked URLs" and the tool didn't say it's blocked?
If so, yes, it's weird.

If that's not what you meant, keep in mind that Google keeps the old robots.txt cached for at least 24 hours, and can in some cases keep it for longer when the server returns a status code other than 200 OK or 404 Not Found for the robots.txt file.

Hope this helps
Re: robots.txt file isn't working! need to stop the bot from crawling certain directories Heather Jacobsen 3/24/13 1:34 PM
I mean that I tested a /store/ URL with "Blocked URLs" and the tool said that it's allowed. But when I tested /gluten-free-store, the tool says it's blocked. Why wouldn't they both be blocked?

Maybe I need to wait 24 hours. I'll check back tomorrow.

Thanks for your patience with a newbie!

:)

Re: robots.txt file isn't working! need to stop the bot from crawling certain directories fedesasso 3/24/13 2:04 PM
Just to clear up any doubt:
when you tested in GWT, was the robots.txt version it showed the updated one?

If it wasn't, that would explain the odd behavior.

To test the new version, you can also copy and paste it into GWT, without waiting for it to update. Of course, that wouldn't make Google see the updated version, but you could at least test it.
Hope this helps
Re: robots.txt file isn't working! need to stop the bot from crawling certain directories webado 3/24/13 2:17 PM
The robots.txt file content will only become known to Webmaster Tools 24 hours later, not on the same day you changed it.

You can also request directory removal for /store/ and /gluten-free-store/ from Webmaster Tools if any URLs from those directories are currently indexed in Google.


Do get rid of Cloudflare hosting.

Check the robots.txt file and the rationale behind blocking whatever you are blocking.

Re: robots.txt file isn't working! need to stop the bot from crawling certain directories Ivo van der Veeken 3/24/13 2:30 PM
@Webado and others: Does Google hard refresh the robots.txt file? I just clicked the link here again, and I didn't see an update. If Google doesn't, it might take even longer to see the changes.
Re: robots.txt file isn't working! need to stop the bot from crawling certain directories webado 3/24/13 2:37 PM
It takes a day for Webmaster Tools to get the fresh copy. You cannot rush it.

But in normal crawling, the robot picks it up every time it starts a crawl of a site.
Re: robots.txt file isn't working! need to stop the bot from crawling certain directories fedesasso 3/24/13 2:38 PM
@Ivoos,
it's even trickier.
Google caches robots.txt for at least 24 hours, but possibly even longer, for example when it can't access the file for some time... but the version shown in GWT seems to obey another cache, and can take even longer to get updated.
Sometimes I have managed to see an updated version there after fetching it as Google and submitting it to the index, but I can't swear to the cause and effect.
Besides, as I said, you can use the tool to test a newer version by simply pasting the new content into the textarea.
Hope this helps
Re: robots.txt file isn't working! need to stop the bot from crawling certain directories JohnMu 3/29/13 3:26 AM
Hi Heather

Taking a step back - what is it that you're trying to achieve with the robots.txt file in your case? 

First off, one thing that is likely wrong with your robots.txt file is that crawlers obey the most specific user-agent line, not all of them. So for Googlebot, that would be only the section for Googlebot, not the section for "*". The "*" section is very explicit in your case, so you'd probably want to duplicate that. Past that, why is there a section for Googlebot? Are these URL patterns that you want to disallow for all search engines perhaps?
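
To illustrate with a simplified sketch (not your actual file): if a robots.txt contains

User-agent: *
Disallow: /search/

User-agent: Googlebot
Disallow: /store/

then Googlebot matches the more specific Googlebot group and ignores the "*" group entirely, so /search/ would still be crawlable by Googlebot unless that line is repeated in the Googlebot group.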

Taking another step back ... the "*" section is likely much more complex than you really need. When possible, I'd really recommend keeping the robots.txt file as simple as possible, so that you don't have trouble with maintenance and so that it's really only disallowing resources that are problematic when crawled (or when their content is indexed). My guess is that only /search/ is really problematic (since it puts a significant load on the server), but you'd know your site best. 

Finally, keep in mind that the robots.txt file doesn't control indexing. If you want to prevent URLs from being indexed, you would need to allow crawling, and serve the appropriate robots meta tag (or x-robots-tag header). 
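
For example, to keep a page crawlable but out of the index, you would serve it with a meta tag in the HTML:

<meta name="robots" content="noindex">

or send the equivalent HTTP header:

X-Robots-Tag: noindex

(How you set that header depends on your server; on Apache with mod_headers, it would be a line like: Header set X-Robots-Tag "noindex".)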

So in short, can you explain what you're trying to do? Then we can probably help you to find a good solution :)

Cheers
John
Re: robots.txt file isn't working! need to stop the bot from crawling certain directories Heather Jacobsen 3/31/13 12:18 PM
John, 

Thank you so much for your question.

I am trying to block Google from crawling my /store and /gluten-free-store directories because there were too many automated pages under these directories, thanks to the Amazon store module. I have now deleted my Amazon store, but the crawler keeps going back to pages under these directories that no longer exist. Because of this, the crawler is eating up a ton of resources on my site. So I removed both URLs, and also tried to get the robots.txt to work. I saw that the crawler was still crawling /gluten-free-store?... pages, so I added the ? into the robots.txt as well. Still not working.
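
(In case it helps, the line I added looks roughly like this, give or take the exact pattern:

Disallow: /gluten-free-store?

which, as I understand it, should block any URL beginning with that prefix.)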

I paused Cloudflare, even though I am not sure why people don't like it. After I paused it, I had a bunch of spammers create new accounts. But my resource usage didn't change, and neither did the crawling by robots. So I started it up again.

Finally, I resubmitted my site to Google, in the hope that the robots will no longer index these pages. That was a couple of days ago. As of now, the robots still seem to be crawling /gluten-free-store?*

Any advice you can give me will be SO appreciated.

Thanks again.

Heather
Re: robots.txt file isn't working! need to stop the bot from crawling certain directories brotor 4/7/13 12:45 PM
That's a good point. If we can't control crawl resources with robots.txt (anymore), how can we? In Webmaster Tools, with parameter handling?
I see a lot of websites that produce millions of URLs because they use parameters.

Re: robots.txt file isn't working! need to stop the bot from crawling certain directories webado 4/7/13 12:49 PM
The robots.txt file works provided it's 100% correct.