|Bulk content removal request||NewsBlaze||4/28/13 2:38 PM|
I've read the FAQs and searched the help center.
My URL is: http://goo.gl/qtUdj
Over the past 18 months, I have deleted around 2 million URLs from the website.
GWMT still shows we have 800,000 pages in the index, even though for the past 5 months there should only have been around 100,000.
The googlebot is still requesting those pages, even though the remaining pages of that type (press releases) are now on subdomains.
They were changed over to subdomains around December 4th last year.
I have been trying to help get them out of the index in a number of ways.
Tried the robots.txt method, using patterns.
- That didn't appear to work
- GWMT just complained about not being able to access them
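For anyone trying the same thing, a pattern-based block in robots.txt looks roughly like this (the /pressrelease/ path is a made-up stand-in; the real patterns aren't shown in this thread):

```
User-agent: *
Disallow: /pressrelease/
```

Note that blocking crawl this way doesn't remove pages that are already indexed; it only stops googlebot from re-reading them.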
Tried requesting them through the URL removal tool
- can only do about 1,000 per day, but it is better than nothing,
even though I realize it's hard to do a few million that way.
Tried changing the response code to 410 (after I saw Matt Cutts talk about that)
- I don't know if it is working, but I see googlebot requesting the same pages multiple times.
Please can you help me to remove these and clean up the Google index.
The last thing I want is pages in the index that no longer exist, disappointing visitors.
Just to let you know, we have signed contracts with most of the large PR distribution companies to display their releases in full.
We have been crushed by Panda (I assume, because nobody can really confirm that) and I'm desperately trying to stop that happening.
(but that is for another post I will make )
We also have our own news, from 200+ writers and 3 editors - some of those writers also send their news to other sites.
What I want to do here is the following.
1. Try to get your assistance with a bulk removal of deleted pages from the index,
because keeping them there isn't good for Google, readers, or me.
2. I don't care if this is not causing a problem for my site; it is readers I care about, as it always has been,
and I don't want to attract people only to serve them a "missing page" page.
I can give you the patterns, if that is any help.
A long time ago, I tried to get assistance in these forums and got nothing useful,
so I've been trying to do this on my own for the past 24+ months and I'm worn down, because nothing I've done has helped,
even though I've done 50+ projects, as well as trying to keep the site going.
I'm not complaining here, just want to let you know why I didn't persist with that, when it was obvious I wasn't going to get any help.
|Re: Bulk content removal request||NewsBlaze||4/28/13 5:15 PM|
The press releases for business readers are now all on separate subdomains. - 4 months ago
Those full releases are all set as "noindex" - 4 months ago.
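(For anyone following along, the noindex marker is a robots meta tag in each page's head; Google drops the page from the index the next time it crawls it and sees this:)

```html
<meta name="robots" content="noindex">
```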
If you want to step up the googlebot fetching, the server can probably handle 500,000+ requests per day
(assuming that the 410 return code will result in those pages being dropped)
I am not aware of any manual spam action on the domain.
|Re: Bulk content removal request||NewsBlaze||5/1/13 3:50 PM|
OK something is happening.
This note is for anyone else wanting bulk removal.
I can't guarantee that this is the answer, but as nobody seems to know or care, this is the best clue I have.
Our latest index status shows 3 days of reductions from over 800,000 indexed pages; each day was a reduction of 25,000 pages.
This is good, because we only have just over 100,000 pages, since we deleted all the others.
Then the most recent drop, day #4, was 100,000.
How it was done:
1. We tried using robots.txt to tell googlebot that those pages (based on a pattern) were not to be indexed.
But because they were already in the index, maybe googlebot thought it would just ignore the rule and leave them in the index?
- In any case, after a few months, that made no improvement at all, other than WMT stopped saying it couldn't access those pages.
So we removed the pattern from robots.txt.
2. I read that returning a 410 code, instead of 404 might be treated differently.
Lots of confusing information about that and no clarification in google help pages.
So the return code for pages that were definitely gone was changed to 410.
For a while this made no difference whatsoever; in fact, at one point the numbers went up, but maybe that was just googlebot/WMT being out of sync, or other reasons.
Finally, we see the drops over the past 4 days now.
No way to tell if posting here and telling the story helped at all.
So if I needed to do bulk deletions on another site, this is what I would do:
1. Set the return code for those pages to 410, not 404
2. Come in here and say what you've done and leave your shortened URL here.
3. Wait and see what happens.
You will need to work out how to change your return code to 410.
Mine is easy because I wrote all the controlling code. YMMV.
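If it helps, the decision itself is tiny once you control the response. A minimal sketch in Python (the URL pattern is hypothetical - substitute whatever matches your own deleted pages):

```python
import re

# Hypothetical pattern covering the deliberately deleted pages;
# adjust to your own URL scheme.
GONE_PATTERNS = [re.compile(r"^/pressrelease/\d+")]

def status_for(path, page_exists):
    """Pick an HTTP status: 410 for pages we deliberately deleted,
    404 for anything else that's missing, 200 for live pages."""
    if page_exists:
        return 200
    if any(p.match(path) for p in GONE_PATTERNS):
        return 410  # "Gone": signals the removal is permanent
    return 404
```

How you wire that into your server depends entirely on your setup; mine was easy only because I controlled all the code.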
Good luck!
|Re: Bulk content removal request||NewsBlaze||5/6/13 7:09 PM|
One more note.
If you are manually deleting many URLs, be sure you're really awake, because you can wipe out your whole site.
How that happened last night, I don't know, and why it would take action
when the return code was a 200, I have no idea - but that is what it did.
If this happens and you notice it, just cancel the request,
or review what you deleted and undo the ones you didn't mean to do.
Sure would be nice if there was a way to say drop anything that isn't in the sitemap, that is older than X number of days.
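No such option exists, but you can approximate the first half yourself: diff your indexed-URL list against your sitemap and serve 410 for the stale leftovers. A rough sketch (the inputs are hypothetical - you would have to export the indexed URLs and last-seen dates from your own logs):

```python
from datetime import datetime, timedelta

def stale_extras(indexed, sitemap_urls, last_seen, max_age_days):
    """Return indexed URLs that are absent from the sitemap and whose
    last-seen date is older than max_age_days. `last_seen` maps URL -> datetime;
    URLs with no recorded date are treated as stale."""
    cutoff = datetime.now() - timedelta(days=max_age_days)
    return sorted(u for u in indexed
                  if u not in sitemap_urls
                  and last_seen.get(u, datetime.min) < cutoff)
```

The output is just the candidate list; you would still serve 410 for those URLs and wait for googlebot to recrawl them.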
|Re: Bulk content removal request||Everwood_Farm||5/16/13 4:42 PM|
Hey, thanks for taking the time to post this. I am having a similar - though smaller-scale - problem since we changed our site over from ASP to PHP. After doing several thousand manually, I was dreading doing 80k more.
|Re: Bulk content removal request||NewsBlaze||5/21/13 6:04 PM|
Glad that helped, Everwood_Farm.
Three weeks later, the indexed page count is down to around 592,500, so that appears to be working
(of course it could be coincidence too)
Another thing to look for is your sitemaps.
Check to see if WMT says you have more sitemaps than you really have.
Go to the Sitemaps area in WMT and get it to show you the list of your sitemaps.
Change the setting to show 500 sitemaps on one page and it will tell you, at the bottom left of the page,
how many it is displaying of the total it thinks it knows about.
I have asked a question about it in the Sitemaps Group section.
|Re: Bulk content removal request||NewsBlaze||5/21/13 6:24 PM|
Here is the other sitemap question:
|Re: Bulk content removal request||JohnMu||5/28/13 6:22 AM|
Perhaps some clarifications can help ...
- The URL removal tool is not meant to be used for normal site maintenance like this. This is part of the reason why we have a limit there.
- The URL removal tool does not remove URLs from the index, it removes them from our search results. The difference is subtle, but it's a part of the reason why you don't see those submissions affect the indexed URL count.
- The robots.txt file doesn't remove content from our index, but since we won't be able to recrawl it and see the content there, those URLs are generally not as visible in search anymore.
- In order to remove the content from our index, we need to be able to crawl it, and we should see a noindex robots meta tag, or a 404/410 HTTP result code (or a redirect, etc). In order to crawl it, the URL needs to be "not disallowed" by the robots.txt file.
- We generally treat 404 the same as 410, with a tiny difference in that 410 URLs usually don't need to be confirmed by recrawling, so they end up being removed from the index a tiny bit faster. In practice, the difference is not critical, but if you have the ability to use a 410 for content that's really removed, that's a good practice.
For large-scale site changes like this, I'd recommend:
- don't use the robots.txt
- use a 301 redirect for content that moved
- use a 410 (or 404 if you need to) for URLs that were removed
- make sure that the crawl rate setting is set to "let Google decide" (automatic), so that you don't limit crawling
- use the URL removal tool only for urgent or highly visible issues.
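Taken together, those rules boil down to a simple per-URL decision. A minimal sketch (the moved_to and removed lookups are hypothetical stand-ins for your own records, not anything Google provides):

```python
def respond(path, moved_to, removed):
    """301 for content that moved, 410 for content that's permanently gone,
    200 otherwise - with none of these paths disallowed in robots.txt,
    so googlebot can actually crawl and see the result codes."""
    if path in moved_to:
        return 301, moved_to[path]  # permanent redirect to the new URL
    if path in removed:
        return 410, None            # permanently gone
    return 200, None
```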
|Re: Bulk content removal request||NewsBlaze||6/12/13 9:36 PM|
Thank you, JohnMu, I appreciate the info.
So I will stop doing those URL submissions - except when there is a need to have a page out of the index quickly
- and it is fast, thank you.
Several months ago, I stopped using the robots file for that purpose, after I realized it didn't help and suspected it might make things worse.
I have changed to 410, and that does appear to be working, but it is very slow - it only seems to purge fewer than 20,000 per week.
I have a 4-processor quad-core server, so it can handle quite a beating, and crawling is set to automatic.
I have two questions then,
1. Is the ratio of useless items in the index (420,000 of them)
to real pages hurting my site and its ranking? (i.e., do I need to worry about it?)
LOL - not that I can do anything about it, unless #2 below can happen!
2. Is it possible for you to get the googlebot to check the 525,000 pages it thinks we have in the index faster than it is doing?
- First, neither you nor I want the index to be cluttered up with the 420,000 useless items.
- Second, it just distracts me from other stuff I want to work on
- Third, that might actually be what is hurting us
- Currently, the bot only fetches around 10,000 to 14,000 pages per day
- It used to do 100,000+ per day.
- If you could get it to do that this week, all the garbage could be gone by this time next week, because all the bad ones now return 410.
This whole thing has been killing me for 2 years, and I just want it to stop.
I've been working very hard, and I may not be quite there yet, but issues like this make it hard to concentrate on things that matter.
Thanks again for your response, JohnMu.