|Riddle me this, weird GET's in my logs||Alexander Maassen||7/20/12 3:25 PM|
Today I was checking some stuff and found this in the access logs of www.scarynet.org. My question: why on earth does the bot even think these URLs exist?
184.108.40.206 - - [21/Jul/2012:00:10:39 +0200] "GET /krdiyknd.html HTTP/1.1" 404 5352 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/b
|Re: Riddle me this, weird GET's in my logs||Panda_Effects||7/20/12 3:29 PM|
Possibly a Googlebot glitch, or maybe a spoofed IP made to look like Googlebot?
|Re: Riddle me this, weird GET's in my logs||Panda_Effects||7/20/12 3:34 PM|
|Re: Riddle me this, weird GET's in my logs||Lysis||7/20/12 3:38 PM|
ScaryNet, I escalated this, if only because it's a weird coincidence. Not sure if a Googler will comment, but maybe they will look into it.
It could be a new hack or a scraper sniffing people's sites, and if it is, I'm sure Google can give us a heads-up. Otherwise, normal noise. But we'll see. :)
|Re: Riddle me this, weird GET's in my logs||JohnMu||7/20/12 5:29 PM|
Thanks for forwarding & including the link to the other thread -- I'll take a look at what's happening here with the team and post once I know more.
|Re: Riddle me this, weird GET's in my logs||50BMG||7/20/12 8:13 PM|
Thanks to Panda_Effects, and ScaryNet for linking my thread about this issue to this one.
Thanks for the escalation. I could speculate about its cause, but won't for the time being.
I would however note that the requests to my site don't appear to match any of yours here [which I do see are repeated, at least in part, on multiple sites].
Thinking the bot might be legitimately looking for signs of malware, I scanned all drives on the server and surrounding systems for those names. None turned up. Did any of you do this? And did you find such a file?
|Re: Riddle me this, weird GET's in my logs||Panda_Effects||7/20/12 8:42 PM|
"could speculate about it's cause"
Speculations are just that, so I would be interested in hearing your thoughts.
|Re: Riddle me this, weird GET's in my logs||webseos||7/20/12 8:53 PM|
I think this is a hacking case; check your folder permission levels.
|Re: Riddle me this, weird GET's in my logs||50BMG||7/21/12 12:07 AM|
Ok... what the heck.
Q) Are they from Google or not?
A) Seems to me they are. In my case, if they were spoofed, the senders would have to know which Google IP address normally crawls my site, and duplicate it. But if they did this, they were clever enough to manage it so as not to interfere with Google's legitimate crawling of the site, which was simultaneous [within minutes], and whose activity I can validate in Google's Webmaster Tools.
But there is more... it's multiple sites, again nearly at the same time. I take it that your activity, too, was from the customary crawler addresses. What's more, these are different IPs than my crawler, meaning the spoofs would be multiple sources, multiple targets, again at the same time. Too much sophistication for me; I think it came from their network.
Q) Are they Legitimate crawling requests?
A) Two simple reasons I say "NO".
So... they came from Google, and aren't normal crawling requests. What could they be?
1) I've never seen it, but Google could have chosen today to scan our sites for signs of Malware. But where would they get these names? If it is such a probe, it could only be with specific knowledge of the pest sufficient to predict the algorithm it would use to create the names. Would Google take it upon themselves to do this? I don't know.
2) It could be that the crawlers themselves, or [less likely] the crawler network, have been hacked. In this case, I would expect that the requested names were generated by a pest that needed to find signs of its "brethren" on the web. They chose to use Google's web identity, knowing that its crawlers would be permitted and that, because of the volume of Google traffic, the requests would be least likely to be recognized.
3) Google knowingly issued the crawls, perhaps in response to a special request by a government agency. A lot happened in the 24 hours preceding this event, and I would not think it unreasonable for Homeland Security or the FBI, or other agency, to have been able to make such a request in that time. Why the government would not perform this scan with their own resources, I do not know. Perhaps it's a reason I don't count this as a likely scenario. However, if it is correct - well, you might not have long to read this post.
How's that for speculation?
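
For anyone who wants to test the "are they really from Google?" question rather than guess: a common approach is a forward-confirmed reverse DNS check on the IP from the log line. Below is a minimal Python sketch of that idea, not an official tool. The function name, the injectable lookup parameters, and the list of accepted domain suffixes are my own assumptions for illustration; check Google's current guidance for the hostnames its crawlers actually use.

```python
import socket

# Assumption: hostnames of genuine Google crawlers end in one of these suffixes.
GOOGLE_CRAWLER_DOMAINS = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` passes a forward-confirmed reverse DNS check.

    Step 1: reverse-resolve the IP to a hostname (PTR record).
    Step 2: require the hostname to end in an accepted crawler domain.
    Step 3: forward-resolve that hostname and require the original IP
            to be among the results (IPv4 only in this sketch).
    The lookup functions are injectable so the logic can be tried offline.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse_lookup(ip)
    except OSError:
        # No PTR record or resolver failure: cannot verify.
        return False

    if not host.lower().rstrip(".").endswith(GOOGLE_CRAWLER_DOMAINS):
        return False

    try:
        return ip in forward_lookup(host)
    except OSError:
        return False
```

Calling `is_verified_googlebot("<ip from your log>")` uses live DNS. A hit that passes all three steps is much harder to explain as a simple fake user agent, though it says nothing about why the URL was requested.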
|Re: Riddle me this, weird GET's in my logs||webado||7/21/12 12:49 AM|
Robots do not pass referrer information. Your server will never see a referrer from an access by a robot, whether it's Googlebot or any other robot.
You have to wait a few days until those 404s are reported in Webmaster Tools to find out what some of the actual referrers might have been (part of the data collection and reporting for Webmaster Tools).
|Re: Riddle me this, weird GET's in my logs||Panda_Effects||7/21/12 12:55 AM|
What about what bikramchoudhury suggested? What if a good number of sites got "infected" quickly?
Or it could be a simple glitch, as I mentioned earlier. And it still seems feasible that it could be a spoofed IP.
|Re: Riddle me this, weird GET's in my logs||Panda_Effects||7/21/12 12:57 AM|
But I also believe it could be a test by Google, as you suggested too.
|Re: Riddle me this, weird GET's in my logs||Panda_Effects||7/21/12 1:11 AM|
Some others have seen this issue too
|Re: Riddle me this, weird GET's in my logs||50BMG||7/21/12 1:19 AM|
Rechecking the logs, I see your point about the referral link. As for Webmaster Tools, I saw some items listed from today, but not for after the time of the anomalies. So I will indeed have to wait and see. Thank you. I'll have to reconsider my speculations.
I guess this one is going to have to be explained to me. What folder permissions are we talking about?
|Re: Riddle me this, weird GET's in my logs||Panda_Effects||7/21/12 1:26 AM|
"What folder permissions are we talking about?"
I do not know. But since hackers can change people's .htaccess files, it seems reasonable as a possibility.
But I am inclined to believe, like you, that it is a test by Google. However, it seems to be a very strange test: what could possibly be tested by requesting a garbage HTML file?
But on that link I supplied above, one person said they had "changed a batch of individual 404s or 410s to page-specific redirects. So it made sense for Google to want to confirm that I hadn't stepped into Soft 404 territory by redirecting everything". Something else to consider: possibly it is a quick check of that kind?
|Re: Riddle me this, weird GET's in my logs||50BMG||7/21/12 1:34 AM|
If the GoogleBot found one of the files, I could see the conclusion I'd been hacked. Not all of you though.
So not having been hacked - what use is it to be checking my system for why the bot asked for these?
I'm missing this line of thought entirely.
|Re: Riddle me this, weird GET's in my logs||Panda_Effects||7/21/12 1:43 AM|
Since you know you were not hacked, I agree. I just think many things should at least be considered.
Did you recently change anything in .htaccess or httpd.conf that could have triggered Google to start checking for something, as that person on WebmasterWorld said? I have seen different errors in WMT when I made changes in .htaccess that were not complete enough. Had one today, and I thought I pretty much had everything cleaned up or very close.
And it is possible that Google did not see anything recent but is doing tests on random websites to check for issues. If they do that for a number of sites in batches, it could explain why a few of you saw the same thing today. There could be many more who just never write in groups/forums or do not watch their logs as actively as some of you do.
|Re: Riddle me this, weird GET's in my logs||50BMG||7/21/12 2:25 AM|
No recent changes. File contents of .htaccess and httpd.conf are unaltered, and the timestamps agree with the last time I edited them. I am less inclined to think I provoked this somehow, because I see other admins with the same issue in the same time period on their sites.
I agree that many admins may not have looked to see these in their logs. [yet]
Incidentally... no further probes of this kind have occurred as of this hour, the last being Jul 20 14:36:52 EDT.
|Re: Riddle me this, weird GET's in my logs||Panda_Effects||7/21/12 2:37 AM|
It does seem likely that it is a test Google is doing, either on a number of sites where they saw something or just randomly on all sites.
With .htaccess or httpd.conf it would not have to be from any recent change. Since Panda I have seen that they check things much more, and seemingly differently, and they have caught a number of things that I had no idea I had not done in the best way. So to me it is good if they do some testing and show the results in WMT, but it might be helpful if they explained some of the tests, so there is less confusion. But maybe they do not want to say exactly? I recall seeing strange URLs in WMT before as well, quite some time back, and I have seen strange GETs in the logs too. Some have said that it is feasible that Google uses other IP addresses to check whether different content is served to search engines and visitors, and that makes sense.
|Re: Riddle me this, weird GET's in my logs||cristina||7/21/12 3:38 AM|
Do you have something like this in web crawl errors in Google Webmaster Tools? Look in GWT at both site URLs, with www and without www.
|Re: Riddle me this, weird GET's in my logs||50BMG||7/21/12 5:53 AM|
"No" - They are still not there as of this hour. The last item in that list [a short list] was in June and is a regularly formed query.
|Re: Riddle me this, weird GET's in my logs||webado||7/21/12 6:05 AM|
You have to check your site by www, non-www as well as IP (220.127.116.11) since it responds the same way to all those urls and everything will get logged in the same log with no indication as to which was used.
After you've checked it all, go on and implement this 301 redirection in the .htaccess file to get rid of this canonical problem and send all requests to www.scarynet.org if they weren't to that:
|Re: Riddle me this, weird GET's in my logs||cristina||7/21/12 7:26 AM|
Can you check your server access logs for earlier dates as well? Were URLs like these accessed by anything other than Googlebot?
It is not a solution, but until there is an explanation of where and how Googlebot found those URLs, you may want to just stop Googlebot from accessing them for now (just to unclutter your server access logs a bit). I see that all the URLs end in .html. If you do not have good URLs on your site ending in .html, block them in robots.txt for Googlebot with
I repeat, if you do not have good URLs ending in .html
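
For the log check suggested above, something like the following Python sketch could help. It assumes the Apache "combined" log format (the sample in the first post looks like it, but that is an assumption), and it guesses that the odd requests are 404s for paths made of eight or more lowercase letters plus `.html`, like `/krdiyknd.html`. That pattern, the function names, and the log path in the usage note are all hypothetical; adjust them to what your own logs show, and note the pattern would also match a real page with a long all-lowercase name.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Assumption: the odd requests look like /krdiyknd.html
# (eight or more lowercase letters, then .html).
RANDOM_HTML = re.compile(r'^/[a-z]{8,}\.html$')


def suspicious_404s(lines):
    """Return (ip, path, user_agent) for each 404 on a random-looking .html path."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # not in the assumed format; skip rather than guess
        if m["status"] == "404" and RANDOM_HTML.match(m["path"]):
            hits.append((m["ip"], m["path"], m["agent"]))
    return hits


def count_by_agent(hits):
    """Count suspicious hits per user agent string."""
    return Counter(agent for _ip, _path, agent in hits)
```

To run it over an older log, open the file (for example a hypothetical `access.log.1`) and pass the file object to `suspicious_404s`, then look at `count_by_agent(...)` to see whether anything other than Googlebot asked for such URLs.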
|Re: Riddle me this, weird GET's in my logs||webseos||7/21/12 7:34 AM|
I think Cristina is correct: where did Google find these URLs? Googlebot doesn't generate URLs.
|Re: Riddle me this, weird GET's in my logs||50BMG||7/22/12 1:45 AM|
On Friday, July 20, 2012 6:34:13 PM UTC-4, Panda_Effects wrote:
In that thread is a supposition I did not consider when speculating earlier:
It supposes that someone has managed to spoof Google's IP [from anywhere on the internet] in order to flood Google with these nuisance responses. This works because the original sender never intended to get the response to his query, but rather for it to be sent to Google.
Indeed, the fix [maybe I'd think of it as a band-aid] of preventing reads of ".html" files would have fixed this instance, but it wouldn't prevent future abuses.
However, there is still one issue that I don't accept about this supposition... In my case, the "spoofer" had to have figured out the correct IP address of the GoogleBot which traditionally crawls me. After all, there are many many possible IP addresses Google's crawlers could use. Did he just happen to pick the right one? For all of us?
Makes me doubt this "Spoof to Flood" theory. However, since the queries have ceased [since we began our threads BTW], this problem is not presently "on-going". All we'll be able to do is study it from the logs of this period. Perhaps if anyone else is interested, they'll start a thread in which we all post our logs of this event to a single place. Maybe that will lead to more insights.
Oh - and to answer one of Cristina's dangling questions ( to Me? ) I had already checked my logs for previous queries of this type, and there were none. I went back several iterations.
Finally, still no crawl errors on the WMT page for these [or any other for that matter]
Thanks everyone, for the discussion.
|Re: Riddle me this, weird GET's in my logs||50BMG||7/22/12 2:07 AM|
Perhaps this is the real answer.
On Saturday, July 21, 2012 7:16:51 PM UTC-4, Lysis wrote:
|Re: Riddle me this, weird GET's in my logs||JohnMu||7/22/12 3:50 PM|
These requests are from our side, and were accidentally requested by Googlebot. It's my understanding that this issue has since been resolved (or at least will be in the near future). Sorry for the confusion & thanks for posting here!