Google Product Forums

How Googlebot escapes hashbang fragments when converting them to the _escaped_fragment_ query parameter


dgervalle May 21, 2012 12:29 PM
Posted in group: Webmaster Central

Categories: Crawling, indexing & ranking

After a lot of googling and a careful reading of the full specification for making AJAX applications crawlable, I am still puzzled about how Googlebot escapes hashbang fragments when converting them to the _escaped_fragment_ query parameter.

According to the documentation, %00..20, %23, %25..26, %2B and %7F..FF are escaped. At first glance this seemed to conflict with the sample provided:
#!key1=value1&key2=value2 <=> ?_escaped_fragment_=key1=value1%26key2=value2
which shows that '=' is not escaped, although '=' is %3D (%23 is '#'), so it falls outside the listed ranges.
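
As a minimal sketch of the rule as documented, the following Python percent-encodes only the bytes in the listed ranges. The helper name escape_fragment_strict is hypothetical, and treating the fragment as UTF-8 bytes is an assumption; this is not the crawler's actual code.

```python
# Bytes the documentation lists as escaped: %00..20, %23, %25..26, %2B, %7F..FF.
ESCAPED_BYTES = (
    set(range(0x00, 0x21)) | {0x23, 0x25, 0x26, 0x2B} | set(range(0x7F, 0x100))
)


def escape_fragment_strict(fragment: str) -> str:
    """Percent-encode only the bytes in the documented set (hypothetical helper)."""
    out = []
    for byte in fragment.encode("utf-8"):  # assumption: the fragment is UTF-8
        if byte in ESCAPED_BYTES:
            out.append(f"%{byte:02X}")
        else:
            out.append(chr(byte))
    return "".join(out)


print(escape_fragment_strict("key1=value1&key2=value2"))
# -> key1=value1%26key2=value2
```

Under this reading, '&' (0x26) is escaped and '=' (0x3D) is left alone, which is what the sample shows.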

Applied strictly, the specification would double-escape the fragment, since special characters are already escaped in a normal fragment. This would mean that a value of "A complex value & special chars +" would be shown as:
#!key1=A+complex+value+%26+special+chars+%2B <=> ?_escaped_fragment_=key1=A%2Bcomplex%2Bvalue%2B%2526%2Bspecial%2Bchars%2B%252B

which would require double decoding. From what I have read up to now, no one clearly mentions this, so I doubt this was the original intent. As far as I can see, apart from badly encoded URLs, the only character that could cause an issue during the transition is the ampersand, which is both a separator in the fragment and a separator of query parameters.
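
The double decoding can be illustrated with Python's standard urllib.parse, used here only as a stand-in for whatever query parser the server runs. The query string is the strict-reading example from above.

```python
from urllib.parse import parse_qs

# Query string the crawler would request under the strict reading.
query = "_escaped_fragment_=key1=A%2Bcomplex%2Bvalue%2B%2526%2Bspecial%2Bchars%2B%252B"

# First decoding: the query parser recovers the original fragment.
fragment = parse_qs(query)["_escaped_fragment_"][0]
print(fragment)
# -> key1=A+complex+value+%26+special+chars+%2B

# Second decoding: the fragment is itself parsed as a query string.
params = parse_qs(fragment)
print(params["key1"][0])
# -> A complex value & special chars +
```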

Therefore, I currently suppose that the full specification is not correct, and that the only character that gets escaped twice is simply the ampersand. This way, the above would be simplified to:
#!key1=A+complex+value+%26+special+chars+%2B <=> ?_escaped_fragment_=key1=A+complex+value+%2526+special+chars+%2B
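
This ampersand-only hypothesis can be sketched in Python as a pair of string replacements. The function names are hypothetical, and this is only the supposition above, not confirmed crawler behavior.

```python
def to_escaped_fragment(fragment: str) -> str:
    """Hypothesis: only the ampersand gets an extra level of escaping."""
    # Handle already-encoded ampersands first, so the "%26" produced for
    # literal ampersands in the second step is not escaped again.
    return fragment.replace("%26", "%2526").replace("&", "%26")


def from_escaped_fragment(value: str) -> str:
    """Reverse direction: remove the extra level of ampersand escaping."""
    # "%2526" does not contain the substring "%26", so this order is safe.
    return value.replace("%26", "&").replace("%2526", "%26")


print(to_escaped_fragment("key1=A+complex+value+%26+special+chars+%2B"))
# -> key1=A+complex+value+%2526+special+chars+%2B
print(to_escaped_fragment("key1=value1&key2=value2"))
# -> key1=value1%26key2=value2
```

One caveat of this scheme: because '%' itself is left untouched, a fragment that already contains %2526 passes through unchanged and comes back as %26, so the mapping is not reversible for every input.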

For testing purposes, here is a bookmarklet (if it does not get filtered out...) that I have built to help test the transition in both directions. It detects which form the current browser URL is in and switches the browser location between the two URLs. I currently apply only escaping and unescaping of the ampersand between the two URLs (I also clean up the hash, treating it as a query string in itself, decoding and re-encoding it as well).
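
For readers who cannot see the bookmarklet, here is a rough Python sketch of the same switching logic. It is not the bookmarklet itself: toggle_url and the example.com URL are hypothetical, only the ampersand is escaped (the assumption described above), and the hash cleanup step is left out.

```python
from urllib.parse import urlsplit, urlunsplit

PARAM = "_escaped_fragment_="


def toggle_url(url: str) -> str:
    """Switch between the #! form and the _escaped_fragment_ form of a URL."""
    parts = urlsplit(url)
    if parts.fragment.startswith("!"):
        # Hashbang URL -> crawler URL; only the ampersand is escaped (assumption).
        value = parts.fragment[1:].replace("%26", "%2526").replace("&", "%26")
        query = f"{parts.query}&{PARAM}{value}" if parts.query else f"{PARAM}{value}"
        return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
    head, found, value = parts.query.rpartition(PARAM)
    if found:
        # Crawler URL -> hashbang URL.
        fragment = "!" + value.replace("%26", "&").replace("%2526", "%26")
        return urlunsplit(
            (parts.scheme, parts.netloc, parts.path, head.rstrip("&"), fragment)
        )
    return url  # Neither form: leave the URL unchanged.


print(toggle_url("http://example.com/page#!key1=value1&key2=value2"))
# -> http://example.com/page?_escaped_fragment_=key1=value1%26key2=value2
```

Calling toggle_url on the result returns the original hashbang URL, which mirrors the back-and-forth switching of the bookmarklet.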

Does anyone know the real algorithm used by the crawler, and could you elaborate on the above to help me make this bookmarklet fully accurate? (Obviously, I could set up a real-world test and wait for the crawler to check the hypothesis, but this could take a long time...)

Thanks in advance for your advice.