OH MATRON! Google has just lifted its skirts to reveal its robots.txt parser, which it has made open source, thereby revealing one of the secrets behind its web crawling ways.
The parser implements the Robots Exclusion Protocol (REP), the convention that webmasters use to keep some of their website's content out of the view of Google's search engine web crawler.
By writing REP rules into robots.txt files, website wranglers effectively tell Google Search's Googlebot and other automated crawlers which parts of the site to avoid perusing, thereby keeping some content private and avoiding any unnecessary indexing.
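A typical robots.txt file reads like the sketch below. The paths and the "example.com" sitemap are made up for illustration, but the directives themselves are standard REP fare:

```text
# Block Googlebot from a hypothetical private area
User-agent: Googlebot
Disallow: /private/

# All other crawlers: stay out of drafts
User-agent: *
Disallow: /drafts/

Sitemap: https://example.com/sitemap.xml
```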
For 25 years or so, the REP has been a de facto standard for managing the sniffing around that web crawlers do. But it never became an official standard.
No official standard means no official guidelines on how to use the REP, which has led to different sites interpreting the robots.txt format in different fashions. Crawlers, in turn, interpret those files inconsistently, which can bork things for web searches looking for the most relevant content.
For creators of web crawlers, the lack of an official standard means they are left with uncertainty around how their crawlers should handle things like large robots.txt files.
Google hopes that by open sourcing the robots.txt parser, developers will be able to prod around in the C++ library the Googlebot uses for parsing and matching rules in robots.txt files. Essentially, this should pave the way for a better understanding of how crawlers interact with robots.txt files and thus pave the way for some form of standard practices.
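The mechanics are simple enough to sketch. Google's library is C++, but the same parse-and-match idea can be illustrated with Python's standard-library robots.txt parser; the user agent, domain, and paths below are invented for the example:

```python
# Sketch of how a crawler consults robots.txt before fetching a URL.
# This uses Python's stdlib urllib.robotparser as an illustration,
# not Google's open-sourced C++ library.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /drafts/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # match rules against crawler name and URL path

# Googlebot is barred from /private/ but free to crawl elsewhere.
print(parser.can_fetch("Googlebot", "https://example.com/private/page.html"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/public/page.html"))   # True
```

The point of standardising is that every crawler answering `can_fetch` this way should reach the same verdict on the same file, which today is not guaranteed.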
Google has also published a draft proposal for how it reckons the REP should be used, which it will submit to the Internet Engineering Task Force (IETF), the folks that handle standards on the internet.
There's no guarantee that open sourcing the REP parser will lead to it becoming an official standard, but it's a step in that direction. And for the average web user, it should mean better and more accurate content is served up when searching; in the near future, your search for 'good hand crank' might not serve up questionable links, for example. µ