Improving Control of How Web Pages Are Spidered and Harvested
by Michael Jensen, National Academies Press
Among the many challenges we've been facing recently are "crawlers"—Google, Yahoo, and the rest, spidering our sites and harvesting our pages to index.
According to our analysis, there are days when 60 percent of our traffic comes from Google recrawling our site, and 20 percent from Yahoo. We (to our shame) have not made good use of the "robots.txt" file (a file at the Web root that, in essence, tells crawlers which directories not to crawl and which files not to harvest). In investigating further, we have also found a number of other options. There is, for example, a META tag that one can place in the HTML that says, in essence, "last updated on X date," which some crawlers use to determine whether the whole file needs to be reharvested.
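To make the robots.txt idea concrete, here is a minimal sketch in Python. The rules and paths (/cgi-bin/, /search/) and the crawler name "ExampleBot" are assumptions chosen for illustration, not our actual configuration. The sketch uses the standard library's robots.txt parser to show how a well-behaved crawler would read such a file and decide what to skip.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; these paths are assumed,
# not taken from any real site's robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /search/
"""


def allowed(url_path: str, agent: str = "ExampleBot") -> bool:
    """Return True if a compliant crawler may fetch url_path under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url_path)


if __name__ == "__main__":
    for path in ("/index.html",
                 "/cgi-bin/search.cgi?q=climate",
                 "/search/results.html"):
        print(path, "->", "crawl" if allowed(path) else "skip")
```

Keep in mind that robots.txt is a request, not an enforcement mechanism: it only helps with crawlers that choose to honor it.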
There is also a useful "nofollow" link attribute that we implemented for some of our links. Our home page has "Popular Searches," a live link that sends a preset query to our search engine. On our search results page, we have "likely searches" (terms extracted from the search results, for a user to try) that send a query in the same way. So Google would crawl the home page, follow the popular search terms (initiating our CGI script), harvest the search engine results, and then follow all the links on that page, including the 20 "likely searches" listed on the results page... and on we go, cycling along. That kind of thrashing was killing us.
By simply adding rel="nofollow" to the link tag (as in [a href="blabla" rel="nofollow"]), we got Google to stop following those links. That made our machines oh-so-much happier, and our bandwidth requirements lower.
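In HTML, this link-level hint is conventionally written as a rel="nofollow" attribute on the anchor tag. As a rough sketch of how a page template might generate such links, the Python helper below builds a search link with that attribute. The script path /cgi-bin/search.cgi, the query parameter name q, and the function name search_link are all hypothetical and stand in for whatever a given site actually uses.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical location of the search script; adjust to the real site layout.
SEARCH_SCRIPT = "/cgi-bin/search.cgi"


def search_link(term: str) -> str:
    """Build an anchor tag for a preset search, marked rel="nofollow"
    so that compliant crawlers do not follow it into the search engine."""
    href = f"{SEARCH_SCRIPT}?{urlencode({'q': term})}"
    return f'<a href="{escape(href)}" rel="nofollow">{escape(term)}</a>'


if __name__ == "__main__":
    print(search_link("climate change"))
```

Generating these links in one place means every "popular" or "likely" search link carries the attribute, rather than relying on someone remembering to add it by hand.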
We're in the process of making other changes, but are going slowly, because we like our stuff to be well-indexed, and don't want to kill the goose laying those golden eggs!