White House and robots.txt
The Brad Blog has this story about how the White House might be attempting to "clean" the site up by removing some audio/video clips. Some commenters suggested that it's a moot point since the Wayback Machine archives everything anyway, and others corrected them, saying that multimedia content is not archived.
More interesting for me was one of the comments which said:
«Even for HTML files (such as the list of Coalition members) it would be a simple matter for the White House to instruct the Wayback Machine to remove them from its archive using robots.txt, like it has already done for most Iraq-related documents.»
So, I was curious and got the robots.txt file.
$ wc -l whitehouse-gov-robots.txt
1972 whitehouse-gov-robots.txt
$ grep iraq whitehouse-gov-robots.txt | wc -l
835
So, "iraq" was mentioned in more than 42% of the lines in the file. Here are some of the lines:
Disallow: /911/911day/iraq
Disallow: /911/progress/iraq
Disallow: /911/sept112002/iraq
Disallow: /deptofhomeland/analysis/iraq
Disallow: /deptofhomeland/iraq
OK. Maybe the White House doesn't want the public to know about the President's public utterances about Iraq and 9/11, which might come back later to haunt him.
Disallow: /firstlady/healthystart/iraq
Disallow: /firstlady/iraq
Disallow: /firstlady/whitehouselife/iraq
Disallow: /firstlady/recipes/iraq
Hmm. I guess the First Lady has some recipes for Iraqi people, but doesn't want Google to index them.
Disallow: /kids/barney/iraq
Disallow: /kids/pets/iraq
Disallow: /teeball/iraq
Disallow: /tee-ball/iraq
Right! We don't want anyone to know what the White House has said about kids and T-ball in Iraq. That's highly sensitive material.
Disallow: /vote/iraq
Ah, the truth has come out. Finally!
Looks like somebody went overboard and added /iraq to every folder on the site.
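To check that hunch, a rough Python sketch like the one below can count how many lines mention "iraq" and how many of those simply end in /iraq. The SAMPLE text holds only the lines quoted in this post, not the full 1972-line file, and the function name iraq_stats is made up for illustration.

```python
# SAMPLE contains only the Disallow lines quoted in this post,
# not the whole robots.txt file.
SAMPLE = """\
Disallow: /911/911day/iraq
Disallow: /911/progress/iraq
Disallow: /911/sept112002/iraq
Disallow: /deptofhomeland/analysis/iraq
Disallow: /deptofhomeland/iraq
Disallow: /firstlady/healthystart/iraq
Disallow: /firstlady/iraq
Disallow: /firstlady/whitehouselife/iraq
Disallow: /firstlady/recipes/iraq
Disallow: /kids/barney/iraq
Disallow: /kids/pets/iraq
Disallow: /teeball/iraq
Disallow: /tee-ball/iraq
Disallow: /vote/iraq
"""


def iraq_stats(text):
    """Return (all lines, lines containing 'iraq', lines ending in '/iraq')."""
    lines = text.splitlines()
    mentions = [ln for ln in lines if "iraq" in ln]  # same idea as `grep iraq`
    suffixed = [ln for ln in mentions if ln.rstrip().endswith("/iraq")]
    return len(lines), len(mentions), len(suffixed)


total, mentions, suffixed = iraq_stats(SAMPLE)
print(total, mentions, suffixed)

# The share reported above: 835 matching lines out of 1972.
share = 835 / 1972
print(f"{share:.1%}")
```

Running the same function over the downloaded file would show whether the pattern holds for all 835 matching lines or only for the handful quoted here.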
For those who might not know what a robots.txt file is, it's a file maintained by website administrators to "suggest" that web-crawling robots (like Google's) stay away from parts of their site which they don't want indexed. A well-behaved web crawler is supposed to heed the suggestions.
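As a rough illustration of what "heeding the suggestions" looks like, here is a small sketch using Python's standard urllib.robotparser module. The two Disallow lines come from the file above, but the "User-agent: *" line is my assumption (the excerpts quoted here don't show which robots the rules apply to), and "ExampleBot" is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# Two rules quoted from the file; the User-agent line is an assumption.
RULES = """\
User-agent: *
Disallow: /firstlady/recipes/iraq
Disallow: /vote/iraq
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path, agent="ExampleBot"):
    """Ask whether a polite crawler named `agent` may fetch `path`."""
    return parser.can_fetch(agent, path)


print(allowed("/vote/iraq"))            # False: matches a Disallow rule
print(allowed("/vote/iraq/page.html"))  # False: rules match by path prefix
print(allowed("/vote/"))                # True: no rule covers this path
```

Nothing forces a crawler to run a check like this; the file only works because crawlers choose to consult it before fetching a page.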