Friday, October 22, 2004

Whitehouse and robots.txt

The Brad Blog has this story about how the White House might be attempting to "clean up" its site by removing some audio/video clips. Some commenters suggested that it's a moot point since the Wayback Machine archives everything anyway, and others corrected them, saying that multimedia content is not archived.

More interesting to me was one of the comments, which said:

«Even for HTML files (such as the list of Coalition members) it would be a simple matter for the White House to instruct the Wayback Machine to remove them from its archive using robots.txt, like it has already done for most Iraq-related documents.»
So, I was curious and got the robots.txt file [1].
$ wc -l whitehouse-gov-robots.txt
1972 whitehouse-gov-robots.txt

$ grep iraq whitehouse-gov-robots.txt | wc -l
So, "iraq" was mentioned in more than 42% of the lines in the file. Here are some of the lines:
Disallow:       /911/911day/iraq
Disallow:       /911/progress/iraq
Disallow:       /911/sept112002/iraq
Disallow:       /deptofhomeland/analysis/iraq
Disallow:       /deptofhomeland/iraq
OK. Maybe the White House doesn't want the public to know about the President's public utterances about Iraq and 9/11, which might come back later to haunt him.
Disallow:       /firstlady/healthystart/iraq
Disallow:       /firstlady/iraq
Disallow:       /firstlady/whitehouselife/iraq
Disallow:       /firstlady/recipes/iraq
Hmm. I guess the First Lady has some recipes for Iraqi people, but doesn't want Google to index them.
Disallow:       /kids/barney/iraq
Disallow:       /kids/pets/iraq
Disallow:       /teeball/iraq
Disallow:       /tee-ball/iraq
Right! We don't want anyone to know what the White House has said about kids and tee-ball in Iraq. That's highly sensitive material.
Disallow:       /vote/iraq
Ah, the truth has come out. Finally!

Looks like somebody went overboard and added /iraq to every folder on the site.
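
A quick way to check that hunch would be to count how many Disallow rules end in /iraq. Here is a minimal Python sketch of the idea. The helper name `iraq_share` is made up for illustration, the sample is just a few lines from the excerpt above plus a `User-agent` line and one invented non-Iraq rule (`/cgi-bin`), and the real file would of course be read from disk instead.

```python
def iraq_share(lines):
    """Return (matching, total, percent) for Disallow rules whose path ends in /iraq."""
    rules = []
    for line in lines:
        # Split "Disallow:   /path" into the field name and its value.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            rules.append(value.strip())
    # Count rules whose last path component is "iraq".
    matching = sum(1 for path in rules if path.rstrip("/").endswith("/iraq"))
    percent = 100.0 * matching / len(rules) if rules else 0.0
    return matching, len(rules), percent


# Illustrative sample only; "/cgi-bin" is an invented rule, not from the real file.
sample = [
    "User-agent: *",
    "Disallow:       /911/911day/iraq",
    "Disallow:       /firstlady/recipes/iraq",
    "Disallow:       /kids/barney/iraq",
    "Disallow:       /cgi-bin",
]

matching, total, percent = iraq_share(sample)
print(f"{matching} of {total} Disallow rules end in /iraq ({percent:.0f}%)")
```

Run against the full file (for example `iraq_share(open("whitehouse-gov-robots.txt"))`), a very high share would be consistent with /iraq having been appended to every directory mechanically.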

[1] For those who might not know what a robots.txt file is, it's a file maintained by website administrators to "suggest" that web-crawling robots (like Google's) stay away from parts of their site which they don't want indexed. A well-behaved web crawler is supposed to heed the suggestions.
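
To make "heeding the suggestions" a bit more concrete, here is a small sketch using Python's standard `urllib.robotparser` module, which is roughly what a polite crawler would do before fetching a page. The `User-agent: *` line, the bot name `ExampleBot`, and the helper `allowed` are assumptions for illustration; only the two Disallow lines come from the file excerpt.

```python
from urllib.robotparser import RobotFileParser

# Two rules from the excerpt; the "User-agent: *" line is assumed here.
rules = """\
User-agent: *
Disallow: /firstlady/recipes/iraq
Disallow: /kids/barney/iraq
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(path, agent="ExampleBot"):
    """Ask whether a crawler named `agent` may fetch `path` (hypothetical helper)."""
    return parser.can_fetch(agent, "http://www.whitehouse.gov" + path)


print(allowed("/firstlady/recipes/iraq"))  # disallowed by the first rule
print(allowed("/firstlady/recipes/"))      # not covered by any rule
```

Note that nothing enforces this: the file only tells a crawler what the site would prefer, and it is up to the crawler to ask and comply.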


