Using Robots.txt to Control Search Engines

While reviewing some of my referrer logs, I noticed a couple of hits to specific picture pages based on Google search results. I decided it was time to learn how to keep search engines away from certain parts of my site. Following are some things I learned along the way.


There are two published methods for controlling spiders and robots:

  • Meta tags
  • Robots.txt file
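For completeness, the meta tag approach puts the directive inside each individual HTML page rather than in one central file. A typical (hedged, illustrative) example, placed in a page's head section, looks like this:

```html
<!-- Asks well-behaved robots not to index this page or follow its links -->
<meta name="robots" content="noindex, nofollow">
```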

I’m using the robots.txt method because it’s cleaner (and arguably more of a standard). It’s similar to CSS: when I need to modify my settings, I only need to edit one central file rather than individual HTML files.

The basic method is to create a file called “robots.txt” (lowercase — on case-sensitive servers a file named “Robots.txt” may be missed) and store it at the root of your website. Well-behaved robots and spiders will check this file before attempting to scan your site. The rules defined in robots.txt determine which portions of your site are accessible and which should be skipped. (Note: This discussion won’t help with misbehaving robots, such as those scanning for email addresses, because by definition they aren’t well behaved!)

A simple robots.txt might look like the following:

User-agent: *
Disallow: /mypictures/
Disallow: /test/

The ‘User-agent’ field can be used to target specific robots. Using the ‘*’ wildcard makes the rules apply to all robots (which is what I did). The ‘Disallow’ entries define the parts of the site which should not be crawled or indexed. Essentially, you want to list those portions of your site which you do not want to appear in search engines.
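Rules like the ones in the sample file can be sanity-checked with Python’s standard-library `urllib.robotparser` (a quick sketch; the paths below mirror the sample file):

```python
from urllib.robotparser import RobotFileParser

# Parse the sample rules directly. Normally you would call set_url()
# and read() to fetch the live robots.txt from the site root instead.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /mypictures/",
    "Disallow: /test/",
])

# Any agent name matches the '*' record.
print(rp.can_fetch("Googlebot", "/mypictures/vacation.jpg"))  # False
print(rp.can_fetch("Googlebot", "/index.html"))               # True
```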

Possible reasons for excluding something from a robot might include:

  • Private content
  • Contents change frequently and you want to avoid link rot
  • Content is not worth searching (stats, log files, etc.)

Be careful about using it to hide private folders (“Disallow: /secret/”): the robots.txt file is readable by all, so someone might use it to find secret folders (an admin area, for example). Instead, don’t link from your main site to anything private; if neither index.html nor any of the pages it links to reference your private page, it’s essentially hidden from spiders and robots. For additional security, password-protect the private pages.
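Password protection is web-server specific. On Apache, for example, a .htaccess file in the private folder might look like the following sketch (the AuthUserFile path is a placeholder — use the actual location of your .htpasswd file):

```apache
AuthType Basic
AuthName "Private Area"
AuthUserFile /home/example/.htpasswd
Require valid-user
```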


In my case, I stopped letting search engines crawl my photo collections. My photos don’t have useful filenames or captions and I started noticing a few weird hits from Google. In my case at least, I don’t expect the photos to be the reason someone finds my site. Instead, it will probably be someone who knows me or found the site through another link and is just browsing.

As time passes, I’ll be watching my server logs to see if spiders and robots are obeying my robots.txt file.
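One way to do that check is a small Python script that scans the access log for any crawler request hitting a disallowed path. This is only a sketch: it assumes the common Apache/Nginx “combined” log format (user-agent is the last quoted field), and the sample lines and bot-name hints are made up for illustration.

```python
# Flag log entries where a known crawler requested a disallowed path.
DISALLOWED = ("/mypictures/", "/test/")
BOT_HINTS = ("googlebot", "bingbot", "slurp", "crawler", "spider")

def violations(log_lines):
    hits = []
    for line in log_lines:
        try:
            # The request path is the second token of the first quoted
            # section: "GET /path HTTP/1.1"
            request = line.split('"')[1]
            path = request.split()[1]
            # The user-agent is the last quoted field in combined format.
            agent = line.split('"')[-2].lower()
        except IndexError:
            continue  # skip malformed lines
        if any(hint in agent for hint in BOT_HINTS) and path.startswith(DISALLOWED):
            hits.append((agent, path))
    return hits

sample = [
    '1.2.3.4 - - [01/Jan/2004:10:00:00 +0000] "GET /mypictures/a.jpg HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '5.6.7.8 - - [01/Jan/2004:10:01:00 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/4.0"',
]
print(violations(sample))  # [('googlebot/2.1', '/mypictures/a.jpg')]
```

Any output from this script means a crawler fetched a page it was told to skip.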

Further Reading

The Web Robots Pages are the best starting point, in particular the FAQ and the discussion about Robot Exclusion.

Posted in: Web