Preventing PDFs from being spidered.

On an email list I lurk on, someone asked if there was any way to prevent PDFs on their website from being indexed by crawlers such as Googlebot, MSNBot, and their cousins. A helpful reply pointed them to Jakob Nielsen's sidebar on Preventing Search Engines from Spidering PDF Files. Nielsen suggests using your robots.txt file or password-protecting your documents. While the latter is a foolproof solution, the former is really insufficient.

Using robots.txt doesn’t guarantee that your PDF files won’t be
spidered. There is nothing beyond convention and etiquette that
enforces the rules defined in robots.txt. Well-behaved spiders, of
course, do obey it, so this will keep your PDFs out of Google, Yahoo,
MSN, etc. But there’s nothing stopping me from writing my own crawler
that ignores robots.txt and grabs your precious PDF files.
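
To make the "convention and etiquette" point concrete, here is a minimal sketch of what a polite crawler does before fetching a URL, using Python's standard urllib.robotparser module. The robots.txt content, the /pdfs/ directory, the example.com URLs, and the ExampleBot name are all assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks crawlers to stay out of /pdfs/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pdfs/
"""


def polite_can_fetch(url, user_agent="ExampleBot"):
    """Return True if robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

The check runs entirely inside the crawler. A crawler that never calls anything like polite_can_fetch simply requests the PDF, and the server has no way of knowing the rule was skipped.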

One solution (although it can be defeated via User-Agent spoofing)
is to pass downloads through a script and check the user agent string.
If it's an allowed agent, the download continues normally; otherwise
it's blocked. On Apache you could do this with rewrite rules alone, or with a rule that passes all PDF downloads to a PHP script that inspects the HTTP_USER_AGENT string. The rules below transparently serve the /download_not_allowed.html page to user agents that contain neither of the strings Gecko or MSIE, at least one of which appears in most browsers' user agent strings.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} !Gecko
RewriteCond %{HTTP_USER_AGENT} !MSIE
RewriteRule \.pdf$ /download_not_allowed.html [L]
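
For the script-based variant, the same decision can be sketched in a few lines. The post mentions PHP; this sketch uses Python instead, purely for illustration, and the function names and sample user agent strings are hypothetical. It only decides which path to serve and leaves the actual file delivery to whatever framework sits around it.

```python
# Substrings that the rules above treat as "looks like a browser".
ALLOWED_TOKENS = ("Gecko", "MSIE")


def is_allowed_agent(user_agent):
    """Return True if the User-Agent header contains a browser token."""
    return any(token in (user_agent or "") for token in ALLOWED_TOKENS)


def choose_response(path, user_agent):
    """Return the path to serve: the PDF itself or the refusal page."""
    if path.endswith(".pdf") and not is_allowed_agent(user_agent):
        return "/download_not_allowed.html"
    return path
```

A request from a hypothetical "ExampleCrawler/1.0" for /docs/report.pdf gets the refusal page, but the same crawler sending "Mozilla/5.0 Gecko/20100101" gets the PDF. That is the spoofing weakness noted above: the server can only see what the client claims to be.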

An alternative, less user-friendly solution is to allow only downloads
verified by a CAPTCHA, a test commonly used to block spam on blogs. That is,
the download only happens if the user correctly identifies the
words/picture in a distorted image.
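
The server-side half of that check might look roughly like the sketch below. It assumes the distorted image is generated elsewhere and covers only the bookkeeping: remember the expected answer under a one-time token, then compare the user's response before releasing the file. The CaptchaGate class and its method names are hypothetical, and a real deployment would also need expiry and persistent storage.

```python
import hmac
import secrets


class CaptchaGate:
    """Tracks pending CAPTCHA challenges in memory (illustrative only)."""

    def __init__(self):
        self._pending = {}  # token -> expected answer

    def issue(self, answer):
        """Store the expected answer and return a one-time token."""
        token = secrets.token_urlsafe(16)
        self._pending[token] = answer
        return token

    def verify(self, token, response):
        """Return True once per token if the response matches the answer."""
        expected = self._pending.pop(token, None)  # a token works only once
        if expected is None:
            return False
        # Normalise case and whitespace, then compare in constant time.
        return hmac.compare_digest(
            expected.strip().lower().encode("utf-8"),
            response.strip().lower().encode("utf-8"),
        )
```

The download handler would call verify with the token from the form and the user's typed answer, and serve the PDF only when it returns True.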