The indexing of PDFs

From observing my server logs, it seems that PDF files are included in Google's index and will be found by searches.

Canonical declarations

Bing seems to spider files specified in the META statements. pdf_indexing.html was on my PC server and the declaration was wrong.

I have corrected this but had not uploaded the correction at the time of writing.

Parish Council

Many of the linked documents on the Parish Council website are PDF files. While this is a convenient method of publishing, there are limitations regarding the indexing of these pages.

The E-News pages are such PDFs. These can be, and are, specifically submitted for indexing when they are uploaded, so that they appear in Google (and other) searches within a few hours of upload.

The limitation is that there is less control over whether the search engine robots find the pages and index them without being asked. A NOINDEX directive cannot be added to a PDF, but a request can be made to remove a previously indexed PDF from the Google index.

Under normal circumstances there is no problem with the PDFs appearing in Internet searches. However, when a search result shows an edition of the E-News that references an earlier event that could be confused with a current event, that edition needs to be removed from the index.

In some respects, removal could be requested for all editions of the E-News other than the current one.

A case in point is the November 2023 quiz night: a search for it returns a listing for the 1st edition of the E-News.


How to prevent your PDFs from being indexed

As I was saying, there is no way to put a NOINDEX directive in a PDF, but you can ask the spiders not to include your PDFs using a directive in your .htaccess or httpd.conf file.

One reason you may not want your PDFs indexed is that you have done a lot of research to find information, and you use the PDF to illustrate a page that you have written using that information. It is counter-productive if the PDF is then accessed directly, as you did when you found it (or possibly not, as it may not have been indexed in the first place). What you really want is for the person looking for the information to read your page; you are not providing a secondary "look-up service".

The code to place in the configuration file:

# Send a noindex header with every PDF (the Header directive needs mod_headers)
<Files ~ "\.pdf$">
Header set X-Robots-Tag "noindex, nofollow"
</Files>

The StackExchange post and its answers seem to imply that this would work in .htaccess. I have placed the code in my .htaccess (16 April 2025). I will have to see whether this causes any problems.
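One way to check that the header is actually being sent is to request a PDF and look at the response headers. The Python sketch below is only an illustration: the function names are my own, the URL in the comment is hypothetical, and the parsing is deliberately simple (it does not handle values scoped to a particular robot, such as "googlebot: noindex").

```python
import urllib.request


def blocks_indexing(header_value):
    """Return True if an X-Robots-Tag value contains 'noindex' or 'none'."""
    if not header_value:
        return False
    # The header is a comma-separated list of directives, e.g. "noindex, nofollow".
    directives = [part.strip().lower() for part in header_value.split(",")]
    return "noindex" in directives or "none" in directives


def fetch_robots_header(url):
    """Send a HEAD request and return the X-Robots-Tag header, or None if absent."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get("X-Robots-Tag")


# Example call (hypothetical URL, replace with a PDF on your own site):
# value = fetch_robots_header("https://www.example.com/docs/sample.pdf")
# print(value, blocks_indexing(value))
```

If the directive is working, the header value for any PDF should be "noindex, nofollow", while ordinary HTML pages should return no X-Robots-Tag header at all.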

I am increasingly seeing PDF files being accessed directly, so I conclude that they are in the search engine indexes. I can have them removed, but this directive may prevent them from getting there in the first place.
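To put some numbers on the direct accesses, the server log can be tallied by referrer. The sketch below is a rough illustration, not the method I actually used: it assumes the log is in Apache's "combined" format, and the function name and the sample file name in the comment are hypothetical. A PDF request whose referrer is a search engine suggests the file is in that engine's index; a referrer of "-" only says that no referrer was sent.

```python
import re
from collections import Counter

# Pattern for the Apache "combined" log format (an assumption about the LogFormat in use).
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def pdf_referrers(lines):
    """Count successful (status 200) PDF requests, keyed by (path, referrer)."""
    counts = Counter()
    for line in lines:
        match = COMBINED.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        path = match.group("path").split("?", 1)[0]  # drop any query string
        if path.lower().endswith(".pdf") and match.group("status") == "200":
            counts[(path, match.group("referer"))] += 1
    return counts


# Example call (hypothetical log file name):
# with open("access.log", encoding="utf-8", errors="replace") as log:
#     for (path, referrer), hits in pdf_referrers(log).most_common(20):
#         print(hits, path, referrer)
```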

Removing the PDF from Google's index


Links