Saturday, January 5, 2008

And now for something completely different-- Searching PDFs, or Using Adobe's PDF IFilter with WSS 3.0 sp1

(Actually, the version of WSS 3.0 doesn't matter (whether or not SP1 is applied); what matters is the PDF ifilter.)

You may have installed WSS 3.0 on a server or two and had no problems with Search. It works great, indexing content on a tidy schedule with no mishaps-- users can search for anything in a site collection, from announcements to the contents of big Word documents. But, when the users try to get fancy and introduce PDFs to the mix, things get tricky because WSS doesn't search the content of those.

You see, WSS 3.0 can search standard Windows File types (meaning Office file types and text files mostly). Supposedly it can also, out of the box, identify characters in OCR'd TIF files. However, it cannot search PDF files.

Why? Because MS only offers index filter files (files that teach the WSS indexing service how to gather data) for their own file types.

However, if you want to index PDF file contents, you can get an index filter file (called an ifilter) by downloading it from Adobe.

And that worked for versions 5 and 6, but when version 8.0 came out, things changed. You see, you used to be able to download the older ifilters straight from Adobe, but suddenly, you can't for version 8.0.

Why? Because the ifilter file is now bundled with Adobe Reader. So to get the ifilter, you have to install Adobe Reader (8.0 or higher) onto the WSS server that will be doing indexing.

However, not a lot of people know that. Which is why there are people in the public newsgroups having problems indexing their pdf files in WSS 3.0 (or higher). So either people don't realize that they need a pdf related ifilter or people do download and use the older Adobe PDF ifilter files, thinking that'll do it.

But people who download the older ifilters find they can only index PDF files of that version and lower, not non-Adobe PDFs or higher versions. To index older versions, non-Adobe PDFs, and the newest versions of PDF, you need the newest version of the ifilter (currently 8.1.1).

Of course, a company called Foxit capitalized on people's confusion about getting and using ifilters by offering their PDF IFilter-- for a pretty penny of course.

But Adobe is still offering their ifilter for free. The only price you pay is having to install Adobe Reader on the server. And if that is too steep, then, well, it's good to know now before going any further.

There are a few tricks to getting the ifilter to work with WSS. Basically WSS needs to know it's there and what extension to use it on, so a few registry changes are in order.

Namely you will have to edit the registry to add the PDF file type to the Extensions List for WSS search, and to map the extension to a particular ifilter.

To do that go to regedit (go to Start Menu>Run> type regedit, hit Enter). Once in the registry, open the key:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Applications\{ANYGUID}\Gather\Search\Extensions\ExtensionList

The extension list holds all the extensions that the indexing service (the Gatherer, as it were) should recognize. Each one is a string value whose name is the next consecutive number and whose value data is the extension itself.

To add PDF to the list, you simply find the highest number in the list (it goes in order, 1,2, 3, 4... up to the last of them), add a String Value that is the next higher value (so if the highest value was 37 for example, the string value you would add is 38), and enter "pdf" for the Value Data.
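The numbering rule above can be sketched in a few lines of Python. This is only a sketch under assumptions: `next_extension_entry` is a made-up helper name, and the sample dictionary stands in for the value names and data you would actually see under the ExtensionList key on your own server.

```python
def next_extension_entry(existing, extension="pdf"):
    """Return (name, data) for the value to add to ExtensionList.

    `existing` maps value names ("1", "2", ...) to extensions, mirroring
    what regedit shows under the ExtensionList key. Returns None when the
    extension is already listed, so nothing gets added twice.
    """
    wanted = extension.lower().lstrip(".")
    if any(str(data).lower() == wanted for data in existing.values()):
        return None
    # Ignore anything that is not a plain number, such as "(Default)".
    numbers = [int(name) for name in existing if name.isdigit()]
    return str(max(numbers, default=0) + 1), wanted


# Hypothetical sample: the highest existing value name is 37.
sample = {str(i): "ext%d" % i for i in range(1, 38)}
print(next_extension_entry(sample))  # ('38', 'pdf')
```

Checking for an existing "pdf" entry first is a deliberate choice here, since a previous attempt (or another installer) may already have added it.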

Then go to the next key:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Setup\ContentIndexCommon\Filters\Extension

Listed here should be the file extensions, each with a CLSID (class ID) for the ifilter used to index that extension. If .pdf is not listed, add it as a key (it should have a multi-string value). In that multi-string value, you need to add the CLSID for the ifilter added by Adobe Reader. For version 8.1.1, this file is called "AcroRDIF.dll." You can look up its CLSID by doing a Find under the CLSID key (under HKEY_CLASSES_ROOT).

[[ Edited to add-- the CLSID that you are looking for to do filtering is under HKEY_CLASSES_ROOT\CLSID\ --I mention this because there are other CLSIDs that relate to other functions for acrobat in the registry. The one we need has the Reg SZ value of "PDF Filter".]]

Or, conveniently, because the CLSID is the same for version 8.1.1 as it is for version 8.0 of Adobe Reader and is posted on the web in several places, you can just type it in:

{E8978DA6-047F-4E3D-9C78-CDBE46041603}

Mind you, if you are using a different, newer version, this CLSID may not work, so you'll need to find out the ifilter file name for your version and then search for it in the CLSID key in the registry. I find that searching on the true filesystem path plus the file name works better than searching on the file name alone, because the dll can be listed in a few places. You may need to experiment.

Anyway, I digress-- Once you've either found or added the .pdf key, enter the CLSID for the value (be sure to include the fancy brackets).
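Since the dll can turn up under more than one CLSID, the selection rule described above (the entry labeled "PDF Filter" whose InprocServer32 path points at the ifilter dll) can be sketched in Python. Treat this as an illustration only: `find_filter_clsid` is a made-up helper, and the sample entries and file paths are assumed data standing in for whatever you find under HKEY_CLASSES_ROOT\CLSID on your server.

```python
def find_filter_clsid(entries, dll_name="AcroRDIF.dll", label="PDF Filter"):
    """Pick the CLSIDs whose default value is `label` and whose
    InprocServer32 path ends with `dll_name`.

    `entries` maps each CLSID string to a (default_value, inproc_path)
    pair, as you might note them down from HKEY_CLASSES_ROOT\\CLSID.
    """
    matches = []
    for clsid, (default_value, inproc_path) in entries.items():
        same_label = (default_value or "").strip().lower() == label.lower()
        same_dll = (inproc_path or "").lower().endswith("\\" + dll_name.lower())
        if same_label and same_dll:
            matches.append(clsid)
    return matches


# Assumed sample data: two CLSIDs that both point at the same dll,
# only one of which carries the "PDF Filter" label.
dll = r"C:\Program Files\Adobe\Reader 8.0\Reader\AcroRDIF.dll"
sample = {
    "{E8978DA6-047F-4E3D-9C78-CDBE46041603}": ("PDF Filter", dll),
    "{789AD2D7-E1C2-4EC7-A049-2DB5BB4CB57A}": ("", dll),
}
print(find_filter_clsid(sample))
```

Returning a list rather than a single value is intentional: if more than one entry matches, that is a sign to stop and look at the registry by hand rather than guess.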

To let the server know where the Adobe Reader executable and its associated files are, add its path to the environment variables of the server.

(Start Menu > right-click My Computer > select Properties > go to the Advanced tab > click the Environment Variables button > scroll down to the Path variable, select it, and click the Edit button > add the path ";C:\Program Files\Adobe\Reader 8.0\Reader" to the end (be sure to use the correct version if you are using something newer than 8.x) > then click OK to apply and close.)
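If you would rather work out the new Path value before pasting it into the dialog, here is a minimal Python sketch of the edit. It only builds the string; it does not change any system setting. `append_to_path` is a made-up helper, and the "current" Path shown is an assumed example, not what your server will have.

```python
def append_to_path(path_value, new_dir):
    """Return `path_value` with `new_dir` appended, unless it is already there.

    Windows separates Path entries with semicolons and ignores case, so the
    comparison lowercases each entry and drops any trailing backslash.
    """
    def normalize(entry):
        return entry.strip().rstrip("\\").lower()

    entries = [e for e in path_value.split(";") if e.strip()]
    if normalize(new_dir) in {normalize(e) for e in entries}:
        return path_value
    return ";".join(entries + [new_dir])


reader_dir = r"C:\Program Files\Adobe\Reader 8.0\Reader"
current = r"C:\WINDOWS\system32;C:\WINDOWS"  # assumed example value
print(append_to_path(current, reader_dir))
```

The duplicate check matters mostly if you end up repeating these steps after a Reader upgrade, so the same folder is not added twice.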

Finally, to let WSS know that it needs to index PDF files now, you can do one of two things:

1) reboot the server (seems to always work, but may not be possible in your environment). Instinctively, I guess because I am old school, I reboot when making changes to the registry.

OR

2) First stop and restart the Windows SharePoint Search Service (at the command prompt, use "net stop spsearch" then "net start spsearch"). Then, if you don't want to wait for the index service to crawl on its preset schedule, force it to do a full crawl using STSADM:

stsadm -o spsearch -action fullcrawlstop

(again that's a bit of instinct there, I am figuring if it happens to be in the middle of a crawl I want it to stop and start over using the new ifilter)

stsadm -o spsearch -action fullcrawlstart

Remember that the stsadm command is in the C:\Program Files\Common Files\Microsoft Shared\web server extensions\12\BIN folder (if that folder isn't already set in the Path environment variable).
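For anyone who repeats option 2 often, the four commands can be collected into a small Python script. This is a sketch, not something I'd call battle-tested: the function names are made up, the stsadm.exe location is assumed from the folder mentioned above, and by default the script only prints the commands (a dry run) rather than running them.

```python
import subprocess

# Assumed location, based on the BIN folder mentioned above.
STSADM = (r"C:\Program Files\Common Files\Microsoft Shared"
          r"\web server extensions\12\BIN\stsadm.exe")


def recrawl_commands(stsadm=STSADM):
    """Return the commands from option 2, in the order they should run."""
    return [
        ["net", "stop", "spsearch"],
        ["net", "start", "spsearch"],
        [stsadm, "-o", "spsearch", "-action", "fullcrawlstop"],
        [stsadm, "-o", "spsearch", "-action", "fullcrawlstart"],
    ]


def run_recrawl(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in recrawl_commands():
        print(subprocess.list2cmdline(command))
        if not dry_run:
            result = subprocess.run(command)
            # Report the exit code and keep going, so a failed step is visible.
            print("exit code:", result.returncode)


run_recrawl()  # dry run: prints the four commands without running them
```

Call `run_recrawl(dry_run=False)` from an administrator prompt on the WSS server to actually run them. Note that the service name is "spsearch" in both the stop and the start command.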

Regardless of which you choose, it may take some time for the pdf files to be indexed properly. I have found the second option to not be as guaranteed to work as simply rebooting and waiting for it to index on its own. But, no matter how long it initially takes, I found this free and relatively easy solution to indexing PDF files with WSS 3.0 to work, every time.

18 comments:

Anonymous said...

Hello,
I tried this.
I have Adobe 8.1.2 installed, the extension in registry set, so in WSS/MOSS (docicon.xml), the CLSID in ...ContentIndexCommon\Filters\Extension\.pdf equals the CLSID which can be found at AcroRDIF.dll, and the environment variable set. It is completely different from that one you mention, is that all right or is it supposed to be a sort of similar?
Even after full recrawl, I am unable to see strings from pdf.
Do you have any idea what might I have done improperly?
Otherwise, thank you for the article, it's useful.
Regards
Ravie

Anonymous said...

Eh, the "It is completely different..." refers to "CLSID". Sorry :)
Ravie

Callahan said...
This comment has been removed by the author.
Callahan said...

(yikes, I was deleted...)

Let me try that again-- Ravie, I don't know why you were even in docicon.xml. That has nothing to do with search, only with what icon is associated with what extension.

Did you add the correct registry entries? Did you rerun the indexing service?

Callahan said...

Ravie, on further perusal, I have noticed that my docicon.xml file has no CLSIDs in it. That kind of concerns me. But I can assure you that, with acrobat 8.1.2, the clsid {E8978DA6-047F-4E3D-9C78-CDBE46041603} is still correct.

You do realize that if you are using MOSS 2007, that there may be other steps you need to do in the GUI for it to work? MOSS allows for more control of search and indexing in the interface, and as such, may require more administration.

Anonymous said...

Hello Callahan,
my CLSID assigned to the file AcroRDIF.dll was different than that you suggest.
I added this CLSID into registry under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup\ContentIndexCommon\Filters\Extension\.pdf.
The acrordif.dll CLSID can be found under HKEY_CLASSES_ROOT several times.
Now, I have browsed the registry further and found the acrordif.dll with another CLSID that equals that of yours. Therefore I am confused - how could I know what CLSID should have been used?
Btw, the CLSID should not be present in docicon.xml, this is only for adding icons to WSS/MOSS as you say. Sorry for confusing you :).
The only difference between WSS and MOSS regarding this I know about, after two MCTS certifications, is that search should be re-run via stsadm -o spsearch... in WSS and via SSP Full Crawl in MOSS.
Now, after full recrawl of the MOSS site collection, I tried to search for a pdf file saved as pdf from Word 2007 and was successful.
Thank you very much for the article and help.
Kind regards
Ravie.

Anonymous said...

One more question, is it possible somehow to be able to search scanned pdfs? I mean a sort of OCR function. I wonder if you had any experience with this?
Thank you once again.
Ravie

Callahan said...

Hey Ravie,

Thanks for clearing up the docicon.xml thing-- I was unclear what the two had in common and was pretty worried that you'd found CLSID's there...

As for the fact that, if you search the registry, there are other CLSIDs listed for the Acrobat PDF ifilter: that's why I suggested which one to search for (and why I explicitly listed the correct one). However, I should probably have been clearer that you can get confused in the registry by the other CLSIDs listed. It is hard to know.

Also, in MOSS, you can end up having to configure what content to crawl, and where, if you haven't set it up properly to begin with. I've seen it happen and it can increase your troubleshooting time because of the additional hoops to jump through. To restart the indexing service though, it is just a STSADM operation away...

I am glad that it worked out. My apologies for your concerns in the registry. Messing with it can be tricky business-- and I have to wonder why MS hasn't made that easier to do.

Callahan said...

Ravie,

As for scanned OCR PDFs-- I have never done it-- but MS insists that it can read OCR TIFs. That's right, not PDFs, but TIFs. I have seen neither done.

My suggestion (one you've probably already thought of) is to try it and see. Can't hurt. : )

Anonymous said...

Hello Callahan,
I'll try the TIFFs and let you know :).
Regarding differences between WSS and MOSS, in my post I did not include the possibility that search has not been configured so far :). Anyway, it's true that MOSS is a bit more complicated.
Have a nice day!
Ravie.

dengel said...

RE Clsid's-
When I search HKEY_CLASSES_ROOT for AcroRDIF.dll, I get pointed at a registry entry with this value:
{789AD2D7-E1C2-4EC7-A049-2DB5BB4CB57A}

However, Acrobat already placed a Clsid under:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Setup\ContentIndexCommon\Filters\Extension\.pdf
with this value:
{4C904448-74A9-11D0-AF6E-00C04FD8DC02}

Any idea what I should have in this key, or should I just leave it alone?

Don.

Callahan said...

Don,

you are absolutely right that the registry has several clsid numbers listed (I mentioned that that could be the case).

I too have the 789... clsid as well as several others. (one of my hkey local machine ids starts with an 8215BA...), but if you notice, they do not have "PDF Filter" listed as their Reg SZ data value. (in my case 8215BA... has the value of
"PSFactory buffer" and my 789... value is blank).

What you need to find is the key with "PDF Filter" as the Reg SZ value, with the inprocserver32 value showing as the true file system path to the acrordif.dll file. It should be the one in the HKEY_Classes_Root\CLSID\ path in the registry.

So-- Don, do not change any of those other values. They are supposed to be there to do other things than pdf ifiltering. The value needed for the Ifilter for Adobe Acrobat 8.0 or higher (so far) is the one I listed in the blog post.

And, because you brought this up I am going to edit the blog entry to include this, in case it confuses any one else. As a matter of fact, I think Ravie mentioned this issue briefly in his comment as well (saying his number was completely different-- and he is right, there are several clsids for adobe that are completely different).

So thank you both for mentioning it. I hope this helps.

Anonymous said...

A little bit later than promised :), I am posting a comment regarding TIFF OCR. Tiff OCR works, so does PDF, but not PDF OCR so far. Search results include special (national) characters as well.
Ravie.

Anonymous said...

I wonder if you happen to know about the PDF iFilter 8 64-bit support with Windows SharePoint Services x64 on Server 2008 x64: what registry edits are required? Adobe has a document on this install, but it is only for SharePoint Server. The Adobe forum (http://www.adobe.com/cfusion/webforums/forum/categories.cfm?forumid=72&catid=654) is a little lacking in responsiveness.

Jennelle Crothers said...

I love handy posts that explain exactly what I need to do. I've been spending a lot of time with SharePoint (WSS 3.0) lately and your posts and the book have been great!

Heinz Ruppert said...

Hello,
I've installed Adobe Reader version 9 on our WSS Server and the GUID you need to crawl into PDF files is: {8215BA54-B69F-4275-AE11-31CB63593B09}. Don't forget to stop and start the spsearch service and do a fullcrawlstop and fullcrawlstart after changing the registry entry.
Much luck
Heinz

Callahan said...

Hi Heinz,

Thanks for the update concerning the GUID for the newer version of Adobe Reader! And definitely, always remember to restart the search service so it realizes the change has occurred. : )

-callahan

Anonymous said...

Worked well, thanks