Friday, December 30, 2011

Frugal Admin, special edition: How to get your SharePoint Foundation 2010 server to index RTF files

Hi there,

In this special edition, I am going to tell you how to index rich text files (document files with the rtf extension).

(to see it done in action, go to http://www.livestream.com/callahanSPF4admins and watch "Enabling RTF indexing on SharePoint Foundation 2010")

Now I know, I know, you've got to be saying, "Callahan, how often does anyone need to upload a rich text file? I mean c'mon."

But hey, it can happen. How about having users that are working on different platforms and don't have Word installed? What if there is a piece of software on your network that puts out RTF files for some reason, and you need to have them in a library on your SharePoint site? Maybe your tech support site uses RTF files so they're compatible with everyone?

For whatever reason, it appears that there is a little something broken in the registry for SharePoint so it can't do something so simple, so assumed, as search rich text files.

You see, it all started when someone tweeted asking if SP2010 could index rtf files natively or if it "needed an ifilter" (meaning they'd have to go install one). I just so happened to be doing a lot of work with PDF ifiltering, so I was well qualified and ready to check into it.

I thought their question was sincere, so I started looking. It turns out that seconds after the question, someone tweeted back saying it couldn't be done.

Of course, I was busy digging, so I didn't know it couldn't be done.

And so...

...I did it.
(later I did find out that there is a book out there telling SharePoint Server people to just register the rtf ifilter DLL and it will work fine for them-- but that definitely doesn't work in SharePoint Foundation, and might've stopped me right there had I known...)

[for my tl;dr readers- the short form of how to get rtf ifiltering to work in SharePoint Foundation:
  • change the value of the key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\14.0\Search\Setup\ContentIndexCommon\Filters\Extension\.rtf to the correct DLL CLSID: {e2403e98-663b-4df6-b234-687789db8560}
  • run the AddExtension.vbs script that you copy from the internet so it will permanently add an rtf extension to the ExtensionList at key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\14.0\Search\Applications\6519b45e-2869-4f5a-9bb5-ec60370309fb\Gather\Search\Extensions\ExtensionList
  • reboot the server (you have to, so it reads the changes to the registry)
  • upload an rtf file that has at least one unique word in it to a library in SharePoint
  • then wait for search to run an index, or force a full crawl- when it's done, you'll be able to search your RTF by that unique word and have it show up in the search results.
And that's it. But to see why and how I knew to do this stuff, how to do it in step by step detail, and why it works for SPF, read on]


First I checked to make sure that SharePoint Foundation 2010 (SPF) could not, in fact, index RTF files. I uploaded one to a library, did an iisreset, then ran a full crawl (stsadm -o spsearch -action fullcrawlstart; keep in mind that to run that command in 2010, as opposed to earlier more security conscious versions, the account you're logged in as must OWN the search database...). Then I did a search on the file name, which proved the full crawl worked. Finally I tried to search by text in the RTF file and had it fail, proving that RTF file indexing failed.

Once I knew it failed, I then went to the registry, because I knew that other than an ifilter's DLL, the settings in the registry were key to having ifiltering work in SharePoint Foundation.

Now, when using Adobe's PDF ifilter, I needed to go to the registry, add an entry to the "ExtensionList" for applications, and an Extension key for .pdf with the correct CLSID pointing to Adobe's PDF ifilter DLL. These two things were critical for success.

So I checked to see if there were any entries for "rtf" in the same places in the registry. I found something interesting.

There was no listing for "rtf" in the ExtensionList key (see figure below for details- the full path in the registry is listed at the bottom of the window). I've been given to believe (and I am correct) that an ifilter won't work for SPF without a listing for the file extension here.

Then I went to check the second registry entry I'd learned was important, a key under Setup\ContentIndexCommon\Filters\Extension. Each file type that SharePoint Foundation can possibly search is listed here with its own key. The key contains, at the minimum, a default value that is the CLSID of the DLL used by the ifilter for that file type. RTF did have a key.

To be thorough, I wanted to know what DLL that value was pointing to. It should be the CLSID for the file's ifilter DLL.

To check that, I selected the CLSID key under HKEY_CLASSES_ROOT and did a find (go to Edit on the menu bar and click Find, or press ctrl+f) for the CLSID value listed for the rtf extension ({35500004-002C-0000-0000-000000000000} as it happens to be). What came up was the plain text filter's CLSID, not the one for rich text files:





Every CLSID key for an ifilter has to have an InProcServer32 sub-key. It will list the path to the DLL for that ifilter. In this case, to really prove it has nothing to do with rich text, the InProcServer32 sub-key's path goes to tquery.dll-- the dll used for simple, plain text indexing.
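If you'd rather script that lookup than click through regedit, here is a minimal Python sketch of the same check. The function names are made up for illustration, and it assumes Python is installed on the server, which nothing else in this walkthrough requires. The winreg module it uses ships with Python on Windows only, so the registry read lives inside a function that you call yourself on the server.

```python
# Sketch only: the function names are hypothetical, not part of SharePoint.
# The CLSID below is the one found on the .rtf Extension key in this post.

RTF_KEY_CLSID_AS_INSTALLED = "{35500004-002C-0000-0000-000000000000}"


def inproc_key_path(clsid: str) -> str:
    """Build the path, relative to HKEY_CLASSES_ROOT, of a CLSID's InprocServer32 key."""
    return "CLSID\\" + clsid + "\\InprocServer32"


def filter_dll_for(clsid: str) -> str:
    """Return the DLL path stored as the default value of InprocServer32 (Windows only)."""
    import winreg  # standard library, but only importable on Windows

    # KEY_WOW64_64KEY asks for the 64-bit registry view even from 32-bit Python.
    access = winreg.KEY_READ | winreg.KEY_WOW64_64KEY
    with winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, inproc_key_path(clsid), 0, access) as key:
        value, _value_type = winreg.QueryValueEx(key, "")  # "" means the (Default) value
        return value
```

On the server, print(filter_dll_for(RTF_KEY_CLSID_AS_INSTALLED)) should show a path ending in tquery.dll if your install has the same problem mine did.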


I thought that couldn't be right. It looked like the wrong CLSID for the rtf key for ifiltering had been entered by the SPF installer during setup.

And I figured, if that was the case, I just needed to find the rtf ifilter, if it existed by default (which I had to assume it did, I mean, really), and use its CLSID instead.

So I went back up to the CLSID key under HKEY_CLASSES_ROOT, and did a Find for "RTF Filter". Why, you ask, did I know to use those exact words? Because the name for the CLSID for the PDF ifilter was PDF Filter, so I figured it would probably be like that for rtf.

And I found it. The value for the rtf ifilter was: {e2403e98-663b-4df6-b234-687789db8560}





Also notice in the picture that the DLL for the rtf filter is "rtffilt.dll". During all this I'd also looked on the internet to see if anyone had been trying to use an rtf ifilter. There were blog entries and forum posts about getting rtf ifilters online, downloading them and using those, and few for SharePoint except, ironically, two for SharePoint Search Express. One refers to a DLL that Microsoft apparently published several years ago named "rtffilt.dll" (now it appears built into server 2008 R2) and one that actually had you register a DLL that was already in system32, so I knew the file already existed on the server.

(to note: however, the blog entry that registers the DLL does something interesting, it has you copy the file from system32 to the sysWOW64 folder and register both: http://thetrainndt.posterous.com/?tag=ifilter Just mentioning it in case your system requires that for some reason- not sure why you would...)

Anywho, obviously, the correct CLSID for the existing rtf ifilter is the value I listed before the picture.

So I copied the correct CLSID value (I right clicked the CLSID key on the right side of the window, and selected "Copy Key Name"), then went back to the rtf Extensions key under ContentIndexCommon\Filters\Extension and changed its value to the correct one (never forget the curly brackets have to be on either end of the alphanumerics) by pasting the key name. You'll have to delete some of the key information so only the CLSID remains.
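The trimming step is easy to get wrong by a character, so here is a small Python sketch of it. It takes the text that "Copy Key Name" puts on the clipboard and keeps only the CLSID, curly brackets included. The function name and the pattern are my own illustration, not anything SharePoint or regedit provides.

```python
import re

# Hypothetical helper: matches a CLSID in its usual 8-4-4-4-12 hex form,
# including the curly brackets that the registry value must keep.
CLSID_PATTERN = re.compile(
    r"\{[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}\}"
)


def clsid_from_key_name(key_name: str) -> str:
    """Return just the {CLSID} part of a copied registry key name."""
    match = CLSID_PATTERN.search(key_name)
    if match is None:
        # Most likely the curly brackets were lost, or the text is not a CLSID key.
        raise ValueError("no CLSID with curly brackets found in: " + key_name)
    return match.group(0)


copied = r"HKEY_CLASSES_ROOT\CLSID\{e2403e98-663b-4df6-b234-687789db8560}"
print(clsid_from_key_name(copied))
```

If the function raises ValueError, the pasted text is missing its brackets or is not a CLSID at all, which is exactly the mistake to catch before it goes into the registry.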

Once that was done, I needed to add the .rtf extension to the Applications\Gather\Search\Extensions\ExtensionList (we checked that earlier in this entry, and it was missing). Now these extensions are numbered, so we have to add a string value of the next higher number (in my case that'd be 49, in yours it'll probably be 48). Then double click the value to enter "rtf" (without the quotes of course) as the value.
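Picking "the next higher number" can be sketched in a few lines of Python. This assumes the value names in ExtensionList are plain whole numbers, which is what I saw on my server; the function name is made up for illustration.

```python
def next_extension_value_name(existing_names):
    """Return the next numeric value name, given the names already in ExtensionList."""
    numbers = [
        int(name)
        for name in existing_names
        if name.isascii() and name.isdigit()  # skip anything that is not a plain number
    ]
    # With no numbered values at all, this sketch starts counting at 1.
    return str(max(numbers, default=0) + 1)


# Example: a server whose list currently runs from 1 to 48.
current = [str(n) for n in range(1, 49)]
print(next_extension_value_name(current))
```

It takes the highest existing number rather than counting the entries, so a gap in the numbering does not make it reuse a name that is already taken.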

However, I have found that, with server 2008 R2 (especially with all the most recent updates and service pack) that ExtensionList key is protected, and no matter what I do (take ownership of the key, subkeys, etc., for example), the change is deleted in a few hours or on next reboot.

To overcome this, there is a simple visual basic script you can run to override that behavior and "register" your extension correctly in the ExtensionList. It won't disappear and it won't delete after reboot.

The easy way to get that script is to go to http://support.microsoft.com/kb/2518465 . In that KB article is the text for the visual basic script- just copy and paste it into a text file (if you don't feel like going to the KB, here it is for your convenience):
---------------------
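' AddExtension.vbs: adds a file extension to the SharePoint Foundation
' search extension list for every gather project.

Sub Usage
    WScript.Echo "Usage:    AddExtension.vbs extension"
    WScript.Echo
End Sub

Sub Main
    ' Require the extension (for example, rtf) as the first argument.
    If WScript.Arguments.Count < 1 Then
        Usage
        WScript.Quit(1)
    End If

    Dim extension
    extension = WScript.Arguments(0)

    ' Connect to the SharePoint Foundation search gather manager.
    Set gadmin = WScript.CreateObject("SPSearch4.GatherMgr.1", "")

    ' Register the extension with every project of every gather application.
    For Each application In gadmin.GatherApplications
        For Each project In application.GatherProjects
            project.Gather.Extensions.Add(extension)
        Next
    Next
End Sub

Call Main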
-------------------

Once I copied the text above into a text file, I saved the text file as AddExtension.vbs (make sure you select All Files *.* for the "Save as Type" field, so it doesn't save the file with a txt extension anyway). Always pay attention to where you save files, it comes in handy later.



That script has to be run in order to make the necessary change in the registry. That's why I needed to know where the script was saved. So I opened an explorer window and browsed to the location where I put the new vbs file. Then I shift+right clicked in the window and selected to Open command window here.


I then entered the following command in the command prompt window and hit enter (of course):

wscript AddExtension.vbs rtf



That ran the script and added the correct entry in the registry, which now won't disappear if I reboot.  Which is good, because after the script runs, you have to reboot the server to get it to read the change (I know, that sucks, but at least you know for certain that it's necessary).

--You can confirm whether the command ran by trying to run it again- it should give you a warning dialog box saying the object already exists. You can also go into the registry and check for a value in the ExtensionList at key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\14.0\Search\Applications\6519b45e-2869-4f5a-9bb5-ec60370309fb\Gather\Search\Extensions\ExtensionList. If it's there, then the script worked.--
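That registry check can also be done with a short Python sketch, if Python happens to be on the server. The key path is the one given above, including the application GUID from my server (yours may differ). The function names are hypothetical, and the registry read is Windows only, so it sits in its own function.

```python
# Sketch only: function names are made up; the key path is the one from this post.
EXTENSION_LIST_KEY = (
    r"SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\14.0\Search"
    r"\Applications\6519b45e-2869-4f5a-9bb5-ec60370309fb"
    r"\Gather\Search\Extensions\ExtensionList"
)


def has_extension(values, extension):
    """Check a list of (name, data) pairs for an extension, ignoring case and a leading dot."""
    wanted = extension.lower().lstrip(".")
    return any(str(data).lower().lstrip(".") == wanted for _name, data in values)


def read_extension_list():
    """Read every value under the ExtensionList key as (name, data) pairs (Windows only)."""
    import winreg  # standard library, but only importable on Windows

    # KEY_WOW64_64KEY asks for the 64-bit registry view even from 32-bit Python.
    access = winreg.KEY_READ | winreg.KEY_WOW64_64KEY
    values = []
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, EXTENSION_LIST_KEY, 0, access) as key:
        value_count = winreg.QueryInfoKey(key)[1]  # second item is the number of values
        for index in range(value_count):
            name, data, _value_type = winreg.EnumValue(key, index)
            values.append((name, data))
    return values
```

On the server, has_extension(read_extension_list(), "rtf") returning True means the entry is there. Run it again after the reboot to confirm the entry survived.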

Once the server rebooted, I needed something to test to confirm if rtf ifiltering would work. So I uploaded a rtf file with unique text in it:



Then I ran a full crawl (you can wait for the server to do it itself).

An example of how to do that using STSADM:

stsadm -o spsearch -action fullcrawlstart

[remember that to use a PowerShell or STSADM command to do a full crawl with SharePoint 2010 be sure the account you are logged in with owns the search database (yeah, I kid you not)]
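If you want to kick off the crawl from a script instead of typing the command, here is a minimal Python sketch. It assumes stsadm.exe sits in the usual SharePoint 2010 BIN folder under Common Files, which you should verify on your own server, and the function names are made up for illustration. The same ownership rule for the search database applies to whatever account runs it.

```python
import ntpath
import os
import subprocess


def stsadm_path() -> str:
    """Guess the location of stsadm.exe (assumption: the default SharePoint 2010 install folder)."""
    common = (
        os.environ.get("CommonProgramW6432")  # 64-bit Common Files, even from 32-bit Python
        or os.environ.get("CommonProgramFiles")
        or r"C:\Program Files\Common Files"
    )
    return ntpath.join(
        common, "Microsoft Shared", "Web Server Extensions", "14", "BIN", "stsadm.exe"
    )


def full_crawl_command() -> list:
    """Build the same command line shown in this post, as an argument list."""
    return [stsadm_path(), "-o", "spsearch", "-action", "fullcrawlstart"]


def start_full_crawl() -> int:
    """Run the command and return its exit code (only meaningful on the SharePoint server)."""
    completed = subprocess.run(full_crawl_command(), check=False)
    return completed.returncode
```

Calling start_full_crawl() only starts the crawl. You still have to wait for it to finish before testing your search.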

People may say you need to restart the search service (net stop spsearch4 then net start spsearch4) before doing the full crawl, but that is not necessary- rebooting the server, by definition, restarts the service.

To test if the full crawl worked, after the master merge had completed (you can see two entries in the Application event log under the category "Content Index Server"), I went to the SharePoint site where I uploaded the RTF file, and did a search using a word in the title of the file. When it came up in the search results, I looked for two things. 1) The result itself proved that the full crawl was successful, because SharePoint was at least able to index the metadata for the file's title. 2) If a little summary of the text in the file is displayed under the title in the search results, then SharePoint was able to index the content inside the file, meaning the rtf ifilter did work.


And, of course, the true test- doing a search on the site where the file is located, using one of the unique words in the rtf file itself- if it returns the rtf file in the search results, then it worked. And in my case, it did.

So the bottom line:


-Do not let anyone tell you that SharePoint Foundation 2010 cannot index/search RTF files. It can. Out of the box, with only two registry entries and a reboot.
-Do not let anyone tell you that you must BUY and install an RTF ifilter in order to be able to index RTF files. Spending money is NOT necessary, the file should already be in the system32 folder.
-The suggestions made to get SharePoint Server 2010 to index RTF files (namely, just registering the rtffilt.dll) do not work for SharePoint Foundation 2010. Just because that fix doesn't work for SharePoint Foundation does not mean SPF cannot search rtf files. That's just silly, and I've proven it. Thanks for reading this far. :)
