Wednesday, February 27, 2008

Adding PDF icons to WSS 3.0, and what went wrong when I tried it...

So I received a comment from someone on my "how to index PDFs" entry last month. Flattered, I went to take a looksee at what they wrote.

I was a little confused when the commenter mentioned that they had tried to make changes to the docicon.xml file and still couldn't get their search to work.

Hmmm, I don't remember touching the docicon.xml file when I enabled PDF indexing. I just downloaded and installed Adobe 8.x, added some registry entries, and did the necessary restarts to get SharePoint to index the files anew.

While trying to figure out why he thought docicon.xml had something to do with indexing filters, I realized that I forgot to do something important: change the default PDF icon.

Ahh, that's what using the PDF IFilter from Adobe has to do with the docicon.xml file: you edit it to associate the correct icon with the pdf extension. I guess he thought the two processes were one and the same.

Yes, I know it's just a prettifyin' thing and not essential for search to work on PDFs, but what the heck, it's easy.

So if you don't want that pesky default blank paper icon to show up next to your PDF files in your libraries or search results, do the following:

  1. Download the icon file from Adobe (or wherever, some people have better ones). Make sure it's small (the default from Adobe is 17x17). http://www.adobe.com/misc/linking.html#pdficon

  2. Save the icon to the c:\Program Files\Common Files\Microsoft Shared\web server extensions\12\TEMPLATE\IMAGES folder. I renamed it "icpdf.gif" myself, just so it matches the format of all the other doc icon files used in the DOCICON.XML.

  3. Open the DOCICON.XML file in notepad (the DOCICON.XML file is located in the c:\Program Files\Common Files\Microsoft Shared\web server extensions\12\TEMPLATE\XML folder).

  4. Once in the DOCICON.XML file, go to the "<ByExtension>" section (see figure above) and add the tag: "<Mapping Key="pdf" Value="icpdf.gif" OpenControl=""/>" (minus the outer quotes, they're just there to show you what you are supposed to type; again, blogger hates it when you add non-HTML tags).
    The tag means that the icon "icpdf.gif" (or whatever you named your PDF icon file) will be mapped to the .pdf extension. I do not have an open control (like using Word to edit a doc file) for PDFs, so I left it blank between the quotes.

  5. Save the file, then drop to a command prompt and do an IISRESET. This should let SharePoint know there has been a change and repopulate the pages with the new icons.

    --- Important note: I could not get the PDF icon to work for the longest time. So pay heed: the capitalization of the words "Mapping", "Key", "OpenControl" and "Value" in the DOCICON.XML file is important, because XML tag and attribute names are case-sensitive. I did not capitalize the word "value", and no matter what I did, the icon would not work. So when you are working in the DOCICON.XML file, type the tag and attribute names exactly as shown. ---
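If you'd rather script the edit than hand-type it in Notepad, here is a rough Python sketch of steps 3 and 4, plus a check for the capitalization mistake I made. Treat it as an illustration under assumptions, not a tested admin tool: the function names are mine, and SAMPLE is a cut-down stand-in for the real DOCICON.XML, not a copy of it. You would still need to save the result back to the file and do the IISRESET yourself.

```python
import xml.etree.ElementTree as ET

# Attribute names exactly as spelled in the tag from step 4; XML is case-sensitive.
REQUIRED_ATTRS = ("Key", "Value", "OpenControl")

def add_icon_mapping(docicon_xml, extension, icon_file):
    """Return the XML text with a <Mapping> for `extension` inside <ByExtension>."""
    root = ET.fromstring(docicon_xml)
    by_ext = root.find("ByExtension")
    if by_ext is None:
        raise ValueError("no <ByExtension> section found")
    # If the extension is already mapped, update its icon instead of adding a duplicate.
    for mapping in by_ext.findall("Mapping"):
        if mapping.get("Key", "").lower() == extension.lower():
            mapping.set("Value", icon_file)
            return ET.tostring(root, encoding="unicode")
    # Otherwise append a new mapping with an empty open control, as in the post.
    ET.SubElement(by_ext, "Mapping",
                  {"Key": extension, "Value": icon_file, "OpenControl": ""})
    return ET.tostring(root, encoding="unicode")

def miscapitalized_attributes(docicon_xml):
    """List (key, attribute) pairs whose attribute name is wrong only by case."""
    expected = {name.lower(): name for name in REQUIRED_ATTRS}
    problems = []
    for element in ET.fromstring(docicon_xml).iter("Mapping"):
        for attr in element.attrib:
            if attr.lower() in expected and attr != expected[attr.lower()]:
                problems.append((element.get("Key") or element.get("key"), attr))
    return problems

# Hypothetical, trimmed-down stand-in for DOCICON.XML (assumed layout).
SAMPLE = """<DocIcons>
  <ByProgID />
  <ByExtension>
    <Mapping Key="doc" Value="icdoc.gif" OpenControl="SharePoint.OpenDocuments" />
  </ByExtension>
  <Default>
    <Mapping Value="icgen.gif" />
  </Default>
</DocIcons>"""

if __name__ == "__main__":
    print(add_icon_mapping(SAMPLE, "pdf", "icpdf.gif"))
    # Simulate my mistake: a lowercase "value" attribute.
    broken = SAMPLE.replace('Value="icdoc.gif"', 'value="icdoc.gif"')
    print(miscapitalized_attributes(broken))
```

Running the check against your edited file before the IISRESET would have saved me a fair bit of head-scratching. Back up the original DOCICON.XML first, whichever way you edit it.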

Once you have done an IISRESET on the WSS server (where, of course, all of this is taking place), you should be able to go into the library where PDF files are listed and see the correct icon next to them.

And when you do a search, the PDF files in the results should show up with the correct icon as well.

So that's what the DOCICON.XML file is for, and how to add the correct little icon images to the file extensions you use in SharePoint. Thanks Ravie.

Thursday, February 14, 2008

How to use WSS 3.0 to search more than one site collection in one go

I have been working on some courseware for Microsoft, and in the process have come across something interesting.

Search queries the content database of a web application for data, right? And search is supposed to confine its search to the site collection the user typed the query into, right?

But what search actually does is follow the URL path from the point where the query was made (be it at the top-level site, or at a subsite) downwards through everything beneath it.

So what if you were searching from the top-level site of the root site collection in a web application? Theoretically, all site collections from there are on its path and therefore available to be used for search results.

So, theoretically, searching from the top-level site of http://sp2/, you could search http://sp2/sales, http://sp2/marketing/document1workspace, http://sp2/sites/saffronsblog, etc. Even though saffron's blog is a different site collection altogether, it's still on the path.

What makes that possible (or generally impossible, which is why the standard assumption is that searches are site collection-centric)? Most site collections have different user accounts in them. They are usually user boundaries, created to give different people access to different data. But if you have an account on the root site collection of a web application that is also a member of the other site collections on the path, then that person can, in fact, do a search at the top-level site of the root site collection, and get results that are located in the other site collections.

Search is generally limited by site collection due to security filtering. The other site collections on the path are omitted because the user doing the query doesn't have the right to see those results-- not because search doesn't go out to those additional site collections.

The caveat is that the account must also be a member of the other site collection(s) for their results to show up. But if you have someone, say an IT staff member, who is a member of other site collections in that web application, then they can do this cross-site-collection search trick.
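To make the reasoning concrete, here is a toy Python model of what I think is going on: results are scoped by URL path first, then trimmed by membership. This is only my mental model written as code, not how WSS search is actually implemented. The user names, document URLs, and the `search` function are all made up for illustration; only the http://sp2/ addresses come from the example above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    url: str              # address of the indexed item
    site_collection: str  # root URL of the site collection that owns it

def search(index, query_scope, user, memberships):
    """Return URLs under `query_scope` that `user` is allowed to see."""
    scope = query_scope.rstrip("/") + "/"
    results = []
    for hit in index:
        # Step 1: path scoping, everything at or below the site queried from.
        on_path = (hit.url.rstrip("/") + "/").startswith(scope)
        # Step 2: security trimming, the user must belong to the owning collection.
        allowed = user in memberships.get(hit.site_collection, set())
        if on_path and allowed:
            results.append(hit.url)
    return results

# Hypothetical index and memberships (assumed for the example).
INDEX = [
    Hit("http://sp2/sales/report.docx", "http://sp2/"),
    Hit("http://sp2/sites/saffronsblog/post1.aspx", "http://sp2/sites/saffronsblog"),
]
MEMBERSHIPS = {
    "http://sp2/": {"it_admin", "sales_user"},
    "http://sp2/sites/saffronsblog": {"it_admin"},
}

if __name__ == "__main__":
    # Member of both collections: sees results from both.
    print(search(INDEX, "http://sp2/", "it_admin", MEMBERSHIPS))
    # Member of the root collection only: the blog hit is trimmed away.
    print(search(INDEX, "http://sp2/", "sales_user", MEMBERSHIPS))
```

In this model nothing stops the query from reaching the blog's site collection. The second user simply never sees those hits because the membership check removes them.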

Try it. Break the site collection search boundary. So far, over and over, it has worked for me.

And if this is a security issue, giving the wrong people the right to search where you don't want them to-- well, you shouldn't have made them members of the other site collections...