Searching PDF with ht://Dig

I’ve just enabled indexing and searching of .pdf documents on the Learning Commons website.

We’re using ht://Dig as our search engine, and it’s quite flexible. It accepts external parsers that let it read file formats other than plain text. There are parsers and converters available for .rtf, .pdf, .ps, .doc, .swf, .xls, and even .ppt files.

For now, I’ve only added the .pdf parser, using the Xpdf tools. There was no binary available for Mac OS X, so I had to compile from source. Here’s a link to the compiled binaries for Mac OS X (compiled without support for the X11 windowing system, so these are just the command line utilities). Just drop them in /usr/local/bin and enjoy!
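As a rough illustration of how such a parser hooks in: ht://Dig hands an external converter the path to the fetched file as its first argument (followed by the content type, the URL, and the config file), and reads the converted text from standard output. The sketch below wraps Xpdf's pdftotext that way. The script name pdf2text.py, the build_command helper, and the install path are all hypothetical, and this is not the exact setup used on commons.

```python
#!/usr/bin/env python3
"""pdf2text.py: hypothetical wrapper that converts a PDF to plain text on stdout."""
import subprocess
import sys

# Assumed install location for the Xpdf command line tools.
PDFTOTEXT = "/usr/local/bin/pdftotext"


def build_command(pdf_path, pdftotext=PDFTOTEXT):
    """Return the pdftotext command line; "-" as the output file means stdout."""
    return [pdftotext, pdf_path, "-"]


def main(argv):
    # argv[1] is the file to convert; any further arguments
    # (content type, URL, config file) are ignored in this sketch.
    if len(argv) < 2:
        print("usage: pdf2text.py FILE [CONTENT-TYPE URL CONFIG]", file=sys.stderr)
        return 2
    result = subprocess.run(build_command(argv[1]), stdout=subprocess.PIPE)
    # Pass the extracted text through unchanged for the indexer to read.
    sys.stdout.buffer.write(result.stdout)
    sys.stdout.buffer.flush()
    return result.returncode


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Running it by hand on a single file (python3 pdf2text.py some.pdf) is a quick way to check that the compiled pdftotext binary works before pointing the indexer at it.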

ht://Dig Website Search Engine

We’re in the midst of reworking the Learning Commons website, and one of the changes is dropping to static files for most of the site (rather than the dynamically generated site we use now). One major consequence is that we need different software to search the site.

I’ve just installed ht://Dig on commons, and it seems to work quite well. I had to compile from source, which I couldn’t do on commons itself for some reason (no developer tools installed on Mac OS X Server 10.3?), so I compiled on my TiBook and moved the binaries etc. to commons after testing that they worked.

The package comes with a plain-vanilla test search page, which shows it works quite well. I’ve had to tell it not to index session-based pages generated by WebObjects applications (ignore URLs with “wosid=” in them). Every session gets its own ID, so otherwise the index would fill up with an endless supply of unique pages that stop being valid once the session expires.
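The exclusion itself is just a substring test on each URL the crawler finds. In ht://Dig this is set in the configuration file (the exclude_urls attribute covers this kind of substring exclusion); the Python below is only a sketch of the logic, with a made-up should_index function and made-up example URLs.

```python
# Substrings that mark a URL as session-specific; "wosid=" is the
# WebObjects session ID parameter mentioned above.
EXCLUDED_SUBSTRINGS = ["wosid="]


def should_index(url, excluded=EXCLUDED_SUBSTRINGS):
    """Return True unless the URL contains one of the excluded substrings."""
    return not any(pattern in url for pattern in excluded)


# Hypothetical URLs, for illustration only.
urls = [
    "http://example.org/about/index.html",
    "http://example.org/cgi-bin/WebObjects/App.woa/wa/page?wosid=abc123",
    "http://example.org/docs/guide.pdf",
]

# Keep only the URLs that are safe to index.
indexable = [u for u in urls if should_index(u)]
```

Here only the static page and the PDF survive the filter; the session URL is dropped before it ever reaches the index.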

Anyway, ht://Dig seems to be used quite a bit Out There, and it seems to work quite well In Here. I’ve set up a cron job to re-index the server every morning at 3am.
