Way back in the frozen, blighted wasteland that was January this year, I was playing with building a “Similar Pages” feature to be part of the proposed IMC open editing/annotations project, in the hopes of getting around the much-noted fact that people suck at metadata. In the process I discovered my linear algebra is more than a little rusty.
The great (and sometimes immensely frustrating) part of the web is that, if you wait long enough, someone else will often do all the hard work. And so it was. When I finally got around to reading the article “Building a Vector Space Search Engine in Perl” by Maciej Ceglowski (lead troublemaker in the LSI revolution), I found all my work had been done for me and laid out in a clean Perl module.
Blog as test bed
Well, an open editing server is still an idea on low simmer, but I’ve got a blog, so LM (and all you folks) are now the guinea pigs. Please check the individual entry archives to see what I’m talking about.
Other options
There have been several Related Entries plugins released for Movable Type, but I never found these very useful, as they either didn’t match how I use categories or would have required me to go back and add keywords to every entry. Besides, they involved not a single matrix transformation.
Hacking
I love object-oriented programming. Using Maciej’s Search::VectorSpace with MT was almost as simple as subclassing it and overriding get_words(). (In the end I had to make some minor tweaks to search() as well.) 40 minutes and 40 lines of code later I was happily building vectors of MT entries and calculating similarities. Building a script to call my new Search::VectorSpace::MT took significantly longer.
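For anyone who wants the shape of the idea without reading the Perl, here is a minimal sketch in Python. It is not the Search::VectorSpace API and not my actual subclass: the class names, the stop-word list, and the 0.1 threshold are all made up for illustration. It only shows the pattern described above, a base class that builds term-count vectors and compares them by cosine similarity, plus a subclass that changes nothing but get_words().

```python
import math
import re
from collections import Counter


class VectorSpace:
    """Toy vector-space index: one term-count vector per document."""

    def __init__(self):
        self.vectors = {}  # doc id -> Counter of term counts

    def get_words(self, text):
        # Default tokenizer: lowercase runs of letters and apostrophes.
        return re.findall(r"[a-z']+", text.lower())

    def add(self, doc_id, text):
        self.vectors[doc_id] = Counter(self.get_words(text))

    @staticmethod
    def cosine(a, b):
        # Cosine of the angle between two sparse term-count vectors.
        dot = sum(count * b[term] for term, count in a.items())
        norm_a = math.sqrt(sum(c * c for c in a.values()))
        norm_b = math.sqrt(sum(c * c for c in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    def similar(self, doc_id, threshold=0.1):
        # Every other document scoring at or above the (arbitrary) threshold,
        # best match first.
        target = self.vectors[doc_id]
        scores = [
            (other, self.cosine(target, vec))
            for other, vec in self.vectors.items()
            if other != doc_id
        ]
        hits = [(other, s) for other, s in scores if s >= threshold]
        return sorted(hits, key=lambda pair: pair[1], reverse=True)


class EntryVectorSpace(VectorSpace):
    """Hypothetical subclass: only get_words() changes."""

    # Illustrative stop-word list, far too short for real use.
    STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "i"}

    def get_words(self, text):
        return [w for w in super().get_words(text) if w not in self.STOP_WORDS]
```

Usage would be something like calling add() once per entry and then similar() with an entry's id to get back (id, score) pairs for the entries that clear the threshold.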
Next Steps
I wouldn’t call the code release-ready. Already I’m noticing some problems; for one, I should be stripping HTML before calculating similarity (should have been obvious). I also haven’t yet jumped through the hoops to make this a Movable Type plugin, though I think a cron’ed crawler plus some liberal use of MT::PluginData might be the best way to go. In the meantime I need to tweak it a little bit, find the right threshold, figure out the best way to respect the metadata I do have, etc., etc. If you see any particularly bad or particularly good results, let me know. Thanks.
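The HTML stripping mentioned above could look roughly like this. Again a Python sketch rather than what the Perl code does; strip_html and the helper class are names I made up, and it leans on the standard library's html.parser. One known rough edge: text chunks are joined with spaces, so a word split by inline markup (say, a bold syllable) comes out as two words.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content, ignoring tags and script/style bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into & etc.
        self.parts = []
        self._skip = 0  # depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(markup):
    """Return the visible text of an HTML fragment, whitespace-normalized."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join chunks with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())
```

The idea would be to run each entry body through something like this before tokenizing, so tag names and attribute values never end up as terms in the vectors.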
Update [2003/03/22]: yeah, it’s not working very well. I think I need to play with term weighting, and strip that HTML.
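For the curious, one common term-weighting scheme is TF-IDF, where a term's count in an entry is scaled down by how many entries contain it. I haven't settled on this, and the sketch below is Python for illustration only; tfidf_vectors is a hypothetical name and the plain log(N/df) form is just one of several variants.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: dict of doc id -> list of words.

    Returns doc id -> {term: weight}, where weight is the raw term count
    times log(number of docs / number of docs containing the term).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))

    vectors = {}
    for doc_id, words in docs.items():
        counts = Counter(words)
        vectors[doc_id] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors
```

With this weighting, a word that shows up in every entry gets a weight of zero, so it stops contributing to the similarity score, while words that are rare across the blog count for more.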