Blog posts tagged "discovery"

Similar Entries, an update

March 22nd, 2003

Crazy day running hither and yon, but I did spend a little more time hacking on similar entries. Stripping HTML made a huge difference in accuracy. However, very short entries were all flagged as being very similar (which is true I suppose, they are all a blurb and some text, but this isn’t very useful information). Term weighting as suggested in this LSI intro seems to have raised accuracy dramatically (but required me to lower my threshold by an order of magnitude). I still think I can improve the way categories are treated. Also, I want to do something more intelligent with links than discarding their href info. No progress on surprising results. (Maybe I could also include the least similar item?)
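For readers who want to see what term weighting looks like in practice, here is a rough Python sketch. It is not the code running on this blog (that is Perl), and the weighting shown, log-scaled term frequency times inverse document frequency, is only one common scheme and an assumption here; the LSI intro may recommend something different. The function names are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters only; a deliberately crude tokenizer.
    return re.findall(r"[a-z]+", text.lower())


def weighted_vectors(docs):
    """Return one unit-length {term: weight} dict per document.

    Assumed scheme: (1 + log tf) * log(N / df). Terms that occur in
    every document get weight 0 and so stop contributing to similarity.
    """
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter()
    for c in counts:
        df.update(c.keys())  # each document counts once per term

    vectors = []
    for c in counts:
        vec = {t: (1 + math.log(tf)) * math.log(n / df[t]) for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(a, b):
    # Both vectors are already unit length, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


docs = [
    "perl vector search engine",
    "perl vector space search",
    "cooking pasta tonight",
]
vecs = weighted_vectors(docs)
related = cosine(vecs[0], vecs[1])
unrelated = cosine(vecs[0], vecs[2])
```

With a scheme like this, words shared by many entries count for very little, so scores tend to come out lower than with raw word counts. That would be consistent with having to drop the threshold after switching on weighting.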

In the process of doing this exercise I also noticed that I’m really bored with the content I’ve been writing of late. Ugh. I don’t want to read most of this stuff, what are you doing here? Did get me thinking though. You could use this same technology for a document browser, allowing you to flag blog entries as boring, thereby decreasing your chance of seeing similar entries.
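The "flag as boring" idea could be sketched roughly like this in Python. Everything here is hypothetical: the function names, the plain cosine measure, and the choice to sort by the closest boring match are assumptions for illustration, not something that has been built.

```python
import math


def cosine(a, b):
    # Cosine similarity between two sparse {term: weight} vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_interest(vectors, boring_ids):
    """Order unflagged entry ids from least to most similar to any flagged entry."""
    def boredom(entry_id):
        # How close is this entry to the nearest entry flagged as boring?
        return max(
            (cosine(vectors[entry_id], vectors[b]) for b in boring_ids),
            default=0.0,
        )

    candidates = [i for i in vectors if i not in boring_ids]
    return sorted(candidates, key=boredom)


vectors = {
    "a": {"perl": 1, "lsi": 1},
    "b": {"perl": 1, "lsi": 1, "mt": 1},
    "c": {"garden": 1},
}
order = rank_by_interest(vectors, {"a"})
```

Here entry "c" shares nothing with the flagged entry "a", so it sorts ahead of "b". A real browser would probably want a softer penalty than a hard sort.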

Also I’m running the script to find similar entries manually on my laptop, as some other website on the server seems to be getting a lot of traffic. So there might be lags and stuff.

Tagged: Uncategorized

Similar Entries

March 21st, 2003

Way back in the frozen, blighted wasteland that was January this year, I was playing with building a “Similar Pages” functionality to be part of the proposed IMC open editing/annotations project, in the hopes of getting around the much noted fact that people suck at metadata. In the process I discovered my Linear Algebra is more than a little rusty.

The great (and sometimes immensely frustrating) part of the web is that, if you wait long enough, someone else will often do all the hard work. And so it was. When I finally got around to reading Building a Vector Space Search Engine in Perl, the article by Maciej Ceglowski (lead troublemaker in the LSI revolution), I found all my work had been done for me, and laid out in a clean Perl module.

Blog as test bed

Well, an open editing server is still an idea on low simmer, but I’ve got a blog, so LM (and all you folks) are now the guinea pigs. Please check the individual entry archives to see what I’m talking about.

Other options

There have been several Related Entries plugins released for Movable Type, but I never found these very useful, as they either didn’t match how I use categories or would have required me to go back and add keywords to every entry. Besides, they involved not a single matrix transformation.

Hacking

I love object-oriented programming. Using Maciej’s Search::VectorSpace with MT was almost as simple as subclassing it and overriding get_words(). (In the end I had to make some minor tweaks to search() as well.) 40 minutes and 40 lines of code later I was happily building vectors of MT entries and calculating similarities. Building a script to call my new Search::VectorSpace::MT took significantly longer :)
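To illustrate the subclass-and-override pattern without reproducing the actual Perl, here is a toy Python sketch. The `VectorSpace` class below is a stand-in written for this example and is not the Search::VectorSpace API; the only thing borrowed from the real module is the idea of a `get_words()` hook. The entry fields (`title`, `body`) are assumptions too.

```python
import math
import re
from collections import Counter


class VectorSpace:
    """Toy vector-space index over plain strings (illustrative only)."""

    def __init__(self, docs):
        self.docs = docs
        # get_words() is the hook that subclasses override.
        self.vectors = [Counter(self.get_words(d)) for d in docs]

    def get_words(self, doc):
        # Default behaviour: the document is already a plain string.
        return re.findall(r"[a-z]+", doc.lower())

    def similarity(self, i, j):
        # Cosine similarity between the raw word-count vectors of docs i and j.
        a, b = self.vectors[i], self.vectors[j]
        dot = sum(n * b[t] for t, n in a.items())
        norm = math.sqrt(sum(n * n for n in a.values())) * math.sqrt(
            sum(n * n for n in b.values())
        )
        return dot / norm if norm else 0.0


class EntryVectorSpace(VectorSpace):
    """Hypothetical subclass: documents are dicts with 'title' and 'body'."""

    def get_words(self, entry):
        # Only the word extraction changes; indexing and scoring are inherited.
        return super().get_words(entry["title"] + " " + entry["body"])


entries = [
    {"title": "Similar Entries", "body": "vector space search"},
    {"title": "Similar Entries update", "body": "vector term weighting"},
    {"title": "Lunch", "body": "sandwich"},
]
index = EntryVectorSpace(entries)
```

Calling `index.similarity(0, 1)` compares the two "Similar Entries" posts, while `index.similarity(0, 2)` comes out as zero because the entries share no words.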

Next Steps

I wouldn’t call the code release-ready. Already I’m noticing some problems, like I should be stripping HTML before calculating similarity. (Should have been obvious.) Also, I haven’t yet jumped through the hoops to make this a Movable Type plugin, though I think maybe a cron’ed crawler, plus some liberal use of MT::PluginData, might be the best way to go. In the meantime I need to tweak it a little bit, find the right threshold, figure out the best way to respect the metadata I do have, etc, etc. If you see any particularly bad or particularly good results, let me know. Thanks.
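As a rough idea of what stripping HTML before building vectors involves, here is a small Python sketch using the standard library's `html.parser`. It is an illustration under assumptions, not the fix applied to the Perl code; the class and function names are invented, and the example URL is made up. It also collects link targets separately, in case they are worth using later rather than throwing away.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content and link targets, dropping the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Keep href values from anchors instead of discarding them.
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(markup):
    """Return (plain_text, hrefs) for a fragment of entry HTML."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join chunks with spaces and collapse whitespace. Note this also puts a
    # space wherever a tag sat inside a word, which is fine for tokenizing.
    text = " ".join(" ".join(parser.chunks).split())
    return text, parser.hrefs


text, hrefs = strip_html(
    '<p>Read the <a href="http://example.com/lsi">LSI intro</a> &amp; enjoy.</p>'
)
```

After this runs, `text` holds only the words a reader would see, so tag and attribute names no longer count as shared vocabulary between entries.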

update[2003/03/22]: yeah, it’s not working very well. i think i need to play with term weighting, that and strip that HTML.

Tagged: Uncategorized