Crazy day running hither and yon, but I did spend a little more time hacking on similar entries. Stripping HTML made a huge difference in accuracy. However, very short entries were all flagged as being very similar (which is true, I suppose, since they are all a blurb and some text, but this isn’t very useful information). Term weighting, as suggested in this LSI intro, seems to have raised accuracy dramatically (but required me to lower my threshold by an order of magnitude). I still think I can improve the way categories are treated. I also want to do something more intelligent with links than discarding their href info. No progress on surprising results. (Maybe I could also include the least similar item?)
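For the curious, here is a rough sketch of that pipeline in Python: strip the HTML, weight the terms, and compare entries by cosine similarity. It is not the actual script. The weighting shown is plain TF-IDF, which is an assumption on my part about what "term weighting" means here, and the names `strip_html`, `tfidf_vectors`, `similar_entries` and the `0.05` threshold are all made up for illustration.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes; tags and attributes (href included) are dropped."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    """Return only the visible text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_vectors(docs):
    """One {term: weight} dict per entry, weight = tf * log(N / df).

    TF-IDF is an assumed weighting scheme, not necessarily the one used above.
    """
    token_lists = [tokenize(strip_html(d)) for d in docs]
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))  # count each term once per entry
    n = len(docs)
    return [
        {t: c * math.log(n / df[t]) for t, c in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_entries(docs, threshold=0.05):
    """Return (i, j, score) for every pair scoring at or above the threshold."""
    vectors = tfidf_vectors(docs)
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            score = cosine(vectors[i], vectors[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

Terms that show up in every entry get a weight of zero under this scheme, which pulls all the scores down. That would be consistent with having to lower the threshold after adding weighting.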
In the process of doing this exercise I also noticed that I’m really bored with the content I’ve been writing of late. Ugh. I don’t want to read most of this stuff; what are you doing here? It did get me thinking, though. You could use this same technology for a document browser that lets you flag blog entries as boring, thereby decreasing your chance of seeing similar entries.
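A minimal sketch of how that "flag as boring" idea might work, assuming a plain bag-of-words cosine similarity. The function names are hypothetical and nothing like this exists yet. Entries are simply sorted so that the ones most like something already flagged sink to the bottom.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    # Counter returns 0 for missing terms, so no key checks are needed.
    dot = sum(c * b[t] for t, c in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def rank_by_interest(entries, boring):
    """Order entries so those least like any flagged-boring entry come first."""
    boring_vecs = [bag_of_words(b) for b in boring]

    def boredom(entry):
        vec = bag_of_words(entry)
        # Score an entry by its closest match among the flagged ones.
        return max((cosine(vec, b) for b in boring_vecs), default=0.0)

    return sorted(entries, key=boredom)
```

Sorting rather than hiding keeps everything reachable, which seems safer than filtering outright while the similarity scores are still this rough.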
Also I’m running the script to find similar entries manually on my laptop, as some other website on the server seems to be getting a lot of traffic. So there might be lags and stuff.