Well-done text summarization is almost as useful (and cool) as machine learning, and Ted has had some great stuff on text summarization lately. First came libots, an open source text summarizer that comes out of the AbiWord project, and the OS X summarization service. That was followed up by Classifier4J, a Java implementation of a Bayesian classifier, which also includes a summarization engine. So how do they compare?

Classifier4J

Nick, the author of Classifier4J, compares the output of the OS X summarization service with that of Classifier4J. C4J compares quite favorably, I think, providing better context. In particular, I spent a while this afternoon pasting my own blog entries into TextEdit, and the OS X summarization service seems to have some bias against the first sentence of a paragraph, the sentence I was taught to call the “topic sentence” in elementary school.

OS X Summarization

The OS X summarization service is available from TextEdit, Mail.app, or any other Cocoa app. It pops up a neat app containing your selected text and a slider for the length of the summary. The summaries are pretty good, but leave a bit to be desired. However, the real killer is that this useful service doesn’t have an API! Oh sure, we could probably script it with AppleScript, but why not make it available right next to the spellchecking API? I would be happy to be wrong about this.

Open Text Summarizer

Interested in how OTS stacked up against these other two (and whether perhaps I wanted to write some Perl bindings for it), I attempted to install it on my laptop. Without success. pkg-info kept complaining that I didn’t have libpopt 1.5 installed; correct, I had 1.6 installed. I futzed with it a bit and gave up. If anyone has tips on getting OTS to build on OS X, that would be great. I did, however, install it on my Linux server, and here is the output for the same blog entry Nick used.

ots -r50 sample.txt

John Robb linked to DEVONthink which is a free form information manager for MacOS X. It takes a less structured approach than Chandler is trying to take. It looks like you just dump all your information in there and turn it’s recognizers loose and it sorts it all out for you. One thing that I noticed while reading the pages is that Mac OS X has a text summarization service built in. If one of you MacOS X hackers can confirm this, that would be great.

ots -r30 sample.txt

John Robb linked to DEVONthink which is a free form information manager for MacOS X. It takes a less structured approach than Chandler is trying to take. It looks like you just dump all your information in there and turn it’s recognizers loose and it sorts it all out for you.

ots -r10 sample.txt

John Robb linked to DEVONthink which is a free form information manager for MacOS X. It takes a less structured approach than Chandler is trying to take.

ots -a sample.txt

Article talks about “DEVONThink” “information” and “MacOS”

I can’t say I’m all that impressed by the OTS summaries; it seemed to be simply grabbing the first xx% of the article and using that as the summary. Perhaps OTS only works on longer documents? Perhaps its word counting algorithm doesn’t work so well? (When I get around to playing with C4J first hand, I can compare its algorithm; OS X’s, however, shall remain closed to us, alas.) But I do like the one-line summary feature.
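To make the “word counting” idea concrete, here is a rough Python sketch of a frequency-based extractive summarizer. To be clear, this is not the actual OTS or C4J algorithm, just my guess at the general shape of the technique; the function names, the tiny stopword list, and the scoring rule are all made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); a real tool would use a much larger one.
STOPWORDS = {"a", "an", "the", "is", "it", "to", "of", "and", "in", "that", "for", "on"}

def split_sentences(text):
    # Naive split: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words(text):
    # Lowercased word tokens with stopwords dropped.
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def summarize(text, ratio=0.3):
    """Keep roughly `ratio` of the sentences, chosen by average word frequency."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    freq = Counter(words(text))  # word counts across the whole document

    def score(sentence):
        ws = words(sentence)
        return sum(freq[w] for w in ws) / len(ws) if ws else 0.0

    keep = max(1, int(len(sentences) * ratio))
    best = sorted(range(len(sentences)),
                  key=lambda i: score(sentences[i]), reverse=True)[:keep]
    # Emit the chosen sentences in their original order.
    return " ".join(sentences[i] for i in sorted(best))
```

Something like `summarize(open("sample.txt").read(), ratio=0.3)` would be the rough analogue of `ots -r30 sample.txt`. On a short entry where the opening sentences introduce most of the repeated words, a scheme like this could easily end up favoring the top of the article.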

Uses?

Well, I confess, I’m an RSS geek, and the first thing I thought of using a text summarization service for was better entry/article excerpts to stick in the description field. It would be useful, though, for almost any CMS/web publishing tool that displays a list of content. An interesting overlap is that OTS uses word occurrences as the basis for its algo, much like the LSI-alike vector space searching I use to generate the similarity listings on this blog. I wonder if they could be driven from the same datastore?
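As a sketch of what “the same datastore” might look like: keep one table of word counts per entry, and use it both for similarity listings (cosine similarity between count vectors) and for picking out an entry’s most frequent terms. This is plain vector space matching, not real LSI, and not the code that runs this blog; every name below is hypothetical.

```python
import math
import re
from collections import Counter

def term_counts(text):
    # One word-count vector per entry; this is the shared "datastore" record.
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two word-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def build_store(entries):
    # entries: dict mapping an entry name to its text
    return {name: term_counts(text) for name, text in entries.items()}

def most_similar(name, store):
    # The other entry whose word counts are closest to `name`'s.
    others = [n for n in store if n != name]
    return max(others, key=lambda n: cosine(store[name], store[n]))

def top_terms(name, store, n=3):
    # The same counts can feed a summarizer or an "article talks about" line.
    return [term for term, _ in store[name].most_common(n)]
```

The point is only that both features read from the same per-entry counts, so they could plausibly share storage.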