Text Summarization

July 28th, 2003

Well-done text summarization is almost as useful (and cool) as machine learning, and Ted has had some great stuff on text summarization lately. First came libots, an open source text summarizer that comes out of the AbiWord project, and the OS X summarization service. Then came Classifier4J, a Java text classification library built around a Bayesian classifier, which also includes a summarization engine. So how do they compare?

Classifier4J

Nick, the author of Classifier4J, compares the output of the OS X summarization service with that of Classifier4J. C4J compares quite favorably, I think, providing better context. In particular, I spent a while this afternoon pasting my own blog entries into TextEdit, and the OS X summarization service seems to have some bias against the first sentence of a paragraph, the sentence I was taught to call the “topic sentence” in elementary school.

OS X Summarization

The OS X summarization service is available from TextEdit, Mail.app, or any other Cocoa app. It pops up a neat app containing your selected text, and a slider for the length of the summary. The summaries are pretty good, but leave a bit to be desired. However, the real killer is that this useful service doesn’t have an API! Oh sure, we could probably script it with AppleScript, but why not make it available right next to the spellchecking API? I would be happy to be wrong about this.

Open Text Summarizer

Interested in how OTS stacked up against these other two (and whether perhaps I wanted to write some Perl bindings for it), I attempted to install it on my laptop, without success. pkg-info kept complaining that I didn’t have libpopt 1.5 installed, which is correct: I had 1.6 installed. I futzed with it a bit, then gave up. If anyone has tips on getting OTS to build on OS X, that would be great. I did, however, install it on my Linux server, and here is the output for the same blog entry Nick used.

ots -r50 sample.txt

John Robb linked to DEVONthink which is a free form information manager for MacOS X. It takes a less structured approach than Chandler is trying to take. It looks like you just dump all your information in there and turn it’s recognizers loose and it sorts it all out for you. One thing that I noticed while reading the pages is that Mac OS X has a text summarization service built in. If one of you MacOS X hackers can confirm this, that would be great.

ots -r30 sample.txt

John Robb linked to DEVONthink which is a free form information manager for MacOS X. It takes a less structured approach than Chandler is trying to take. It looks like you just dump all your information in there and turn it’s recognizers loose and it sorts it all out for you.

ots -r10 sample.txt

John Robb linked to DEVONthink which is a free form information manager for MacOS X. It takes a less structured approach than Chandler is trying to take.

ots -a sample.txt

Article talks about “DEVONThink” “information” and “MacOS”
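If you want to repeat the runs above without retyping them, here is a minimal Python sketch that shells out to ots. It only uses the two flags shown in this post (-rNN for the ratio and -a for the "about" line); the function names are my own, and I am assuming an ots binary on the PATH that behaves like the one I ran here.

```python
import subprocess


def ots_command(path, ratio=None, about=False):
    """Build an ots command line using only the flags shown in the post."""
    cmd = ["ots"]
    if about:
        cmd.append("-a")            # one-line "Article talks about ..." output
    if ratio is not None:
        cmd.append(f"-r{ratio}")    # summary ratio in percent, e.g. -r50
    cmd.append(path)
    return cmd


def run_ots(path, ratio=None, about=False):
    """Run ots and return its stdout (assumes ots is installed and on PATH)."""
    result = subprocess.run(
        ots_command(path, ratio=ratio, about=about),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Same sweep as above: 50%, 30%, 10%, then the "about" line.
    for ratio in (50, 30, 10):
        print(f"--- ots -r{ratio} sample.txt")
        print(run_ots("sample.txt", ratio=ratio))
    print("--- ots -a sample.txt")
    print(run_ots("sample.txt", about=True))
```

Building the argument list separately from running it makes it easy to check the exact command before anything is executed.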

I can’t say I’m all that impressed by the OTS summaries; it seemed to be simply grabbing the first xx% of the article and using that as the summary. Perhaps OTS only works on longer documents? Perhaps its word counting algorithm doesn’t work so well? (When I get around to playing with C4J first hand, I can compare its algorithm; OS X’s, however, shall remain closed to us, alas.) But I do like the one-line summary feature.
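To make the "word counting" idea concrete, here is a minimal sketch of a frequency-based extractive summarizer in Python. This is not OTS's actual code: the tiny stop-word list, the scoring rule (average frequency of a sentence's content words), the naive sentence splitter, and every function name are assumptions of mine for illustration. The point is that a summarizer of this kind ranks sentences by score, so it should not necessarily return the opening of the article.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not OTS's dictionary).
STOP_WORDS = {
    "a", "an", "and", "the", "is", "it", "of", "to", "in", "that",
    "for", "my", "was", "this", "be", "than", "which",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lower-cased words of a sentence with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]


def summarize(text, ratio=0.2):
    """Keep roughly `ratio` of the sentences, chosen by word-frequency score."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        ws = content_words(sentence)
        return sum(freq[w] for w in ws) / len(ws) if ws else 0.0

    keep = max(1, round(len(sentences) * ratio))
    # Rank sentence indexes by score, best first, and keep the top ones.
    best = sorted(range(len(sentences)),
                  key=lambda i: score(sentences[i]),
                  reverse=True)[:keep]
    # Emit the chosen sentences in their original order.
    return [sentences[i] for i in sorted(best)]


if __name__ == "__main__":
    sample = ("My laptop was slow today. Summaries need keywords. "
              "Keywords drive summaries and keywords rank sentences.")
    for line in summarize(sample, ratio=0.34):
        print(line)
```

On a short entry with few repeated words, nearly every sentence scores about the same, and a stable sort then falls back to document order. That would look exactly like "grab the first xx%", which may be part of what I saw above, though that is a guess about OTS rather than something I have verified.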

Uses?

Well, I confess, I’m an RSS geek, and the first thing I thought of using a text summarization service for was better entry/article excerpts to stick in the description field. It would be useful, though, for almost any CMS/web publishing tool that displays a list of content. An interesting overlap is that OTS uses word occurrences as the basis for its algo, much like the LSI-alike vector space searching I use to generate the similarity listings on this blog. I wonder if they could be driven from the same datastore?
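As a rough illustration of the shared-datastore idea, here is a small Python sketch in which one table of per-entry word counts feeds both a keyword list (the raw material for a frequency-based summary) and a similarity ranking. It uses plain term-frequency vectors and cosine similarity, not real LSI (there is no SVD step), and it is not the code behind this blog's similarity listings. The store layout and all names are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Word-occurrence counts for one entry."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def build_store(entries):
    """Hypothetical datastore: entry id -> Counter of word occurrences."""
    return {entry_id: term_counts(text) for entry_id, text in entries.items()}


def top_keywords(store, entry_id, n=3):
    """Most frequent words of an entry, as a summarizer might use them."""
    return [word for word, _ in store[entry_id].most_common(n)]


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def most_similar(store, entry_id):
    """Id of the other entry whose word counts are closest to entry_id's."""
    others = [other for other in store if other != entry_id]
    return max(others, key=lambda other: cosine(store[entry_id], store[other]))


if __name__ == "__main__":
    store = build_store({
        "a": "text summarization text",
        "b": "summarization of text",
        "c": "perl bindings",
    })
    print(top_keywords(store, "a", n=1))
    print(most_similar(store, "a"))
```

Both features read the same counters, so the expensive part (tokenizing and counting every entry) would only need to happen once per entry.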


4 responses to “Text Summarization”

  1. […] Open Text Summarizer is both a library and a command line tool (developed by Nadav Rotem) that, well, summarises text. It is similar to the functionality incorporated into Microsoft Word and available in all native Mac OS X applications. The approach taken by OTS is to use word frequency to prepare a list of keywords and assign priority to sentences based on that frequency. It then outputs a summarised version of your text based on a ratio you supply —the default is 20%, i.e. the summary will be one-fifth the size of the original in terms of number of sentences. An automated process like this can never be perfect, and some texts are more amenable to auto-summarising than others. The reliance on sentences means that a well structured prose text works best, and that it should be somewhat substantial to produce meaning. Auto-summaries can be used as a basis for abstracts or catalogue descriptions, for article summaries in RSS feeds, or for checking keyword frequency for Search Engine Optimisation. Shorter texts, lists, and internally incoherent or structurally inconsistent texts will tend to produce gibberish —which can have its own amusement value. While the performance of OTS may not quite be up to the standards of proprietary alternatives (see this 2003 review), it is —as far as I was able to determine— the only available free or open source (specifically GPL) library for this purpose. […]

  2. jitendra bansal says:

    hello, i am doing a project on automatic text summarization. Can you send me some good algorithm code which gives a meaningful summary? if u can send me code of a summarization algorithm then please send me. thank you.

  3. akanksha says:

    helloo….even m doin a project in text summarisation….kindly help me with the algorithms or codes if u can….thank u

  4. Girma says:

    Hi there, Open Text Summarizer is available in C. Its architecture is not mentioned; can anyone help me in accessing its documentation or any academic publication on it?