Linux.com

Feature: Desktop Software

Condensing with Open Text Summarizer

By Bruce Byfield on December 15, 2008 (7:00:00 PM)

Properly speaking, Nadav Rotem's Open Text Summarizer (OTS) is not a summarizer at all. A true summary rewords content at a higher level of generality while preserving its meaning; OTS simply produces a condensed version of the original. Within those limits, however, OTS is an efficient tool for automatically producing abstracts of non-fiction. In the last 15 months, it has received favorable mention from at least four academic publications, including one in which it outperformed similar utilities, among them commercial ones such as Copernic and Subject Search Summarizer.

OTS is available as a command-line utility in Debian, Fedora, Gentoo, Mandriva, and Ubuntu packages. It is also available as a plugin in the latest versions of AbiWord. A gedit plugin is also being prepared, according to Rotem.

OTS removes common words, such as articles like "the" or "a" or conjunctions like "and" and "but," from consideration by using a dictionary list that accompanies the utility. Conversely, words that occur most frequently in the text are assumed to be the topic, while the sentences that have the highest percentage of the most frequently occurring words are the ones that are used in the output.
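The approach described above can be sketched in a few lines of Python. This is only an illustration of frequency-based extraction, not the code OTS itself uses: the stop-word list, the function names, the top_n cutoff, and the rounding of the ratio are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; OTS reads its own from a dictionary file.
STOP_WORDS = {"the", "a", "an", "and", "but", "or", "of", "to", "in", "is", "it"}

def tokenize(text):
    """Lowercase the text and return its words."""
    return re.findall(r"[a-z']+", text.lower())

def summarize(sentences, ratio=0.2, top_n=5):
    """Return the highest-scoring sentences, in their original order."""
    # Count every word that is not on the excluded list.
    counts = Counter(
        w for s in sentences for w in tokenize(s) if w not in STOP_WORDS
    )
    # Treat the most frequent remaining words as the topic.
    topic = {w for w, _ in counts.most_common(top_n)}

    def score(sentence):
        words = tokenize(sentence)
        if not words:
            return 0.0
        # Fraction of the sentence's words that are topic words.
        return sum(w in topic for w in words) / len(words)

    # Assumption: round to the nearest sentence and keep at least one.
    keep = max(1, round(len(sentences) * ratio))
    ranked = sorted(
        range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True
    )
    return [sentences[i] for i in sorted(ranked[:keep])]
```

Calling summarize(list_of_sentences, ratio=0.3) on a pre-split text returns roughly 30% of its sentences. Because the selected sentences are copied verbatim, the result is a condensation rather than a reworded summary.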

For greater accuracy, OTS also references grammatical rules, so that it does not assume, for instance, that the period used to indicate an abbreviation marks the end of a sentence. Similarly, OTS uses the Porter Stemming algorithm so that variants of the same word, such as "run," "runs," and "running," are grouped together in the frequency count. (Porter Stemming works by stripping suffixes, so an irregular form such as "ran" is still counted separately.) According to Rotem, Porter Stemming is about 90% accurate, which in turn makes OTS more accurate.
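As a rough illustration of the abbreviation rule, the following Python sketch splits text into sentences but refuses to break after a known abbreviation. The abbreviation list and the function name are hypothetical; OTS takes its own rules from its dictionary, and they are likely more thorough than this.

```python
# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc.", "vs."}

def split_sentences(text):
    """Split on ., ! or ? unless the token ending there is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?" and token.lower() not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences
```

A simple list like this misses a sentence that really does end in "etc.", which is one reason a real implementation needs more than a lookup table.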

Using Open Text Summarizer

You can use the command-line version of OTS for plain text files, including HTML files, although the output for HTML files inevitably includes tags. A complete man page is available on the project site, but the Debian package, at least, does not include it, which means that you have to rely on the command ots -? or ots --help to see the options.

The basic command, ots inputfile, prints the output to the terminal. If you prefer, you can save the output to a file with ots --out=outputfile inputfile.

By default, the output file is 20% the length of the input file, based on the number of sentences in the input file. You can use the --ratio=percentage option to adjust the length of the output.

Adding the option --html produces output in HTML. If you want keywords to use as meta tags, note that the --keyword option is deprecated; --about gives much the same result.

You can change the default dictionary of excluded words using --dic=filename. Unless you are an expert in the field, you are unlikely to improve on the dictionary installed with OTS, but you might want to exclude words specific to an area of expertise that you know are unlikely to be the topic of your input passages.
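If you call OTS from a script, a small helper can assemble the options covered so far. The Python sketch below only builds the argument list; it uses the flag names exactly as given in this article, so check them against ots --help on your installed version. The function name and the file names are made up for the example.

```python
import shlex

def build_ots_command(input_file, out=None, ratio=None, html=False,
                      about=False, dic=None):
    """Assemble an ots argument list from the options described in the article."""
    cmd = ["ots"]
    if ratio is not None:
        cmd.append(f"--ratio={ratio}")   # output length as a percentage
    if dic:
        cmd.append(f"--dic={dic}")       # alternative dictionary of excluded words
    if html:
        cmd.append("--html")             # HTML output
    if about:
        cmd.append("--about")            # keyword-style output
    if out:
        cmd.append(f"--out={out}")       # write to a file instead of the terminal
    cmd.append(input_file)
    return cmd

# Show the command line that would be run (file names are placeholders).
print(shlex.join(build_ots_command("article.txt", out="summary.txt", ratio=30)))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting problems with file names that contain spaces.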

With the AbiWord plugin, you have fewer options, but the procedure is simple: select Tools -> Summarize, choose the percentage length of the output, and the result opens in a new, unnamed file.

The results

Whatever the form in which you use OTS, the usefulness of the result depends partly on the content of your input file. In general, OTS works well with academic articles and news stories, making it a useful tool for those who need to write abstracts of the sort seen on portal Web sites or annotated bibliographies. You might want to tweak the results to provide a true summary rather than a condensation, but, even so, using OTS requires less time and involves less active thinking than writing a summary from scratch.

With other content, OTS is less successful. In my testing, its results are only fair with fiction, probably because the repetition in fiction does not necessarily indicate the important points. For the same reason, bullet lists of unorganized points do not always condense successfully, and, if you try to summarize a song, a chorus will often be featured in the output at the expense of the content of verses.

Outside of these limitations, Open Text Summarizer performs satisfactorily. It certainly compares favorably to the AutoAbstract feature in OpenOffice.org Writer, which is based -- rather pointlessly, so far as accurate results are concerned -- on style heading levels. So long as you are aware of its limitations, and check the results before you use them, OTS is a minor but useful addition to the arsenal of free software tools.

Bruce Byfield is a computer journalist who writes regularly for Linux.com.

Comments

on Condensing with Open Text Summarizer

Condensing with Open Text Summarizer

Posted by: Anonymous [ip: 209.30.144.213] on December 15, 2008 08:58 PM
Summarized at a 30% ratio:

Properly speaking, Nadav Rotem's Open Text Summarizer (OTS) is not a summarizer at all. True summaries generally involve rewording contents at a higher level of generality while preserving the meaning, not just producing a condensed version of the original the way that OTS does. However, within its limits, OTS is an efficient tool for automatically producing abstracts of non-fiction, that, in the last 15 months, has received favorable mention from at least four academic publications, including one in which it outperformed similar utilities, including commercial ones such as Copernic and Subject Search Summarizer. You can use the command-line version of OTS for plain text files, including HTML files, although the output for HTML files inevitably includes tags. If you prefer, you can save the output to a file with ots --out=outputfile inputfile. By default, the output file is 20% the length of the input file, based on the number of sentences in the input file. With the AbiWord plugin, you have fewer options, but all you need to do is select Tools -> Summarize, and choose the percentage length of the output file, and the result is entered into a new, unnamed file. You might want to tweak the results to provide a true summary rather than a condensation, but, even so, using OTS requires less time and involves less active thinking than writing a summary from scratch.

#

Debian is becoming conspicuously silly

Posted by: Anonymous [ip: 82.192.250.149] on December 15, 2008 09:15 PM
"A complete man page is available on the project site, but the Debian package, at least, does not include it"

Debian is notorious for making up its own rules about what is "free enough", and excludes a lot of useful material as a result. I'm all for software freedom, but we have to develop a consensus about what is "free enough" and what is not. It does not make sense for every little group to have its own incompatible set of rules.

As far as most of us are concerned, the Free Software Foundation fulfils the role of an informal standards body in this matter. If the FSF, with its legal advisors, comes up with the GPL or the GFDL then I think most of us are willing to see those as "standards". They might not be 100% perfect, but we live in an imperfect world, and cooperating with people is about compromising and getting along.

When Debian tries to kick over the traces and ram some other version of freedom down our throats then, frankly, I think Debian needs to have some common sense hammered into its collective head.

Coming back to the topic, OTS: I can't see why even the most arrogant Debian purist would find anything to object to in the license for the OTS man page. We can use it, modify the technical information, give away or sell direct or modified copies, etc. But then, I don't take too much notice of Debian's criteria for Free Documentation.

#

Re: Debian is becoming conspicuously silly

Posted by: Anonymous [ip: 85.29.96.232] on December 16, 2008 05:32 PM
The version of OTS that shipped with Debian Etch (0.4.2+cvs.2004.02.20-1.1) still had the man page. I don't see in the changelog anything about removing the man page, so I think it's just an unintentional packaging bug that has nothing to do with Debian's criteria for free documentation.
http://packages.debian.org/changelogs/pool/main/o/ots/ots_0.5.0-2/changelog

In general, Debian has the best documentation I've seen in any distro. If some application or command lacks a man page, the Debian package maintainers usually write the missing man pages themselves. If Debian removes some documentation from its main package repository, they provide this documentation as a separate package in their non-free repository. Debian developers care for software freedom -- this is the reason why Debian exists and it's what makes Debian special. If you don't care about software freedom, then Debian is clearly not the ideal distro for you.

But, I repeat, the lack of the man page in the OTS package that Lenny ships looks to me like an unintentional packaging bug. I'm pretty sure that if someone reported this bug, it would soon be fixed.

#

Nice try

Posted by: Anonymous [ip: 82.192.250.149] on December 16, 2008 08:37 PM
"If you don't care about software freedom, then Debian is clearly not the ideal distro for you."

That's a straw man, irrelevant to anything I wrote. If you had read my post more carefully, you would have noticed that I care deeply about software freedom.

That's precisely why Debian's silliness annoys me.

#

Re: Nice try

Posted by: Anonymous [ip: 85.29.102.121] on December 17, 2008 09:40 AM
Sigh. You just don't get it, do you? This "Debian's silliness" nonsense you keep ranting about only exists inside your own head and nowhere else. Most likely the removal of the man page from the OTS package in Debian Lenny was a mistake that the package maintainer made inadvertently. If it had been done on purpose, it would have been mentioned in the changelog.

#
