Linux.com

Feature

KDE 4's Sonnet will turbocharge language processing

By Nathan Sanders on February 07, 2007 (8:00:00 AM)

With the Sonnet library for KDE 4, developer Jacob Rideout hopes to reinvigorate the field of desktop linguistics by adding automatic language detection and other innovative features. Sonnet is to be for KDE 4 what KSpell 2 is for the current version of the K Desktop Environment, providing spellchecking facilities to applications as diverse as the Konqueror Web browser, Kopete instant messenger, and KWord office software. Unlike KSpell, however, it will also provide grammar checking, multilingual tools, and perhaps even translation, dictionary, and thesaurus functionality across all of KDE.

KDE 4 may take even the spellchecking feature a step further than it has in the past. Rideout says, "There is currently a discussion in the mailing list on enabling spellcheck for every textedit box in KDE." Sonnet will also provide text statistics such as word counts and readability scores. Rideout hopes to eventually implement automatic text completion as well.
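As an illustration of what such text statistics might look like, the following sketch computes a word count and a Flesch Reading Ease score. The article does not say which readability formula Sonnet uses, so the choice of Flesch, the function names, and the crude vowel-group syllable counter are all assumptions made for this example.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count runs of vowels (a heuristic, not a linguistic rule)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_statistics(text):
    """Return word count, sentence count, and Flesch Reading Ease for English text."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        # Not enough text to compute a score.
        return {"words": len(words), "sentences": len(sentences), "flesch": None}
    syllables = sum(count_syllables(w) for w in words)
    # Flesch Reading Ease: higher scores indicate easier text.
    flesch = (206.835
              - 1.015 * (len(words) / len(sentences))
              - 84.6 * (syllables / len(words)))
    return {"words": len(words), "sentences": len(sentences), "flesch": flesch}
```

Calling `text_statistics("The cat sat. The dog ran.")` reports six words in two sentences and a very high (very easy) readability score, as expected for short one-syllable words.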

Because Sonnet is a library accessible to all KDE applications, Rideout foresees applications beyond text editing programs. Its language detection feature is particularly ripe for unexpected usage. Sonnet is capable of determining the language a text is written in given about 20 characters of data. This feature already works for several dozen languages. According to Rideout, the Strigi desktop search developers are considering integrating language detection into their application's search features. Perhaps users will, one day, be able to search for "documents written in Spanish within the past week."

Rideout, who recently earned his bachelor's degree in linguistics, says that improved multilingual support is the "most requested change" from KDE 3, and that this is where language detection has the most potential. He says, "Users will be able to have documents checked for correctness in a fine-grained manner. Any separate section of a document (by default, this means a paragraph) will be checked in its respective language by the tools available for that language. For convenience, each section will have its language detected automatically, with the option of a user disabling or overriding the detection."

Without Sonnet's language detection, a French user who frequently corresponds with British associates must manually change the language library used by his spellchecker each time he switches languages. In KDE 4, Sonnet will automatically notice that he has begun typing in another language and check for spelling errors accordingly. In a more complex scenario, the user may quote paragraphs from an English speaker within an email he is writing in French. In KDE 3, the English portion will be interpreted by KSpell 2 as horribly misspelled French and continuously underlined in red. With Sonnet, the English text will automatically be spellchecked in its own language, just as the French text is in its own.

"Language detection in Sonnet was initially based on a Perl script named Languid created by Maciej Ceglowski," Rideout reports. "I ported the Perl script to C++ and have been regularly modifying it so that, while it shares the same algorithmic approach, it is no longer a direct port.... Sonnet was initially 'based on' Languid, but never used any of its code, and now has diverged in several significant ways.... [Languid's] author has kindly granted KDE a license using the LGPL so that our derivative could be maintained and distributed as part of KDE's libraries."

Languid is fundamentally based on a technique called "N-Gram-Based Text Categorization" published by William Cavnar and John Trenkle. An N-gram is a segment of text made of N characters. Sonnet uses trigrams, made from three characters. By analyzing the frequency of each trigram within a text, one can infer which language the text is most likely written in. Rideout gives an example: "The top trigram for our English model is '_th' and for Spanish '_de'. Therefore, if the text contains many words that start with 'th' and no words that start with 'de,' it is more likely the text is in English [than Spanish]. Additionally, there are several optimizations which include only checking the language against languages with similar scripts and some heuristics that use the language of neighboring text as a hint."
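The trigram approach can be sketched in a few lines. This is not Sonnet's or Languid's code: it is a minimal illustration of the Cavnar and Trenkle "out-of-place" ranking idea, in which the underscore marks a word boundary. The two tiny training sentences, the `PROFILE_SIZE` constant, and all function names are assumptions chosen for the example; a real model would be trained on far more text.

```python
from collections import Counter

PROFILE_SIZE = 300  # max trigrams kept per profile; also the penalty for a missing trigram


def trigrams(text):
    """Yield character trigrams, padding each word with '_' as a boundary marker."""
    for word in text.lower().split():
        padded = "_" + word + "_"
        for i in range(len(padded) - 2):
            yield padded[i:i + 3]


def profile(text):
    """Return the trigrams of text ordered from most to least frequent."""
    counts = Counter(trigrams(text))
    # Break ties alphabetically so the ranking is deterministic.
    return sorted(counts, key=lambda g: (-counts[g], g))[:PROFILE_SIZE]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; trigrams absent from the model cost PROFILE_SIZE."""
    rank = {g: i for i, g in enumerate(lang_profile)}
    return sum(abs(i - rank[g]) if g in rank else PROFILE_SIZE
               for i, g in enumerate(doc_profile))


# Toy training text (hypothetical); real models use large corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the other thing that they think",
    "es": "el perro de la casa de mi madre come de todo y despues duerme cerca de la puerta",
}
MODELS = {lang: profile(text) for lang, text in SAMPLES.items()}


def detect(text):
    """Return the language whose model is closest to the text's trigram profile."""
    doc = profile(text)
    return min(MODELS, key=lambda lang: out_of_place(doc, MODELS[lang]))
```

Even with these toy models, the top-ranked trigrams come out as `_th` for English and `_de` for Spanish, matching Rideout's example, and `detect("they think that the thing is there")` picks English.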

Sonnet checks each paragraph of a text individually for its language, though Rideout says that it would be possible to check per sentence. Paragraphs are used because larger sample sizes yield more accurate results, and because checking every sentence of a large document for language would be needlessly taxing.
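A rough sketch of per-paragraph detection follows. It assumes paragraphs are separated by blank lines, borrows the roughly 20-character minimum mentioned earlier as a threshold, and lets paragraphs shorter than that inherit the language of the preceding paragraph as a simple stand-in for the neighboring-text heuristic. The `toy_detect` function is a deliberately naive placeholder, not how Sonnet detects languages.

```python
def tag_paragraphs(text, detect, min_chars=20):
    """Return (language, paragraph) pairs, one per blank-line-separated paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    tagged = []
    previous = None
    for para in paragraphs:
        if len(para) >= min_chars or previous is None:
            lang = detect(para)
        else:
            # Too short to classify reliably: reuse the neighbor's language.
            lang = previous
        tagged.append((lang, para))
        previous = lang
    return tagged


def toy_detect(paragraph):
    """Placeholder detector: compare counts of a few common function words."""
    words = set(paragraph.lower().split())
    french = {"le", "la", "et", "je", "vous", "est"}
    english = {"the", "and", "is", "you", "of", "to"}
    return "fr" if len(words & french) > len(words & english) else "en"


EMAIL = (
    "The meeting is confirmed and you are welcome to join.\n\n"
    "Bonjour, je vous écris et la réunion est confirmée.\n\n"
    "Merci !"
)
LANGUAGES = [lang for lang, _ in tag_paragraphs(EMAIL, toy_detect)]
```

Here the short closing paragraph is tagged as French because it follows a French paragraph, even though the placeholder detector on its own would have nothing to go on.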

With Sonnet, Rideout says that a user may select a primary and backup dictionary rather than using language detection, in which case a word found to be misspelled in the primary language would be assumed to be of the backup language and spellchecked accordingly. This would be useful, for example, to doctors who must frequently use terms from a medical dictionary. Language detection offers yet more innovative functionality in the way of layout hints. For instance, a paragraph written in Hebrew (a language that is read from right to left) could be automatically right-aligned on the page.
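The primary and backup dictionary idea can be illustrated as a two-step lookup. The word lists and function names below are hypothetical, and real dictionaries would handle inflection and casing far more carefully than this sketch does.

```python
def classify_word(word, primary, backup):
    """Return which dictionary accepts the word, or 'misspelled' if neither does."""
    token = word.strip(".,;:!?\"'()").lower()
    if token in primary:
        return "primary"
    if token in backup:
        # Not in the primary language: assume it belongs to the backup dictionary.
        return "backup"
    return "misspelled"


def misspelled_words(text, primary, backup):
    """List the words that neither dictionary recognizes."""
    return [w.strip(".,;:!?\"'()") for w in text.split()
            if classify_word(w, primary, backup) == "misspelled"]


# Hypothetical toy dictionaries for the doctor example.
GENERAL = {"the", "patient", "has", "a", "mild", "fever", "and"}
MEDICAL = {"tachycardia", "pyrexia"}
```

With these lists, `misspelled_words("The patient has mild tachycardia and a fevr.", GENERAL, MEDICAL)` flags only "fevr", because "tachycardia" is accepted by the backup dictionary.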

To do all this computational linguistics in the background without disturbing or slowing the user interface, Sonnet uses KDE's ThreadWeaver technology. By intelligently dividing execution jobs into different threads via ThreadWeaver, Sonnet can perform language detection and spellchecking on a document without interrupting a user's typing. Rideout is among the early adopters of ThreadWeaver. He says, "On single processor systems the speed might sometimes be slower than the old KSpell code in theory, but not on a user-perceivable scale. On multiprocessor systems, the speed increases greatly. The ease of development is also substantially less than with other approaches."
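The general pattern of pushing checks onto worker threads can be sketched with Python's standard `concurrent.futures` module. This is only an analogy for the job-queue idea and does not reflect ThreadWeaver's actual API; the word list and function names are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical word list standing in for a real dictionary.
KNOWN_WORDS = {"sonnet", "checks", "each", "paragraph", "in", "the", "background"}


def misspelled(paragraph):
    """Return the words of one paragraph that are not in the word list."""
    return [w for w in paragraph.lower().split() if w not in KNOWN_WORDS]


def schedule_checks(pool, paragraphs):
    """Queue one job per paragraph and return immediately with the futures."""
    return [pool.submit(misspelled, p) for p in paragraphs]


def collect(futures):
    """Gather results in paragraph order once the caller is ready for them."""
    return [f.result() for f in futures]
```

A caller would create a `ThreadPoolExecutor`, call `schedule_checks`, continue handling input, and call `collect` later, so the checking work never runs on the thread that handles typing.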

Despite innovative features like language detection, Rideout is careful not to reinvent the wheel for Sonnet. For spellchecking, Sonnet uses a plugin system that will most likely defer to AbiWord's Enchant library. Enchant, in turn, defers spellchecking duties to a variety of standard spellchecking libraries such as Aspell. To complement Enchant, Rideout is developing a grammar-checking library to be titled Elixir. Elixir will serve as a common interface to several existing free-software grammar checkers, such as An Gramadóir and LanguageTool. He expects AbiWord to adopt Elixir once it is completed, and he is seeking FreeDesktop.org standardization of the plugin system to be used in both Enchant and Elixir. He envisions the major KDE and GNOME text editors using Enchant and Elixir directly, while OpenOffice.org and other editors will use the standards-compliant plugins.
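The layering described here, a front end that defers to whichever backend can handle a language, might look something like the following. The class names and the word-list backend are invented for illustration and do not correspond to the real Sonnet, Enchant, or Elixir interfaces.

```python
from abc import ABC, abstractmethod


class SpellBackend(ABC):
    """Hypothetical plugin interface that every backend implements."""

    @abstractmethod
    def supports(self, lang):
        """Return True if this backend has a dictionary for lang."""

    @abstractmethod
    def is_correct(self, word, lang):
        """Return True if word is spelled correctly in lang."""


class WordListBackend(SpellBackend):
    """Toy backend backed by in-memory word sets, keyed by language code."""

    def __init__(self, name, wordlists):
        self.name = name
        self._wordlists = wordlists

    def supports(self, lang):
        return lang in self._wordlists

    def is_correct(self, word, lang):
        return word.lower() in self._wordlists[lang]


class Speller:
    """Front end that defers each request to the first backend supporting the language."""

    def __init__(self, backends):
        self._backends = list(backends)

    def backend_for(self, lang):
        for backend in self._backends:
            if backend.supports(lang):
                return backend
        raise LookupError("no backend supports language: " + lang)

    def is_correct(self, word, lang):
        return self.backend_for(lang).is_correct(word, lang)
```

Applications talk only to `Speller`, so backends can be added, removed, or reordered without changing application code, which is the main appeal of a shared plugin system.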

Genesis

The Sonnet project began with a blog post by Zack Rusin, a Trolltech developer with about half a decade's worth of KDE experience, and the principal developer of KSpell, Sonnet's predecessor. In May of 2006, Rusin proposed a "full linguistic framework" for KDE in his blog. He spoke of augmenting the one standard feature of desktop linguistics, spellchecking, with support for grammar, dictionary, thesaurus, and translation tools.

Rusin works on the Qt toolkit from which KDE is built, and thus focuses his work primarily on various aspects of computer graphics, not linguistics. He says, "Linguistics is fascinating and for some reasons there's not a whole lot of people who'd want to deal with it, at least not as far as its desktop usage goes."

At the time Rusin proposed Sonnet, Jacob Rideout had only minimal experience with KDE development, having contributed a few bug reports and patches. But Rideout was putting his linguistics degree to good use developing Phrasis, a "stripped-down text editor" designed for writers on the KDE platform.

It seemed necessary to Rideout that his text editor for writers have grammar correction ability. "I looked for a Qt/KDE wrapper for a grammar checker under the GPL that I could integrate into Phrasis and found none in KDE, but did find a plugin for Abiword. I adapted their wrapper to Qt and informed the KDE developers in case they wanted to use it. Several KDE developers, especially those working on KOffice, were excited, but there was no one willing to do the work at that time. So, after doing some research and talking to people like Zack [Rusin], I started hacking." Rideout now has his own SVN account with the KDE project and has set Phrasis aside to develop Sonnet. Rusin is continuing on with the project in an advisory capacity.

According to Rideout, "Everything is in a 75% done stage in terms of function." So far, it seems that Rideout's greatest barrier has been learning to program in a multi-threaded environment. He says, "My development methodology has traditionally been to jump in and start coding, perhaps 10-20% of the needed code. Then, I step back and examine my goals and design a proper solution. The code is refactored and the architecture is implemented.... Essentially I take the same approach to code as with prose -- write many drafts and be a remorseless self-editor." Rideout says that this programming style has suited him well in the past, but that it "can create a mess" in a multi-threaded environment. He says that "most of the work being done currently, while publicly available in the KDE Subversion repository, has yet to be fully reviewed by the veteran KDE gurus."

Rideout says that translation is a low-priority feature for Sonnet that may not be present for the release of KDE 4, though he would like to implement it in the future and hopes that others will sign on to aid the effort. Similarly, he does not list dictionary or thesaurus tools under the "core components [that] will be ready for the 4.0 release [of KDE]."

Like much of KDE 4, Sonnet promises to bring significant change to a prevailing feature of desktop operating systems. Multilingual users will see the bulk of the improvement, thanks to automatic language detection, but others may still enjoy grammar correction, text statistics, and future features such as translation, dictionary, and thesaurus functionality.

Comments

on KDE 4's Sonnet will turbocharge language processing

spell checkers

Posted by: Anonymous Coward on February 08, 2007 06:03 AM
a project involving a one to one map of word translation may be an efective intermediary to a fullfleged laguage interpreter. A person may wish to type in one languange and see what it looks like in another language. this could include morse code pig laten whatever. just one more way to involve people and make things interesting. bulding vocabulary could be a usfull way to learn passivly while chatting or just using ones computer. this mite extend to passive learning in general. selecting topics presented in a fun fact type venue. with the users permision there mite be a feature that permited a third party or member of a chat group to translate the content in another language. these scripts could be placed into either an open source or comercial sponsored dictionary of phrases that could be used as an interpreter geting closer and closer to an accurate translation or group of likly refrences a shorthand could then be used as an interpreter mite select the best meaning rather than interpreting everything from scratch.

#

Re:spell checkers

Posted by: Anonymous Coward on February 09, 2007 06:49 AM
Unfortunately, you seem to know very little about machine translation. (Not to mention the disaster that passes for your spelling would make it even more of a dream...)

Word for word translation often fails, especially when moving from a highly inflected language (in which word endings give a great deal of grammatical information) to a less-inflected one or back again. For example, Russian to English.

The better translation engines that use an approach similar to what you speak of have also a large dictionary of phrases and some grammar rules included. In addition, you run into many problems with multiple word choices for any single given word in the originating language. Software often must "guess" which meaning is intended, or else the "translation" quickly becomes rather useless.

FYI, I carry on a regular correspondence with several Russian speakers, using machine translation. That has, over the past few years, given me a certain amount of experience using various software programs that run into these issues all the time.

#

Re:spell checkers -pig latin etc

Posted by: Anonymous Coward on February 09, 2007 08:07 AM
For language variants (à la pig latin) you can use the program pig (in the bsdgames package in Debian), and for other more complex things try the GNU talkfilters (Google's your friend).

#

Dependencies?

Posted by: Anonymous Coward on February 08, 2007 09:10 PM
Standards are good.

It'd be interesting to know how many dependencies sonnet has. I certainly don't want to install half the KDE libs to get it to work on my Gnome desktop...

Enchant looks like it has few or no Gnome-specific dependencies; that's a good thing.

#

Re:Dependencies?

Posted by: Anonymous Coward on February 17, 2007 08:13 AM
Sonnet is part of kdelibs, and is for use in KDE applications. Unfortunately that means Gnome won't be able to use it. Elixir (grammar checking) will be available without dependencies, so Gnome can use it. The other features of Sonnet will be easy for Gnome to emulate or port.

#

Praise to KDE

Posted by: Anonymous Coward on February 09, 2007 12:10 AM
I'm currently a Gnome user, but unlike KDE there don't seem to be any great plans for the upcoming major release of Gnome. Since I'm already using and appreciating amaroK, I will probably consider switching to KDE once version 4 is out. What do you guys think about that? Don't you feel that Gnome is beginning to lag behind? Or does it just not get that much media coverage?

#

Re:Praise to KDE

Posted by: Anonymous Coward on February 09, 2007 05:05 PM
Honestly, Gnome has always been behind KDE due to its technical shortcomings. The version 4 of Qt is better than ever, and I feel the gap is only going to get wider between the two major desktops.

I've also been a Gnome user for a long time, but there's no denying KDE is a better desktop for the future and I'm almost sure I'll switch to it once KDE4 is released.

#

Re:Praise to KDE

Posted by: Anonymous Coward on February 12, 2007 06:34 AM
Actually there's already a project to add a spellchecking library to glib, GSpell, which is going to provide pretty much all the functionality described in this article, but, as usual, nobody knows anything until one day they upgrade to the new Gnome release and notice the new features. And, of course, they're going to say that it's a "KDE rip-off" even if KDE4 isn't out.

KDE has better marketing/propaganda, this is undeniable.

#

Re:Praise to KDE

Posted by: Anonymous Coward on February 20, 2007 03:58 PM
Sorry, but GSpell is only a GObject-based Enchant integration and it doesn't support any of the features mentioned in this article.

#

Nifty

Posted by: Anonymous Coward on February 09, 2007 05:39 AM
Spellchecking in all applications, that sounds pretty nifty.

#

Language detection and desktop search

Posted by: Administrator on February 09, 2007 10:31 AM
Reliable language detection is already available.
Libtextcat 2.2 (http://software.wise-guys.nl/libtextcat/) implements the same technique discussed in the article, and supports 60+ languages. Version 3.0 will support more languages and have better encoding detection.
As for desktop search, Pinot (http://pinot.berlios.de/) relies on libtextcat to identify the language of documents for stemming and filtering at search time. For instance, the search string "lang:es" will return documents in Spanish. Date ranges have been recently implemented so the next version will allow to search for "documents written in Spanish within the past week."

#
