A long time ago (2006) a cool guy named Jacob Rideout started work on automated language detection in the KDE spell checking framework Sonnet. Unfortunately he never finished it, and while Jakub Stachowski gave it another shot in 2009, it never got merged into a released version of Sonnet.
But early last year, I got tired of Quassel giving me constant red underlining in all my Norwegian IRC channels, and decided to finish the code, clean it up, and get it merged. In the process I seem to have turned into the maintainer of Sonnet.
At a very high level, the language detection scheme currently works like this:
First, it looks at all the characters in the sentence it wants to guess the language for. Thanks to QChar, we can easily find which writing system/script each character belongs to, and that allows us to filter out a bunch of languages. The list of possible languages from this is sorted by longest substring; this means that if you write a sentence in one script (for example Latin) and include a single word in another (for example Cyrillic), it will consider the languages written in Latin script first.
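To make the script step a bit more concrete, here is a small Python sketch. It is not Sonnet's code (that is C++ and asks QChar for the script); as a rough stand-in it assumes the first word of a character's Unicode name ("LATIN", "CYRILLIC", ...) is good enough to identify the script. The function names are made up for this example.

```python
import unicodedata


def script_of(ch):
    """Rough script guess: the first word of the Unicode character name.

    This is only an approximation of what QChar reports in Sonnet.
    """
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def scripts_by_longest_run(text):
    """Return the scripts found in text, longest unbroken run first."""
    longest = {}
    current, run = None, 0
    for ch in text:
        script = script_of(ch)
        if script is None:
            # Spaces, digits and punctuation neither extend nor break a run.
            continue
        if script == current:
            run += 1
        else:
            current, run = script, 1
        longest[script] = max(longest.get(script, 0), run)
    return sorted(longest, key=longest.get, reverse=True)
```

For a mostly English sentence with one Russian word in it, this puts "LATIN" ahead of "CYRILLIC", so the languages written in Latin script would be tried first.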
Unfortunately, a script such as Latin (which I use here in this blog post) is shared by a bunch of languages, so we need a way to efficiently guess which language a string is in. The idea, which Mr. Rideout originally borrowed from a Perl script named “languid”, is to generate a list of the most common three-letter strings (trigrams) for each language, and then use those lists to guess which language the string most likely is.
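The trigram idea can be sketched in a few lines of Python. To be clear about the assumptions: the scoring below is the classic "out-of-place" rank distance, which I picked for illustration and which may differ from what Sonnet actually computes; the two sample sentences are toy data standing in for real per-language trigram lists; and all names are invented for the example.

```python
from collections import Counter


def trigrams(text):
    """All three-character substrings, with whitespace normalised and padded."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_profile(sample, size=300):
    """Map each of the most common trigrams to its rank (0 = most common)."""
    counts = Counter(trigrams(sample))
    return {tri: rank for rank, (tri, _) in enumerate(counts.most_common(size))}


def distance(text, profile, size=300):
    """Sum of rank differences; a trigram missing from the profile costs up to `size`."""
    text_profile = build_profile(text, size)
    return sum(abs(rank - profile.get(tri, size))
               for tri, rank in text_profile.items())


def guess_language(text, profiles):
    """Pick the language whose profile is closest to the text."""
    return min(profiles, key=lambda lang: distance(text, profiles[lang]))


# Toy training data (assumption: real profiles come from large corpora).
PROFILES = {
    "en": build_profile("the quick brown fox jumps over the lazy dog "
                        "and the cat sleeps in the sun"),
    "nb": build_profile("jeg er ikke sikker på om det er en god idé "
                        "å gjøre det på denne måten"),
}
```

With these toy profiles, guess_language("det er ikke en god idé", PROFILES) picks "nb" and guess_language("the cat and the dog", PROFILES) picks "en". With real data the profiles are built once, ahead of time, so guessing only costs a few lookups per trigram.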
Finally, if we haven’t been able to narrow it down to a good guess so far, we resort to brute force: we check every word against every available dictionary, and the dictionary that recognizes the most words is used.
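The brute-force fallback is simple enough to show as well. In this Python sketch each dictionary is just a set of known words, which is an assumption made to keep the example self-contained; in Sonnet the check goes through the actual spell checker backends. The function name and the tiny word lists are hypothetical.

```python
def best_dictionary(text, dictionaries):
    """Return the language whose dictionary recognises the most words."""
    # Strip common punctuation and lowercase each word before the lookup.
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    scores = {
        lang: sum(word in known for word in words)
        for lang, known in dictionaries.items()
    }
    return max(scores, key=scores.get)


# Tiny stand-in dictionaries (assumption: real ones hold many thousands of words).
DICTIONARIES = {
    "en": {"the", "cat", "is", "on", "table"},
    "nb": {"katten", "er", "på", "bordet"},
}
```

Here best_dictionary("Katten er på bordet.", DICTIONARIES) returns "nb". This checks every word against every dictionary, which is why it only makes sense as the last resort.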
This is only available in the Frameworks (5) version of Sonnet, however, so if you want this, port your applications to Qt5 and Frameworks. :-D
Other cool stuff I plan on implementing is grammar checking, readability scoring and text completion. For grammar checking I plan on using linkgrammar for the languages it supports (unfortunately not many, but it isn’t that hard to add support for new languages), and re-using the data files from the LanguageTool for OpenOffice: keep the XML files as they are as much as possible, and rewrite the Java snippets in JavaScript. I also want to use this language detection in Baloo, so that it can automatically tag the language of the files it indexes (it was originally integrated into Strigi as well).