A long time ago (2006) a cool guy named Jacob Rideout started working on automated language detection for Sonnet, the KDE spell checking framework. Unfortunately he never finished it, and while Jakub Stachowski gave it another shot in 2009, it never made it into a released version of Sonnet.
But early last year I got tired of Quassel giving me constant red underlining in all my Norwegian IRC channels, so I decided to finish the code, clean it up, and get it merged. And in the process I seem to have turned into the maintainer of Sonnet.
At a very high level, the language detection scheme currently works like this:
First, it looks at all the characters in the sentence it wants to guess the language of. Thanks to QChar, we can easily find which writing system (script) each character belongs to, and that lets us filter out a bunch of languages right away. The list of candidate scripts is sorted by the length of the longest run of text written in each; this means that if you write a sentence in one script (for example Latin) and then have a single word in, say, Cyrillic, it will consider the Latin-script languages first.
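The real code does this in C++ with QChar::script(); here's a rough Python sketch of the idea, with a tiny hand-rolled range table standing in for the full Unicode script data (the ranges and function names are illustrative, not Sonnet's):

```python
# A tiny, illustrative mapping from Unicode code-point ranges to scripts.
# (Real script detection covers all of Unicode; QChar::script() does this in Qt.)
SCRIPT_RANGES = [
    (0x0041, 0x024F, "Latin"),
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
]

def script_of(ch):
    """Return the script of a character, or None for spaces/punctuation."""
    cp = ord(ch)
    for lo, hi, name in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return None

def scripts_by_run_length(text):
    """Find the longest run of characters in each script (ignoring spaces and
    punctuation) and return the scripts sorted with the longest run first."""
    longest = {}
    current_script, run = None, 0
    for ch in text:
        s = script_of(ch)
        if s is None:
            continue  # spaces/punctuation don't break or extend a run
        if s == current_script:
            run += 1
        else:
            current_script, run = s, 1
        longest[s] = max(longest.get(s, 0), run)
    return sorted(longest, key=longest.get, reverse=True)
```

So for a mostly-Norwegian sentence with one Cyrillic word, `scripts_by_run_length` puts Latin first and Cyrillic-script languages get considered only afterwards.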
Unfortunately, a script like Latin (the one I'm using in this blog post) is shared by a whole bunch of languages, which means we still need a way to efficiently guess which of them a string is written in. The idea, which Mr. Rideout originally borrowed from a Perl script named “languid”, is to generate a list of the most common three-letter sequences (trigrams) for each language, and then compare the trigrams of the input string against those lists to guess which language it most likely is.
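A minimal Python sketch of the trigram approach follows. The scoring here is the “out-of-place” rank distance from Cavnar & Trenkle's classic n-gram classification paper, which languid-style detectors are based on; Sonnet's exact scoring may differ, and all names here are illustrative:

```python
from collections import Counter

def trigrams(text):
    """All three-character substrings of the text, lowercased."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]

def build_profile(sample, size=300):
    """Rank the most common trigrams in a training sample, most frequent first."""
    counts = Counter(trigrams(sample))
    return [t for t, _ in counts.most_common(size)]

def score(text, profile):
    """Lower is better: sum over the text's trigrams of how far out of place
    each one is in the language profile (missing trigrams get a max penalty)."""
    max_penalty = len(profile)
    return sum(profile.index(t) if t in profile else max_penalty
               for t in set(trigrams(text)))

def guess_language(text, profiles):
    """Pick the language whose trigram profile matches the text best."""
    return min(profiles, key=lambda lang: score(text, profiles[lang]))
```

In practice the profiles are built once from large corpora per language and shipped with the spell checker, so guessing is just a cheap lookup-and-compare per input string.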
Finally, if we haven’t been able to narrow it down to a good guess by then, we fall back to brute force and simply test with all available dictionaries: we check every word against every dictionary, and the dictionary that recognizes the most words wins.
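That fallback is easy to sketch; in this toy version a plain word set stands in for a real spell-checker backend like Hunspell (the names and the naive punctuation stripping are my own, not Sonnet's):

```python
def detect_by_dictionary(text, dictionaries):
    """Check every word against every available dictionary and return the
    language whose dictionary recognizes the most words.

    `dictionaries` maps a language code to a set of known words, standing in
    for a real spell-checker backend in this sketch."""
    words = [w.strip(".,!?").lower() for w in text.split()]

    def hits(lang):
        return sum(1 for w in words if w in dictionaries[lang])

    return max(dictionaries, key=hits)
```

This is the slowest path, which is why it only runs when the script and trigram steps haven't already produced a confident answer.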
This is only available in the Frameworks (5) version of Sonnet, however, so if you want this, port your applications to Qt5 and Frameworks. :-D