Automated language detection

The KDE spell checking library.

A long time ago (2006) a cool guy named Jacob Rideout started work on automated language detection in the KDE spell checking framework Sonnet. Unfortunately he never finished it, and while Jakub Stachowski gave it another shot in 2009, it never got merged into a released version of Sonnet.

But early last year, I got tired of Quassel giving me constant red underlining in all my Norwegian IRC channels, so I decided to finish the code, clean it up, and get it merged. In the process I seem to have turned into the maintainer of Sonnet.

At a very high level, the language detection scheme currently works like this:

First, it looks at all the characters in the sentence whose language it wants to guess. Thanks to QChar, we can easily find which writing system (script) each character belongs to, and that lets us filter out a bunch of languages. The resulting list of possible languages is sorted by longest substring; this means that if you write a sentence in one script (for example Latin) and include a single word in another (for example Cyrillic), it will consider Latin-script languages first.
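As a rough illustration of this first step (this is not the Sonnet code), here is a small Python sketch. Python has no direct equivalent of QChar's script lookup, so the sketch assumes the first word of each character's Unicode name ("LATIN", "CYRILLIC", ...) is a good enough stand-in. The `SCRIPT_LANGUAGES` table and the function names are made up for the example.

```python
import unicodedata

# Hypothetical script-to-language table, for illustration only.
SCRIPT_LANGUAGES = {
    "LATIN": ["en", "nb", "de"],
    "CYRILLIC": ["ru", "uk"],
}

def script_of(ch):
    """Rough script name for a letter, or None for non-letters.

    Assumption: the first word of the Unicode character name
    approximates the script.
    """
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None

def scripts_by_longest_run(text):
    """Return the scripts found in text, longest contiguous run first."""
    longest = {}
    current, run = None, 0
    for ch in text:
        script = script_of(ch)
        if script is None:
            continue  # spaces, digits and punctuation neither break nor extend a run
        if script == current:
            run += 1
        else:
            current, run = script, 1
        longest[script] = max(longest.get(script, 0), run)
    return sorted(longest, key=lambda s: longest[s], reverse=True)

def candidate_languages(text):
    """Candidate languages, ordered by how dominant their script is."""
    candidates = []
    for script in scripts_by_longest_run(text):
        candidates.extend(SCRIPT_LANGUAGES.get(script, []))
    return candidates
```

For a mostly Latin sentence containing one Cyrillic word, `candidate_languages` lists the Latin-script languages before the Cyrillic ones.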

Unfortunately, a script like Latin (which I use here in this blog post) is shared by a bunch of languages, which means we need a way to efficiently guess which language a string is in. The idea, which Mr. Rideout originally borrowed from a Perl script named “languid”, is to generate a list of the most common three-letter strings (trigrams) for each language, and then use that to guess which language the string most likely is.
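To make the trigram idea concrete, here is a minimal Python sketch. The post doesn't describe how Sonnet scores a string against the trigram lists, so the sketch assumes a simple "out-of-place" rank comparison: trigrams are ranked by frequency, and a language's distance grows with every trigram that is ranked differently or missing. The function names, `MISSING_PENALTY` and the tiny training sentences are all invented for the example; real profiles would be built from much larger texts.

```python
from collections import Counter

MISSING_PENALTY = 300  # assumed cost for a trigram the language profile lacks

def trigrams(text):
    """All three-character substrings, with whitespace normalised and padded."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def build_profile(sample, size=300):
    """Map each of the most common trigrams in sample to its frequency rank."""
    ranked = Counter(trigrams(sample)).most_common(size)
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}

def distance(text_profile, language_profile):
    """Sum of rank differences; unknown trigrams cost MISSING_PENALTY."""
    total = 0
    for gram, rank in text_profile.items():
        if gram in language_profile:
            total += abs(rank - language_profile[gram])
        else:
            total += MISSING_PENALTY
    return total

def guess_language(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    text_profile = build_profile(text)
    return min(profiles, key=lambda lang: distance(text_profile, profiles[lang]))

# Toy training data, for illustration only.
PROFILES = {
    "en": build_profile("the quick brown fox jumps over the lazy dog and "
                        "the cat is in the house with the other animals"),
    "nb": build_profile("den raske brune reven hopper over den late hunden "
                        "og katten er i huset med de andre dyrene"),
}
```

With these toy profiles, `guess_language("the dog is in the house", PROFILES)` picks `"en"` and `guess_language("katten er i huset", PROFILES)` picks `"nb"`.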

Finally, if we haven’t been able to narrow it down to a good guess so far, we fall back to brute force: we check every word against every available dictionary, and the dictionary that recognizes the most words is used.
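The brute-force fallback is simple enough to sketch in a few lines of Python. Here each "dictionary" is assumed to be a plain set of known words; a real spell checking backend would be queried word by word instead. The names and the tiny word lists are hypothetical.

```python
import re

def count_recognized(words, dictionary):
    """Number of words the given dictionary knows."""
    return sum(1 for word in words if word in dictionary)

def guess_by_dictionaries(text, dictionaries):
    """Return the language whose dictionary recognizes the most words."""
    words = re.findall(r"\w+", text.lower())
    return max(dictionaries,
               key=lambda lang: count_recognized(words, dictionaries[lang]))

# Tiny stand-in dictionaries, for illustration only.
DICTIONARIES = {
    "en": {"the", "cat", "is", "in", "house"},
    "nb": {"katten", "er", "i", "huset"},
}
```

Every word is tested against every dictionary, so this costs far more than the script and trigram steps, which is presumably why it comes last.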

This is only available in the Frameworks (5) version of Sonnet, however, so if you want this, port your applications to Qt5 and Frameworks. :-D

Other cool stuff I plan on implementing is grammar checking, readability scoring and text completion. For grammar checking I plan on using linkgrammar for the languages it supports (unfortunately not many, but it isn’t that hard to add support for new languages), and re-using the data files from LanguageTool for OpenOffice: use the XML files as-is as much as possible, and rewrite the Java snippets in JavaScript. I also want to use this language detection in Baloo, so that it can automatically tag the language of the files it indexes (it was originally integrated into Strigi as well).


12 thoughts on “Automated language detection”

  1. A simpler approach would be selecting the language according to your current keyboard layout.

    1. This is not accurate enough. I cannot switch keyboard layout when I write in English, for example. Moreover, I hope the system is also able to detect changes of language “inside” the same text (for example when writing an email).

      1. It tags every sentence with the language it thinks it is. The sentence splitter should work for all writing systems, but as I’m not really fluent in many languages that don’t use Latin characters, I can’t test it very extensively.

  2. This is pretty cool, I remember the blog post back in 2006 and it sounded quite exciting :). Took a few years to come into reality, but better late than never!

  3. Eh khm, this needs a careful choice of words…

    OMFGI’veBeenWaitingForAgesForThis.ILoveYouInAPurelyPlatonicWay.CanIHaveYourCodeBabies!!

    Em, I feel better now, thanks ☺

    But seriously, you just made my day even better than it already was – I was hopefully waiting for this feature ever since it was announced way back and this would help me a ton! Every day I write in about 2-4 languages and this means that spell-checking is a huge pain in the butt – especially in programs like Konversation, KTP and for me most importantly KMail. At least in KMail I found a workaround through duplicated identities, but changing those is just a slightly lesser pain than changing the spellchecker language directly.

    I am very *very* much looking forward to this feature and I’m very hopeful for it ☺

  4. Hybrid models of unigrams, bigrams and trigrams work well for this. You may not have three consecutive terms that your model recognizes, but you may have one or two; this is where the unigrams and bigrams become helpful. I’ve also found extending the set of n-grams to include quadgrams beneficial.

  5. Hi, concerning grammar checking, a starting point could be to use the LIMA multilingual analyzer (https://github.com/aymara/lima/wiki). It has all the processing units necessary for natural language processing (including tokenizing, PoS tagging and parsing, but also named entities and semantic analysis). I’m also willing to help with integration, bug hunting, etc.

    I had a project a few years ago to use LIMA in Nepomuk for text processing before indexing, but it was not free at the time and we did not get the expected funding. Now it’s free and I can work on these tasks in my free time.

    LIMA is in C++, is partly Qt based (port to Qt5 will start very soon) and easy to integrate at API level or as a server.

      1. I forgot to mention that LIMA currently supports English and French, but we also have support, not free up to now, for Spanish, German, Arabic, Chinese, Russian and others. I don’t know if they will be freed someday, but it shows that it is possible.
        Also, parsing performance is not as high as it was in the commercial version as we had no time to completely adapt parsing rules to the new tagset. (See my paper at the LREC 2014 conference).
