If you aren’t actually reading a page, how do you identify the language it’s in?
This is a problem we have to think about a lot in our line of business. Knowing what language a page is in becomes vital when you are trying to spellcheck or root out any words or phrases that fall foul of the corporate style guide.
But if you’re analysing a site remotely, how do you know what language to check it against? Well, you could look at the encoding and headers on the page – but they lie.
And what if your site is in a country with more than one official language and all are present on the page? You just have to do one thing: take its fingerprint.
Now, that might just be one thing, but it’s not so easy. In fact, it takes quite a bit of research and effort to do.
First of all, you need a reference fingerprint for each language, something to compare against the text you find on websites. These fingerprints take the form of histograms, and the thing with existing language histograms is that they are closely guarded by the people who compile them. So, to fingerprint a language, you either need to obtain a histogram by breaking and entering, or build your own.
Building your own histogram involves trigrams: three-letter sequences. You take a large sample of text in a known language, count how often each trigram occurs and chart the counts on a histogram. The result is a characteristic fingerprint for that language, and you can then hold it against the text on your website to see if they match.
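To make the idea a little more concrete, here is a minimal Python sketch of a trigram fingerprint. It is an illustration only, not our production code: the function names, the text clean-up and the use of cosine similarity as the matching score are all assumptions made for the example, and the two reference samples are toy sentences where a real fingerprint would be built from a large body of text.

```python
from collections import Counter
from math import sqrt
import re


def trigram_histogram(text):
    """Return the relative frequency of each three-character sequence in text."""
    # Lowercase, then squash punctuation, digits and whitespace into single
    # spaces so that word boundaries still show up inside the trigrams.
    cleaned = re.sub(r"[\W\d_]+", " ", text.lower()).strip()
    counts = Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))
    total = sum(counts.values())
    return {tri: n / total for tri, n in counts.items()} if total else {}


def similarity(hist_a, hist_b):
    """Cosine similarity between two histograms (assumed scoring method)."""
    dot = sum(w * hist_b.get(tri, 0.0) for tri, w in hist_a.items())
    norm_a = sqrt(sum(w * w for w in hist_a.values()))
    norm_b = sqrt(sum(w * w for w in hist_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy reference samples (assumption: real fingerprints need far more text).
ENGLISH = trigram_histogram(
    "the quick brown fox jumps over the lazy dog and the cat sat on the mat"
)
SWEDISH = trigram_histogram(
    "den snabba bruna räven hoppar över den lata hunden och katten satt på mattan"
)

# Fingerprint some unknown text and see which reference it sits closest to.
page = trigram_histogram("The dog and the cat are on the mat.")
scores = {"en": similarity(page, ENGLISH), "sv": similarity(page, SWEDISH)}
best_match = max(scores, key=scores.get)
```

With these samples, best_match comes out as "en", because the unknown text shares far more trigrams with the English sample than with the Swedish one.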
And that’s what we did. We researched a number of languages, starting with Swedish, Norwegian, Italian, French, Dutch and Danish, as well as British, Canadian and US English, and built our own histograms, which we use for language detection across hundreds of thousands of websites around the world.
And although a lot of people choose to do language detection at a page level, at Magus we’ve gone for a slightly more innovative approach: we detect language by paragraph. So, if you’re using more than one language on your web page, we can detect each one and apply the correct spell check.
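As a rough illustration of what detecting by paragraph could look like, the Python sketch below splits a page on blank lines and matches each paragraph against a set of reference fingerprints separately. Everything in it is hypothetical: the helper names, the blank-line paragraph split, the cosine score and the two tiny reference samples are assumptions for the example rather than a description of how our own detector works.

```python
from collections import Counter
from math import sqrt
import re


def trigrams(text):
    """Count the three-character sequences in text, ignoring case and punctuation."""
    cleaned = re.sub(r"[\W\d_]+", " ", text.lower()).strip()
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram counts (assumed scoring method)."""
    dot = sum(n * b.get(tri, 0) for tri, n in a.items())
    norm = sqrt(sum(n * n for n in a.values())) * sqrt(sum(n * n for n in b.values()))
    return dot / norm if norm else 0.0


def detect_by_paragraph(page_text, fingerprints):
    """Return a (language, paragraph) pair for every non-empty paragraph."""
    results = []
    # Assumption: paragraphs are separated by blank lines in the extracted text.
    for paragraph in re.split(r"\n\s*\n", page_text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        hist = trigrams(paragraph)
        # Pick the closest fingerprint; if nothing matches at all, this simply
        # falls back to the first language in the dictionary.
        best = max(fingerprints, key=lambda lang: cosine(hist, fingerprints[lang]))
        results.append((best, paragraph))
    return results


# Toy fingerprints (assumption: real ones come from large text samples).
FINGERPRINTS = {
    "en": trigrams("the quick brown fox jumps over the lazy dog and the cat sat on the mat"),
    "fr": trigrams("le renard brun rapide saute par dessus le chien paresseux "
                   "et le chat est sur le tapis"),
}

page = "The dog and the cat sat on the mat.\n\nLe chien et le chat sont sur le tapis."
detected = detect_by_paragraph(page, FINGERPRINTS)
```

Each paragraph gets its own label, so a spellchecker could pick the right dictionary paragraph by paragraph instead of guessing one language for the whole page.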
Not quite as simple as sticking your thumb on an ink pad, but quite nifty, eh?