1.3.5 Languages per message

The most likely language of each message was determined in an iterative computational linguistic procedure based on character n-grams. The computational linguist looked for character patterns that are typical of a specific language/variant and assigned that language/variant to every word containing such a pattern. These token-level annotations were then compared across the whole message; the language/variant with the most support was the "winner" and was annotated as the most likely language/variant of the message.

As an example, the character sequence <iich> is unlikely to occur in any language or variant in the corpus other than Swiss German dialect. If many such patterns appear in the same message, Swiss German dialect is taken to be the most likely variant. If patterns identified as French predominate instead, the most likely language is French, and so on.
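The voting idea described above can be sketched roughly as follows. This is only an illustration, not the procedure actually used for the corpus: the pattern table, the function names and the whitespace tokenisation are assumptions, and only the pattern "iich" for gsw is taken from the text.

```python
from collections import Counter

# Hypothetical pattern table: character sequence -> language code.
# Only "iich" -> "gsw" comes from the documentation; the other
# entries are invented for illustration.
PATTERNS = {
    "iich": "gsw",
    "eau": "fra",
    "zione": "ita",
}


def token_language(token):
    """Return the code of the first pattern found in the token, or None."""
    lowered = token.lower()
    for pattern, lang in PATTERNS.items():
        if pattern in lowered:
            return lang
    return None


def most_likely_lang(message):
    """Count token-level annotations and return the code with most votes."""
    votes = Counter()
    for token in message.split():  # naive whitespace tokenisation (assumption)
        lang = token_language(token)
        if lang is not None:
            votes[lang] += 1
    if not votes:
        return None  # no known pattern in the message
    return votes.most_common(1)[0][0]
```

A message with two tokens matching "iich" and one matching "eau" would be labelled gsw by this sketch. A real procedure would need many more patterns, weighting, and a rule for ties, none of which are specified here.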

The result is stored in the annotation most_likely_lang and can be queried with, for example, most_likely_lang="gsw".

Available languages:

fra: French
ita: Italian
roh: Any variety of Romansh
gsw: dialectal German as used in Switzerland
deu: non-dialectal German
eng: English
spa: Spanish
sla: Any Slavic language

Romansh varieties:

roh-ja: Jauer Romansh
roh-sr: romontsch sursilvan
roh-st: rumàntsch sutsilvan
roh-sm: rumantsch surmiran
roh-pt: rumauntsch puter
roh-vl: rumantsch vallader
roh-gr: rumantsch grischun
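When working with exported annotation values, it can be convenient to group the Romansh variety codes under the general code roh. The following is a minimal sketch; the helper name base_language is hypothetical, and it is an assumption that variety codes of the form roh-xx occur as annotation values in exported data.

```python
# Variety codes as listed above.
ROMANSH_VARIETIES = {
    "roh-ja": "Jauer Romansh",
    "roh-sr": "romontsch sursilvan",
    "roh-st": "rumàntsch sutsilvan",
    "roh-sm": "rumantsch surmiran",
    "roh-pt": "rumauntsch puter",
    "roh-vl": "rumantsch vallader",
    "roh-gr": "rumantsch grischun",
}


def base_language(code):
    """Map a variety code such as 'roh-sr' to its base code 'roh'.

    Codes without a hyphen (e.g. 'gsw') are returned unchanged.
    """
    return code.split("-", 1)[0]
```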