1.3.5 Languages per message
The most likely language of each message was determined in an iterative computational-linguistic procedure based on character n-grams. The computational linguist looked for character patterns that are typical of a specific language or variant and assigned that language or variant to every word showing the pattern. These token-level annotations were then compared across the whole message, and the language or variant that prevailed was annotated as the most likely one for that message.
As an example, the character sequence <iich> is unlikely to occur in any language or variant in the corpus other than Swiss German dialect. If many such patterns appear in the same message, Swiss German dialect is taken to be the most likely language variant. If patterns identified as French predominate, the most likely language is French, and so on.
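The idea of labelling tokens by character patterns and then letting the message-level majority decide can be sketched roughly as follows. This is only an illustration, not the procedure actually used for the corpus: the pattern lists in PATTERNS are invented for the example (only <iich> comes from the text above), the function names are hypothetical, and the iterative refinement mentioned above is left out.

```python
from collections import Counter

# Hypothetical character patterns per language code. Only "iich" is taken
# from the documentation; all other patterns are made up for illustration.
PATTERNS = {
    "gsw": ["iich", "gsi", "üe"],
    "fra": ["eau", "oi", "ç"],
    "deu": ["sch", "ung"],
}


def guess_token_language(token):
    """Return the code whose patterns match the token most often, or None."""
    token = token.lower()
    scores = Counter()
    for lang, patterns in PATTERNS.items():
        hits = sum(1 for pattern in patterns if pattern in token)
        if hits:
            scores[lang] = hits
    if not scores:
        return None  # no pattern matched, so the token casts no vote
    return scores.most_common(1)[0][0]


def most_likely_lang(message):
    """Pick the language that wins the most token-level votes in a message."""
    votes = Counter()
    for token in message.split():
        lang = guess_token_language(token)
        if lang is not None:
            votes[lang] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


print(most_likely_lang("gliich gsi voila"))  # gsw
```

In this sketch, tokens without any matching pattern simply do not vote, and a message without any votes gets no label. How the real procedure treats such cases, or ties between languages, is not stated here.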
The information extracted in this way is saved in the annotation most_likely_lang and can thus be queried with, for example, most_likely_lang="gsw".
Available languages:
- fra: French
- ita: Italian
- roh: Any variety of Romansh
- gsw: dialectal German as used in Switzerland
- deu: non-dialectal German
- eng: English
- spa: Spanish
- sla: Any Slavic language
Romansh varieties:
- roh-ja: Jauer Romansh
- roh-sr: romontsch sursilvan
- roh-st: rumàntsch sutsilvan
- roh-sm: rumantsch surmiran
- roh-pt: rumauntsch puter
- roh-vl: rumantsch vallader
- roh-gr: rumantsch grischun
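When building queries in a script, it can help to keep the codes listed above in a small lookup table. The following is a minimal sketch under a few assumptions: the helper names describe, base_language and build_query are hypothetical, and build_query only accepts the codes under "Available languages", because this section does not say whether the Romansh variety codes also occur as values of most_likely_lang.

```python
# Codes and descriptions copied from the lists above.
LANGUAGES = {
    "fra": "French",
    "ita": "Italian",
    "roh": "Any variety of Romansh",
    "gsw": "dialectal German as used in Switzerland",
    "deu": "non-dialectal German",
    "eng": "English",
    "spa": "Spanish",
    "sla": "Any Slavic language",
}

ROMANSH_VARIETIES = {
    "roh-ja": "Jauer Romansh",
    "roh-sr": "romontsch sursilvan",
    "roh-st": "rumàntsch sutsilvan",
    "roh-sm": "rumantsch surmiran",
    "roh-pt": "rumauntsch puter",
    "roh-vl": "rumantsch vallader",
    "roh-gr": "rumantsch grischun",
}


def describe(code):
    """Return the description of a language or Romansh variety code."""
    if code in LANGUAGES:
        return LANGUAGES[code]
    if code in ROMANSH_VARIETIES:
        return ROMANSH_VARIETIES[code]
    raise KeyError(f"unknown language code: {code}")


def base_language(code):
    """Map a variety code such as 'roh-sr' to its base code 'roh'."""
    return code.split("-", 1)[0]


def build_query(code):
    """Build a query string of the form most_likely_lang="gsw"."""
    if code not in LANGUAGES:
        raise ValueError(f"not one of the available languages: {code}")
    return f'most_likely_lang="{code}"'


print(build_query("gsw"))  # most_likely_lang="gsw"
```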