01_corpus:02_preprocessing:04_languages
Differences
This shows you the differences between two versions of the page.
01_corpus:03_preprocessing:04_languages [2020/04/16 13:39] – simone | 01_corpus:02_preprocessing:04_languages [2020/05/04 13:51] – simone | ||
---|---|---|---|
Line 2: | Line 2: | ||
===== Languages and varieties per chat ===== | ===== Languages and varieties per chat ===== | ||
- | In order to assign a language tag to each chat, we looked at the first 250 messages and assigned two possible attributes per language: | + | In order to assign a language tag to each chat, we looked |
* lang_100_and_more: | * lang_100_and_more: | ||
Line 19: | Line 19: | ||
Please note: In the browsing tool ANNIS, we created sub-corpora per language, where each message appears in one and only one sub-corpus. In most cases, this is the language that provides more than 100 messages. If two languages each provide more than 100 messages, we prioritized them in an arbitrary fixed order: ROH > GSW > FRA > DEU > ITA > ENG/ | Please note: In the browsing tool ANNIS, we created sub-corpora per language, where each message appears in one and only one sub-corpus. In most cases, this is the language that provides more than 100 messages. If two languages each provide more than 100 messages, we prioritized them in an arbitrary fixed order: ROH > GSW > FRA > DEU > ITA > ENG/ | ||
- | If you want to work with all chats that contain a specific language in more than 100 messages, use the query msg & meta:: | + | If you want to work with all chats that contain a specific language in more than 100 messages, use the query '' |
For an overview of the languages and varieties in the corpus, consult: | For an overview of the languages and varieties in the corpus, consult: | ||
- | Ueberwasser, | + | Ueberwasser, |
- | ===== 1.3.5 Languages and varieties per message ===== | + | ===== Languages and varieties per message ===== |
- | In an iterative computational linguistic procedure based on n-grams, | + | The information of the main language of a message is saved in the annotation |
- | + | ||
- | As an example, the characters < | + | |
- | + | ||
- | The information extracted in this way is saved in the annotation most_likely_lang and can thus be queried with e.g. //most_likely_lang=" | + | |
Available languages: | Available languages: |
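
The sub-corpus rule in the "Please note" paragraph above (a language qualifies with more than 100 messages, and ties are resolved by a fixed priority order) can be sketched roughly as follows. This is only an illustration, not the procedure actually used for the corpus: the function name `assign_subcorpus` and the input format (one language code per message) are assumptions, and because the priority list is cut off after "ENG/" in the text, the sketch stops at ENG.

```python
from collections import Counter

# Priority order as given in the note. The source list is truncated after
# "ENG/", so any further entries are unknown and deliberately left out here.
PRIORITY = ["ROH", "GSW", "FRA", "DEU", "ITA", "ENG"]

# A language qualifies only with MORE than this many messages.
THRESHOLD = 100


def assign_subcorpus(message_langs):
    """Pick one sub-corpus language for a chat (hypothetical helper).

    message_langs: iterable of language codes, one per message.
    Returns the highest-priority qualifying language, or None if no
    language in PRIORITY exceeds the threshold.
    """
    counts = Counter(message_langs)
    qualifying = {lang for lang, n in counts.items() if n > THRESHOLD}
    for lang in PRIORITY:
        if lang in qualifying:
            return lang
    return None
```

For example, a chat with 120 DEU and 110 GSW messages would land in the GSW sub-corpus under this sketch, because GSW precedes DEU in the priority order.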
01_corpus/02_preprocessing/04_languages.txt · Last modified: 2022/06/27 09:21 by 127.0.0.1