01_corpus:02_preprocessing:04_languages
Differences
This shows you the differences between two versions of the page.
01_corpus:02_preprocessing:03_language_per_chat [2020/01/05 14:21] – simone | 01_corpus:02_preprocessing:04_languages [2020/05/04 13:49] – simone | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== | + | ====== |
- | In a first step and in order to provide adequate data to the researchers in the team, student helpers looked at every chat. Their approach was: | + | |
- | * Read through the individual | + | ===== Languages and varieties per chat ===== |
- | * If at this point, any other language/ | + | In order to assign |
- | * Mark the chat as containing one or more main languages (e.g. attribute: lang_100_and_more=" | + | |
- | * Take note of the other languages found in the course of this process (e.g. attribute: contains_eng=" | + | |
- | Available | + | * lang_100_and_more: |
+ | * lang_less_than_100: | ||
+ | |||
+ | for the following | ||
* fra: French | * fra: French | ||
* ita: Italian | * ita: Italian | ||
Line 17: | Line 17: | ||
* sla: Any Slavic language | * sla: Any Slavic language | ||
- | **Please note:** In the browsing tool ANNIS, we created | + | Please note: In the browsing tool ANNIS, we created sub-corpora per language, where each message appears in one and only one sub-corpus. In most cases, this is the language that delivers more than 100 chats. If there are two languages |
- | We thus considered the main language (as defined above) that provided the most messages to be the top main language and assigned it to the according subcorpus. | + | If you want to work with all chats that contain a specific language in more than 100 messages, use the query '' |
- | For an overview of languages in the corpus consult: | + | For an overview of languages |
Ueberwasser, | Ueberwasser, | ||
+ | |||
+ | |||
+ | ===== Languages and varieties per message ===== | ||
+ | The main language of a message is stored in the annotation // | ||
+ | |||
+ | Available languages: | ||
+ | * fra: French | ||
+ | * ita: Italian | ||
+ | * roh: Any variety of Romansh | ||
+ | * gsw: dialectal German as used in Switzerland | ||
+ | * deu: non-dialectal German | ||
+ | * eng: English | ||
+ | * spa: Spanish | ||
+ | * sla: Any Slavic language | ||
+ | |||
+ | Romansh varieties: | ||
+ | |||
+ | * roh-ja: Jauer Romansh | ||
+ | * roh-sr: romontsch sursilvan | ||
+ | * roh-st: rumàntsch sutsilvan | ||
+ | * roh-sm: rumantsch surmiran | ||
+ | * roh-pt: rumauntsch puter | ||
+ | * roh-vl: rumantsch vallader | ||
+ | * roh-gr: rumantsch grischun |
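
The language and variety codes listed above can be collected in a small lookup table, and the per-chat attributes can be approximated from per-message labels. The following is a minimal sketch under stated assumptions, not the procedure actually used for the corpus: the helper names (`describe`, `base_language`, `bucket_chat_languages`) are hypothetical, the input is assumed to be one language or variety code per message, and the cut-off is read as "more than 100 messages". The attribute name `lang_100_and_more` may instead imply "100 or more", so the threshold is kept as a parameter.

```python
from collections import Counter

# Codes and names copied from the lists above.
LANGUAGES = {
    "fra": "French",
    "ita": "Italian",
    "roh": "Any variety of Romansh",
    "gsw": "dialectal German as used in Switzerland",
    "deu": "non-dialectal German",
    "eng": "English",
    "spa": "Spanish",
    "sla": "Any Slavic language",
}
ROMANSH_VARIETIES = {
    "roh-ja": "Jauer Romansh",
    "roh-sr": "romontsch sursilvan",
    "roh-st": "rumàntsch sutsilvan",
    "roh-sm": "rumantsch surmiran",
    "roh-pt": "rumauntsch puter",
    "roh-vl": "rumantsch vallader",
    "roh-gr": "rumantsch grischun",
}


def describe(code):
    """Return the readable name for a language or Romansh variety code."""
    if code in ROMANSH_VARIETIES:
        return ROMANSH_VARIETIES[code]
    return LANGUAGES[code]


def base_language(code):
    """Map a variety code such as 'roh-sr' to its language code 'roh'."""
    return code.split("-", 1)[0]


def bucket_chat_languages(message_codes, threshold=100):
    """Split the languages of one chat into two sorted lists.

    Hypothetical helper. Assumption: a language belongs to the first
    list if it occurs in more than `threshold` messages, otherwise to
    the second list.
    """
    counts = Counter(base_language(code) for code in message_codes)
    main = sorted(lang for lang, n in counts.items() if n > threshold)
    minor = sorted(lang for lang, n in counts.items() if n <= threshold)
    return main, minor


# Invented example chat: 150 Swiss German, 20 Sursilvan, 3 English messages.
codes = ["gsw"] * 150 + ["roh-sr"] * 20 + ["eng"] * 3
print(bucket_chat_languages(codes))  # (['gsw'], ['eng', 'roh'])
```

Romansh varieties are folded into `roh` before counting, which matches the chat-level list where Romansh appears only as a single code. Whether the corpus counts varieties this way is an assumption.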
01_corpus/02_preprocessing/04_languages.txt · Last modified: 2022/06/27 09:21 by 127.0.0.1