01_corpus:02_preprocessing:04_languages
Differences
This shows you the differences between two versions of the page.
01_corpus:03_preprocessing:04_language_per_chat [2020/04/15 08:35] – ↷ Page moved from 01_corpus:02_preprocessing:04_language_per_chat to 01_corpus:03_preprocessing:04_language_per_chat simone | 01_corpus:02_preprocessing:04_languages [2022/06/27 09:21] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== 1.2.4 Language per chat ====== | + | ====== 1.2.4 Languages and varieties |
- | In a first step and in order to provide adequate data to the researchers in the team, student helpers looked at every chat. Their approach was: | + | |
- | * Read through the individual | + | ===== Languages and varieties per chat ===== |
- | * If at this point, any other language/ | + | In order to assign |
- | * Mark the chat as containing one or more main languages (e.g. attribute: lang_100_and_more=" | + | |
- | * Take note of the other languages found in the course of this process (e.g. attribute: contains_eng=" | + | |
- | Available | + | * lang_100_and_more: |
+ | * lang_less_than_100: | ||
+ | |||
+ | for the following | ||
* fra: French | * fra: French | ||
* ita: Italian | * ita: Italian | ||
Line 17: | Line 17: | ||
* sla: Any Slavic language | * sla: Any Slavic language | ||
- | **Please note:** In the browsing tool ANNIS, we created | + | Please note: In the browsing tool ANNIS, we created sub-corpora per language, where each message appears in one and only one sub-corpus. In most cases, this is the language that delivers more than 100 chats. If there are two languages |
+ | |||
+ | If you want to work with all chats that contain a specific language in more than 100 messages, use the query '' | ||
+ | |||
+ | For an overview of languages | ||
+ | Ueberwasser, | ||
+ | |||
+ | |||
+ | ===== Languages and varieties per message ===== | ||
+ | The main language of a message is stored in the annotation | ||
+ | |||
+ | Available | ||
+ | * fra: French | ||
+ | * ita: Italian | ||
+ | * roh: Any variety of Romansh | ||
+ | * gsw: dialectal German | ||
+ | * deu: non-dialectal German | ||
+ | * eng: English | ||
+ | * spa: Spanish | ||
+ | * sla: Any Slavic language | ||
- | We thus considered the main language (as defined above) that provided the most messages to be the top main language and assigned it to the according subcorpus. If you want to work with all chats that contain a specific language in more than 100 messages, use the query //msg & meta:: | + | Romansh varieties: |
- | For an overview over languages in the corpus consult: | + | * roh-ja: Jauer Romansh |
- | Ueberwasser, | + | * roh-sr: romontsch sursilvan |
+ | * roh-st: rumàntsch sutsilvan | ||
+ | * roh-sm: rumantsch surmiran | ||
+ | * roh-pt: rumauntsch puter | ||
+ | * roh-vl: rumantsch vallader | ||
+ | * roh-gr: rumantsch grischun |
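
The language codes listed above can be kept in a small lookup table when working with exported corpus metadata. The following is a minimal sketch, not part of the corpus tooling. The codes and names are taken from the lists on this page; the helper names (`base_language`, `describe`, `chats_with_main_language`) are hypothetical, and so is the assumption that a chat's `lang_100_and_more` attribute is available as a list of codes, because the actual value format is not shown here.

```python
from typing import Dict, Iterable, List

# Language codes as listed on this page.
LANGUAGES: Dict[str, str] = {
    "fra": "French",
    "ita": "Italian",
    "roh": "Any variety of Romansh",
    "gsw": "dialectal German",
    "deu": "non-dialectal German",
    "eng": "English",
    "spa": "Spanish",
    "sla": "Any Slavic language",
}

# Romansh variety codes as listed on this page.
ROMANSH_VARIETIES: Dict[str, str] = {
    "roh-ja": "Jauer Romansh",
    "roh-sr": "romontsch sursilvan",
    "roh-st": "rumàntsch sutsilvan",
    "roh-sm": "rumantsch surmiran",
    "roh-pt": "rumauntsch puter",
    "roh-vl": "rumantsch vallader",
    "roh-gr": "rumantsch grischun",
}


def base_language(code: str) -> str:
    """Return the three-letter base code, e.g. 'roh-sr' -> 'roh'."""
    return code.split("-", 1)[0]


def describe(code: str) -> str:
    """Return the name for a language or Romansh variety code.

    Raises KeyError for codes that are not listed on this page.
    """
    if code in ROMANSH_VARIETIES:
        return ROMANSH_VARIETIES[code]
    return LANGUAGES[code]


def chats_with_main_language(chats: Iterable[dict], code: str) -> List[dict]:
    """Select chats whose lang_100_and_more attribute contains the code.

    Assumption: each chat is a dict and lang_100_and_more is a list of
    language codes. The real attribute format may differ.
    """
    return [chat for chat in chats if code in chat.get("lang_100_and_more", [])]


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample = [
        {"id": "chat_a", "lang_100_and_more": ["gsw", "eng"]},
        {"id": "chat_b", "lang_100_and_more": ["fra"]},
    ]
    for chat in chats_with_main_language(sample, "gsw"):
        print(chat["id"], describe("gsw"))
```

Keeping the variety codes in a separate table from the base codes mirrors the two lists on this page and makes it easy to group all Romansh varieties under `roh` with `base_language`.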
01_corpus/02_preprocessing/04_languages.1586932548.txt.gz · Last modified: 2022/06/27 09:21 (external edit)