Differences
This shows you the differences between two versions of the page.
01_corpus:03_preprocessing:04_language_per_chat [2020/04/16 11:27] – simone | 01_corpus:02_preprocessing:04_languages [2022/06/27 09:21] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== 1.3.4 Language per chat ====== | + | ====== 1.2.4 Languages and varieties |
- | In order to assign a language tagging to each chat, student helpers read through the first 250 messages and assigned two possible attributes per language: | + | |
- | * lang_100_and_more | + | ===== Languages and varieties per chat ===== |
- | | + | To assign a language tag to each chat, we looked at the first 250 messages and assigned one of two possible attributes per language: | ||
- | Available | + | * lang_100_and_more: |
+ | * lang_less_than_100: | ||
+ | |||
+ | for the following | ||
* fra: French | * fra: French | ||
* ita: Italian | * ita: Italian | ||
Line 15: | Line 17: | ||
* sla: Any Slavic language | * sla: Any Slavic language | ||
- | **Please note:** In the browsing tool ANNIS, we created | + | Please note: In the browsing tool ANNIS, we created sub-corpora per language, where each message appears in one and only one sub-corpus. In most cases, this is the language that accounts for more than 100 messages. If there are two languages | ||
+ | |||
+ | If you want to work with all chats that contain a specific language in more than 100 messages, use the query '' | ||
+ | |||
+ | For an overview of languages | ||
+ | Ueberwasser, | ||
+ | |||
+ | |||
+ | ===== Languages and varieties per message ===== | ||
+ | Information about the main language of a message is stored in the annotation | ||
+ | |||
+ | Available | ||
+ | * fra: French | ||
+ | * ita: Italian | ||
+ | * roh: Any variety of Romansh | ||
+ | * gsw: dialectal German | ||
+ | * deu: non-dialectal German | ||
+ | * eng: English | ||
+ | * spa: Spanish | ||
+ | * sla: Any Slavic language | ||
- | We thus considered the main language (as defined above) that provided the most messages to be the top main language and assigned it to the according subcorpus. If you want to work with all chats that contain a specific language in more than 100 messages, use the query //msg & meta:: | + | Romansh varieties: |
- | For an overview over languages in the corpus consult: | + | * roh-ja: Jauer Romansh |
- | Ueberwasser, | + | * roh-sr: romontsch sursilvan |
+ | * roh-st: rumàntsch sutsilvan | ||
+ | * roh-sm: rumantsch surmiran | ||
+ | * roh-pt: rumauntsch puter | ||
+ | * roh-vl: rumantsch vallader | ||
+ | * roh-gr: rumantsch grischun |
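
The per-chat tagging rule described under "Languages and varieties per chat" can be illustrated with a minimal Python sketch. It is not the project's actual tooling: the function names `tag_chat` and `top_main_language` are hypothetical, and the sketch assumes that the threshold is 100 messages within the first 250 messages of a chat, that each message already carries one of the language codes listed above, and that the per-chat code list matches the per-message list. The "top main language" rule (the main language with the most messages) follows the wording of the earlier revision of this page.

```python
from collections import Counter

# Language codes as listed on this page (assumed identical for chats and messages).
LANGUAGE_CODES = ("fra", "ita", "roh", "gsw", "deu", "eng", "spa", "sla")


def tag_chat(message_langs, window=250, threshold=100):
    """Assign each language found in the first `window` messages to one attribute.

    Assumption: a language with at least `threshold` messages is tagged
    lang_100_and_more; a language with fewer (but at least one) message is
    tagged lang_less_than_100; absent languages get no tag.
    """
    counts = Counter(message_langs[:window])
    tags = {"lang_100_and_more": [], "lang_less_than_100": []}
    for code in LANGUAGE_CODES:
        n = counts.get(code, 0)
        if n >= threshold:
            tags["lang_100_and_more"].append(code)
        elif n > 0:
            tags["lang_less_than_100"].append(code)
    return tags


def top_main_language(message_langs, window=250, threshold=100):
    """Return the main language with the most messages, or None if there is none.

    Ties are broken by language code, which is an arbitrary choice of this sketch.
    """
    counts = Counter(message_langs[:window])
    main = [(n, code) for code, n in counts.items()
            if code in LANGUAGE_CODES and n >= threshold]
    return max(main)[1] if main else None


# Hypothetical chat: only the first 250 messages are inspected, so "fra" is ignored.
chat = ["gsw"] * 120 + ["deu"] * 100 + ["eng"] * 30 + ["fra"] * 50
print(tag_chat(chat))
print(top_main_language(chat))
```

With a 250-message window and a threshold of 100, at most two languages can reach the threshold in one chat, which is why a rule for choosing a single sub-corpus is needed at all.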
01_corpus/02_preprocessing/04_languages.txt · Last modified: 2022/06/27 09:21 by 127.0.0.1