01_corpus:02_preprocessing:04_languages

Differences

This shows you the differences between two versions of the page.

01_corpus:03_preprocessing:04_languages [2020/04/16 12:32] simone
01_corpus:02_preprocessing:04_languages [2022/06/27 09:21] (current) – external edit 127.0.0.1
Line 1:
-====== 1.3.4 Languages and varieties ======
+====== 1.2.4 Languages and varieties ======
 
 ===== Languages and varieties per chat =====
-In order to assign a language tagging to each chat, we looked the first 250 messages and assigned two possible attributes per language:
+In order to assign a language tagging to each chat, we looked at the first 250 messages and assigned two possible attributes per language:
 
   * lang_100_and_more: Languages that were found in more than 100 messages
Line 19:
 Please note: In the browsing tool ANNIS, we created sub-corpora per language, where each message appears in one and only one sub-corpus. In most cases, this is the language that delivers more than 100 chats. If there are two languages providing more than 100 messages, we arbitrarily prioritized the languages: ROH > GSW > FRA > DEU > ITA > ENG/SPA/SLA.
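
The prioritization described in the note above can be sketched in a few lines of Python. This is only an illustration, not the project's actual preprocessing code: the function name ''pick_subcorpus'' and the ''RANK'' table are hypothetical, and the alphabetical tie-break among ENG/SPA/SLA is an assumption, since the page does not say how ties at the lowest rank are resolved.

```python
from typing import Iterable, Optional

# Rank taken from the order ROH > GSW > FRA > DEU > ITA > ENG/SPA/SLA.
# ENG, SPA and SLA share the lowest rank, as on the page.
RANK = {"roh": 0, "gsw": 1, "fra": 2, "deu": 3, "ita": 4,
        "eng": 5, "spa": 5, "sla": 5}


def pick_subcorpus(langs_100_and_more: Iterable[str]) -> Optional[str]:
    """Return the single sub-corpus language for a chat.

    `langs_100_and_more` holds the languages found in more than
    100 messages of the chat. Unknown codes are ignored.
    """
    candidates = [lang for lang in langs_100_and_more if lang in RANK]
    if not candidates:
        # Chats without any language above the threshold are not
        # covered by the rule quoted above.
        return None
    # Lowest rank wins; alphabetical order breaks ties (assumption).
    return min(candidates, key=lambda lang: (RANK[lang], lang))


if __name__ == "__main__":
    print(pick_subcorpus(["fra", "gsw"]))
```

For a chat tagged with both ''fra'' and ''gsw'', the sketch selects ''gsw'', matching the stated order.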
  
-If you want to work with all chats that contain a specific language in more than 100 messages, use the query msg & meta::lang_100_and_more=fra, gsw” on the whole corpus.
+If you want to work with all chats that contain a specific language in more than 100 messages, use the query ''msg & meta::lang_100_and_more="fra, gsw"'' on the whole corpus.
  
 For an overview over languages and varieties in the corpus consult:
-Ueberwasser, Simone; Stark, Elisabeth (2017)2017: "What’s up, Switzerland? A corpus-based research project in a multilingual country". In: Linguistik online, 84/5, 105-126. https://bop.unibe.ch/linguistik-online/article/view/3849/5834
+Ueberwasser, Simone; Stark, Elisabeth (2017): "What’s up, Switzerland? A corpus-based research project in a multilingual country". In: Linguistik online, 84/5, 105-126. https://bop.unibe.ch/linguistik-online/article/view/3849/5834
 + 
 + 
 +===== Languages and varieties per message ===== 
 +The information of the main language of a message is saved in the annotation //most_likely_lang// and can thus be queried with e.g. ''most_likely_lang="gsw"''
 + 
 +Available languages: 
 +  * fra: French 
 +  * ita: Italian 
 +  * roh: Any variety of Romansh 
 +  * gsw: dialectal German as used in Switzerland 
 +  * deu: non-dialectal German 
 +  * eng: English 
 +  * spa: Spanish 
 +  * sla: Any Slavic language 
 + 
 +Romansh varieties: 
 + 
 +  * roh-ja: Jauer Romansh 
 +  * roh-sr: romontsch sursilvan 
 +  * roh-st: rumàntsch sutsilvan 
 +  * roh-sm: rumantsch surmiran 
 +  * roh-pt: rumauntsch puter 
 +  * roh-vl: rumantsch vallader 
 +  * roh-gr: rumantsch grischun 
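
As a rough aid for building the per-message query shown above, the following Python sketch formats a ''most_likely_lang'' query string from one of the listed language codes. The helper ''message_language_query'' is hypothetical and not part of ANNIS. The sketch only accepts the eight codes under "Available languages"; whether the Romansh variety codes (e.g. ''roh-sr'') are valid values of ''most_likely_lang'' is not stated on this page, so they are rejected here.

```python
# Language codes and labels copied from the "Available languages" list.
LANGUAGES = {
    "fra": "French",
    "ita": "Italian",
    "roh": "Any variety of Romansh",
    "gsw": "dialectal German as used in Switzerland",
    "deu": "non-dialectal German",
    "eng": "English",
    "spa": "Spanish",
    "sla": "Any Slavic language",
}


def message_language_query(code: str) -> str:
    """Build the query string for messages whose main language is `code`."""
    if code not in LANGUAGES:
        raise ValueError(f"unknown language code: {code!r}")
    # Same form as the example on the page: most_likely_lang="gsw"
    return f'most_likely_lang="{code}"'


if __name__ == "__main__":
    print(message_language_query("gsw"))
```

The resulting string can be pasted into the ANNIS query field.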
01_corpus/02_preprocessing/04_languages.1587033122.txt.gz · Last modified: 2022/06/27 09:21 (external edit)
