01_corpus:02_preprocessing:04_languages

01_corpus:03_preprocessing:04_languages [2020/04/16 12:32] simone
01_corpus:02_preprocessing:04_languages [2020/05/04 13:51] simone
Line 1: Line 1:
-====== 1.3.4 Languages and varieties ======
+====== 1.2.4 Languages and varieties ======
  
 ===== Languages and varieties per chat =====
-In order to assign a language tagging to each chat, we looked the first 250 messages and assigned two possible attributes per language:
+In order to assign a language tagging to each chat, we looked at the first 250 messages and assigned two possible attributes per language:
  
   * lang_100_and_more: Languages that were found in more than 100 messages
Line 19: Line 19:
 Please note: In the browsing tool ANNIS, we created sub-corpora per language, where each message appears in one and only one sub-corpus. In most cases, this is the language that provides more than 100 messages. If two languages each provide more than 100 messages, we prioritized the languages in an arbitrary order: ROH > GSW > FRA > DEU > ITA > ENG/SPA/SLA.
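
The prioritization described in the note above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing code: the function name ''pick_subcorpus'' is hypothetical, and because the note does not rank ENG, SPA and SLA against each other, their relative order here is an assumption.

```python
# Priority order taken from the note above.
# Assumption: the source groups ENG/SPA/SLA at the lowest rank without
# ordering them; the order eng, spa, sla below is chosen for illustration only.
PRIORITY = ["roh", "gsw", "fra", "deu", "ita", "eng", "spa", "sla"]


def pick_subcorpus(langs_100_and_more):
    """Return the highest-priority language code among the given codes.

    langs_100_and_more: iterable of language codes that were found in
    more than 100 messages of a chat. Returns None if no known code is given.
    """
    present = set(langs_100_and_more)
    for lang in PRIORITY:
        if lang in present:
            return lang
    return None


# A chat with more than 100 French and more than 100 Swiss German messages
# would be placed in the gsw sub-corpus under this ordering.
print(pick_subcorpus(["fra", "gsw"]))
```

A plain ordered list keeps the rule readable and makes the arbitrary nature of the ranking explicit.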
  
-If you want to work with all chats that contain a specific language in more than 100 messages, use the query msg & meta::lang_100_and_more=fra, gsw” on the whole corpus.
+If you want to work with all chats that contain a specific language in more than 100 messages, use the query ''msg & meta::lang_100_and_more="fra, gsw"'' on the whole corpus.
  
 For an overview of the languages and varieties in the corpus, consult:
-Ueberwasser, Simone; Stark, Elisabeth (2017)2017: "What’s up, Switzerland? A corpus-based research project in a multilingual country". In: Linguistik online, 84/5, 105-126. https://bop.unibe.ch/linguistik-online/article/view/3849/5834
+Ueberwasser, Simone; Stark, Elisabeth (2017): "What’s up, Switzerland? A corpus-based research project in a multilingual country". In: Linguistik online, 84/5, 105-126. https://bop.unibe.ch/linguistik-online/article/view/3849/5834
 + 
 + 
 +===== Languages and varieties per message ===== 
 +The main language of a message is stored in the annotation //most_likely_lang// and can be queried with, e.g., ''most_likely_lang="gsw"''.
 + 
 +Available languages: 
 +  * fra: French 
 +  * ita: Italian 
 +  * roh: Any variety of Romansh 
 +  * gsw: dialectal German as used in Switzerland 
 +  * deu: non-dialectal German 
 +  * eng: English 
 +  * spa: Spanish 
 +  * sla: Any Slavic language 
 + 
 +Romansh varieties: 
 + 
 +  * roh-ja: Jauer Romansh 
 +  * roh-sr: romontsch sursilvan 
 +  * roh-st: rumàntsch sutsilvan 
 +  * roh-sm: rumantsch surmiran 
 +  * roh-pt: rumauntsch puter 
 +  * roh-vl: rumantsch vallader 
 +  * roh-gr: rumantsch grischun 
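
The code lists above can be kept in a small lookup table when building per-message queries programmatically. The sketch below is an illustration only: the helper ''message_query'' is hypothetical, and it deliberately accepts only the eight main language codes, since the page does not state whether the Romansh variety codes are valid values of ''most_likely_lang''.

```python
# Language codes and descriptions as listed on this page.
LANGUAGES = {
    "fra": "French",
    "ita": "Italian",
    "roh": "Any variety of Romansh",
    "gsw": "dialectal German as used in Switzerland",
    "deu": "non-dialectal German",
    "eng": "English",
    "spa": "Spanish",
    "sla": "Any Slavic language",
}

# Romansh variety codes as listed on this page.
ROMANSH_VARIETIES = {
    "roh-ja": "Jauer Romansh",
    "roh-sr": "romontsch sursilvan",
    "roh-st": "rumàntsch sutsilvan",
    "roh-sm": "rumantsch surmiran",
    "roh-pt": "rumauntsch puter",
    "roh-vl": "rumantsch vallader",
    "roh-gr": "rumantsch grischun",
}


def message_query(code):
    """Build a query string of the form most_likely_lang="<code>".

    Assumption: only the main language codes are accepted here; whether
    variety codes can be used with most_likely_lang is not stated on the page.
    """
    if code not in LANGUAGES:
        raise ValueError(f"unknown language code: {code!r}")
    return f'most_likely_lang="{code}"'


print(message_query("gsw"))
```

Validating the code before building the string catches typos early instead of silently producing a query that matches nothing.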
01_corpus/02_preprocessing/04_languages.txt · Last modified: 2022/06/27 09:21 by 127.0.0.1
