01_corpus:02_preprocessing:04_languages

Differences

This shows you the differences between two versions of the page.

01_corpus:03_preprocessing:04_languages [2020/04/16 11:50] simone
01_corpus:02_preprocessing:04_languages [2020/05/04 13:51] simone
Line 1: Line 1:
-====== 1.3.4 Languages and varieties ======
+====== 1.2.4 Languages and varieties ======
 
 ===== Languages and varieties per chat =====
-In order to assign a language tagging to each chat, student helpers read through the first 250 messages and assigned two possible attributes per language:
+In order to assign a language tagging to each chat, we looked at the first 250 messages and assigned two possible attributes per language:
 
-  * lang_100_and_more
-  * lang_less_than_100
+  * lang_100_and_more: Languages that were found in more than 100 messages
+  * lang_less_than_100: Languages that were less frequent
 
 for the following languages:
Line 17: Line 17:
   * sla: Any Slavic language
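
The chat-level tagging described above can be illustrated with a minimal Python sketch. It is not the project's actual procedure (the text describes manual inspection); it assumes, hypothetically, that each message already carries a single language code, and the function name `tag_chat_languages` is made up for illustration. Only the 250-message window, the 100-message threshold and the two attribute names come from the page.

```python
from collections import Counter

SAMPLE_SIZE = 250  # number of leading messages inspected per chat (from the text)
THRESHOLD = 100    # "more than 100 messages" (from the text)


def tag_chat_languages(message_langs):
    """Return the two chat-level attributes as sorted lists of language codes.

    message_langs is a hypothetical list with one language code per message,
    in chat order. A count of exactly 100 is ambiguous in the source; this
    sketch treats it as "less frequent".
    """
    counts = Counter(message_langs[:SAMPLE_SIZE])
    return {
        "lang_100_and_more": sorted(
            lang for lang, n in counts.items() if n > THRESHOLD
        ),
        "lang_less_than_100": sorted(
            lang for lang, n in counts.items() if n <= THRESHOLD
        ),
    }


if __name__ == "__main__":
    chat = ["fra"] * 120 + ["gsw"] * 110 + ["eng"] * 5
    print(tag_chat_languages(chat))
```

Languages that occur only after the first 250 messages are ignored in this sketch, which mirrors the sampling window mentioned in the text.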
  
-**Please note:** In the browsing tool ANNIS, we created [[01_corpus:01_subcorpora|sub-corpora]] per language, where each message appears in one and only one sub-corpus, even though there may be several languages annotated as lang_100_and_more for a specific chat. If you want to work with all chats that contain a specific language in more than 100 messages, use the query //msg & meta::lang_100_and_more=“fra, gsw”// on the whole corpus.
+Please note: In the browsing tool ANNIS, we created sub-corpora per language, where each message appears in one and only one sub-corpus. In most cases, this is the language that accounts for more than 100 messages. If two languages each account for more than 100 messages, we arbitrarily prioritized the languages as follows: ROH > GSW > FRA > DEU > ITA > ENG/SPA/SLA.
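
The priority order in the note above can be read as a simple lookup. The following Python sketch shows one possible reading; the function name `pick_subcorpus` is hypothetical, and two behaviours are assumptions not stated on the page: ties among ENG, SPA and SLA are resolved alphabetically, and a chat without any language above the threshold yields `None`.

```python
# Priority order taken from the note; ENG, SPA and SLA share the lowest rank.
PRIORITY = [("roh",), ("gsw",), ("fra",), ("deu",), ("ita",), ("eng", "spa", "sla")]


def pick_subcorpus(lang_100_and_more):
    """Pick a single sub-corpus language for a chat.

    lang_100_and_more is the list of language codes annotated as
    lang_100_and_more for the chat.
    """
    candidates = set(lang_100_and_more)
    for rank in PRIORITY:
        matches = sorted(candidates.intersection(rank))
        if matches:
            # Assumption: a tie within the shared lowest rank is
            # resolved alphabetically.
            return matches[0]
    # Assumption: the page does not say what happens when no language
    # exceeds the threshold.
    return None


if __name__ == "__main__":
    print(pick_subcorpus(["fra", "gsw"]))
```

For a chat annotated with both fra and gsw, this sketch selects gsw, in line with GSW > FRA in the note.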
  
-For an overview over languages in the corpus consult:
-Ueberwasser, Simone; Stark, Elisabeth (2017)2017: "What’s up, Switzerland? A corpus-based research project in a multilingual country". In: Linguistik online, 84/5, 105-126. https://bop.unibe.ch/linguistik-online/article/view/3849/5834
+If you want to work with all chats that contain a specific language in more than 100 messages, use the query ''msg & meta::lang_100_and_more="fra, gsw"'' on the whole corpus.
+
 +For an overview over languages and varieties in the corpus consult: 
 +Ueberwasser, Simone; Stark, Elisabeth (2017): "What’s up, Switzerland? A corpus-based research project in a multilingual country". In: Linguistik online, 84/5, 105-126. https://bop.unibe.ch/linguistik-online/article/view/3849/5834 
 + 
 + 
 +===== Languages and varieties per message ===== 
 +The main language of a message is stored in the annotation //most_likely_lang// and can thus be queried with, e.g., ''most_likely_lang="gsw"''.
 + 
 +Available languages: 
 +  * fra: French 
 +  * ita: Italian 
 +  * roh: Any variety of Romansh 
 +  * gsw: dialectal German as used in Switzerland 
 +  * deu: non-dialectal German 
 +  * eng: English 
 +  * spa: Spanish 
 +  * sla: Any Slavic language 
 + 
 +Romansh varieties: 
 + 
 +  * roh-ja: Jauer Romansh 
 +  * roh-sr: romontsch sursilvan 
 +  * roh-st: rumàntsch sutsilvan 
 +  * roh-sm: rumantsch surmiran 
 +  * roh-pt: rumauntsch puter 
 +  * roh-vl: rumantsch vallader 
 +  * roh-gr: rumantsch grischun 
01_corpus/02_preprocessing/04_languages.txt · Last modified: 2022/06/27 09:21 by 127.0.0.1
