01_corpus:02_preprocessing:04_languages

Differences

This shows you the differences between two versions of the page.

01_corpus:03_preprocessing:04_language_per_chat [2020/04/16 11:29] simone
01_corpus:02_preprocessing:04_languages [2022/06/27 09:21] (current) – external edit 127.0.0.1
Line 1: Line 1:
-====== 1.3.4 Language per chat ======
 +====== 1.2.4 Languages and varieties ======
-In order to assign a language tagging to each chat, student helpers read through the first 250 messages and assigned two possible attributes per language:
 +
 +===== Languages and varieties per chat =====
 +In order to assign a language tag to each chat, we looked at the first 250 messages and assigned two possible attributes per language:

-  * lang_100_and_more
-  * lang_less_than_100
 +
 +  * lang_100_and_more: Languages that were found in more than 100 messages
 +  * lang_less_than_100: Languages that were less frequent

 for the following languages:
Line 15: Line 17:
   * sla: Any Slavic language
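
The two chat-level attributes can be illustrated with a small counting sketch. This is not the project's actual procedure (the tagging was done by reading the messages); it assumes a hypothetical list of per-message language codes as input, and it assumes that several codes are joined with ", " in alphabetical order, as suggested by the query value "fra, gsw".

```python
from collections import Counter

WINDOW = 250      # number of leading messages inspected per chat (from the text)
THRESHOLD = 100   # "more than 100 messages" (from the text)


def tag_chat(message_langs):
    """Return (lang_100_and_more, lang_less_than_100) for one chat.

    message_langs is a hypothetical input: one language code per message,
    in chat order. Joining codes with ", " in sorted order is an assumption.
    """
    counts = Counter(message_langs[:WINDOW])
    frequent = sorted(lang for lang, n in counts.items() if n > THRESHOLD)
    rare = sorted(lang for lang, n in counts.items() if n <= THRESHOLD)
    return ", ".join(frequent), ", ".join(rare)


# Only the first 250 entries are counted, so the trailing "ita" block is ignored.
example = ["gsw"] * 120 + ["fra"] * 110 + ["eng"] * 20 + ["ita"] * 50
print(tag_chat(example))
```

Because only 250 messages are inspected, at most two languages can exceed the threshold for any one chat.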
  
-**Please note:** In the browsing tool ANNIS, we created [[01_corpus:01_subcorpora|sub-corpora]] per language, where each message appears in one and only one sub-corpus, even though there may be several languages annotated as lang_100_and_more for a specific chat. If you want to work with all chats that contain a specific language in more than 100 messages, use the query //msg & meta::lang_100_and_more=fra, gsw”// on the whole corpus.
 +Please note: In the browsing tool ANNIS, we created sub-corpora per language, where each message appears in one and only one sub-corpus. In most cases, this is the language that occurs in more than 100 messages. If two languages each occur in more than 100 messages, we arbitrarily prioritized the languages as follows: ROH > GSW > FRA > DEU > ITA > ENG/SPA/SLA.
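
The priority rule can be sketched as a small lookup. This is an illustration under assumptions, not the project's code: the function name and input format are hypothetical, the input is taken to be a lang_100_and_more value such as "fra, gsw", and because the text does not order ENG, SPA and SLA among themselves, the sketch simply returns whichever of them is listed first.

```python
# Priority order as stated in the text; the last tier is unordered there.
PRIORITY = ["roh", "gsw", "fra", "deu", "ita"]
LAST_TIER = {"eng", "spa", "sla"}


def pick_subcorpus(lang_100_and_more):
    """Return the sub-corpus language for a chat, or None if no code is given.

    lang_100_and_more: comma-separated codes, e.g. "fra, gsw" (assumed format).
    """
    codes = [c.strip() for c in lang_100_and_more.split(",") if c.strip()]
    for lang in PRIORITY:
        if lang in codes:
            return lang
    # Assumption: within ENG/SPA/SLA, take the first code as listed.
    tail = [c for c in codes if c in LAST_TIER]
    return tail[0] if tail else None


print(pick_subcorpus("fra, gsw"))
```

Chats without any language above the threshold are not covered by the rule in the text, so the sketch returns None for them.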
 + 
 +If you want to work with all chats that contain a specific language in more than 100 messages, use the query ''msg & meta::lang_100_and_more="fra, gsw"'' on the whole corpus. 
 + 
 +For an overview of the languages and varieties in the corpus, consult:
 +Ueberwasser, Simone; Stark, Elisabeth (2017): "What’s up, Switzerland? A corpus-based research project in a multilingual country". In: Linguistik online, 84/5, 105-126. https://bop.unibe.ch/linguistik-online/article/view/3849/5834 
 + 
 + 
 +===== Languages and varieties per message ===== 
 +The main language of each message is stored in the annotation //most_likely_lang// and can be queried with, e.g., ''most_likely_lang="gsw"''.
 + 
 +Available languages: 
 +  * fra: French 
 +  * ita: Italian 
 +  * roh: Any variety of Romansh 
 +  * gsw: dialectal German as used in Switzerland 
 +  * deu: non-dialectal German 
 +  * eng: English 
 +  * spa: Spanish 
 +  * sla: Any Slavic language 
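
When such queries are generated from a script, a small helper can guard against typos in the language code. The sketch below is hypothetical (the helper name is not part of ANNIS); it only builds the query string in the form shown above and checks the code against the eight languages listed.

```python
# Message-level language codes as listed in the text.
MESSAGE_LANGS = {"fra", "ita", "roh", "gsw", "deu", "eng", "spa", "sla"}


def message_lang_query(code):
    """Build a query string of the form most_likely_lang="<code>"."""
    if code not in MESSAGE_LANGS:
        raise ValueError(f"unknown language code: {code!r}")
    return f'most_likely_lang="{code}"'


print(message_lang_query("gsw"))
```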
 + 
 +Romansh varieties:
  
-For an overview over languages in the corpus consult:
-Ueberwasser, Simone; Stark, Elisabeth (2017): "What’s up, Switzerland? A corpus-based research project in a multilingual country". In: Linguistik online, 84/5, 105-126. https://bop.unibe.ch/linguistik-online/article/view/3849/5834
 +  * roh-ja: Jauer Romansh
 +  * roh-sr: romontsch sursilvan
 +  * roh-st: rumàntsch sutsilvan
 +  * roh-sm: rumantsch surmiran
 +  * roh-pt: rumauntsch puter 
 +  * roh-vl: rumantsch vallader 
 +  * roh-gr: rumantsch grischun 
01_corpus/02_preprocessing/04_languages.1587029373.txt.gz · Last modified: 2022/06/27 09:21 (external edit)
