01_corpus:02_preprocessing:04_languages

Differences

This shows you the differences between two versions of the page.

01_corpus:03_preprocessing:03_language_per_chat [2020/04/14 15:18] – ↷ Page moved from 01_corpus:02_preprocessing:03_language_per_chat to 01_corpus:03_preprocessing:03_language_per_chat simone
01_corpus:02_preprocessing:04_languages [2022/06/27 09:21] (current) – external edit 127.0.0.1
Line 1: Line 1:
-====== Language per chat ======
+====== 1.2.4 Languages and varieties ======
-In a first step and in order to provide adequate data to the researchers in the team, student helpers looked at every chat. Their approach was:
+
  
-  * Read through the individual chat until you have come across 100 messages in one and the same language (or variety in the case of German, where we differentiate between the Swiss German dialect and non-dialectal German).
+===== Languages and varieties per chat =====
-  * If at this point, any other language/variety provides more than 50 messages, read on until you have read a total of 250 messages. If, in the process of this reading, another language comes to 100 messages, consider both languages as main languages. If not, only the language providing more than 100 messages is the main language.
+In order to assign a language tag to each chat, we looked at the first 250 messages and assigned two possible attributes per language:
-  * Mark the chat as containing one or more main languages (e.g. attribute: lang_100_and_more="fra, gsw"), also as lang_less_than_100="deu, eng, gsw", e.g. when only a few messages in these 3 languages are contained in the same chat.
+
-  * Take note of the other languages found in the course of this process (e.g. attribute: contains_eng="true").
+
  
-Available languages:
+  * lang_100_and_more: Languages that were found in more than 100 messages
 +  * lang_less_than_100: Languages that were less frequent 
 + 
 +for the following languages:
   * fra: French
   * ita: Italian
Line 17: Line 17:
   * sla: Any Slavic language
  
-**Please note:** In the browsing tool ANNIS, we created [[01_corpus:subcorpora|sub-corpora]] per language, where each message appears in one and only one sub-corpus, even though there may be several languages annotated as lang_100_and_more for a specific chat. In order to assign a chat to a sub-corpus, we arbitrarily prioritized the languages: ROH > GSW > FRA > DEU > ITA > ENG/SPA/SLA. Thus, e.g. a chat with English and German annotated as lang_100_and_more languages was assigned to the German subcorpus; a chat with the annotation lang_100_and_more for Romansh and any other languages annotated as lang_100_and_more will only be found in the WUS_ROH(_DEMOG) subcorpus.
+Please note: In the browsing tool ANNIS, we created sub-corpora per language, where each message appears in one and only one sub-corpus. In most cases, this is the sub-corpus of the language that provides more than 100 messages in the chat. If there are two languages providing more than 100 messages, we arbitrarily prioritized the languages: ROH > GSW > FRA > DEU > ITA > ENG/SPA/SLA.
 + 
 +If you want to work with all chats that contain a specific language in more than 100 messages, use the query ''msg & meta::lang_100_and_more="fra, gsw"'' on the whole corpus. 
 + 
 +For an overview of the languages and varieties in the corpus, consult:
 +Ueberwasser, Simone; Stark, Elisabeth (2017): "What’s up, Switzerland? A corpus-based research project in a multilingual country". In: Linguistik online, 84/5, 105-126. https://bop.unibe.ch/linguistik-online/article/view/3849/5834
 + 
 + 
 +===== Languages and varieties per message ===== 
 +The main language of a message is stored in the annotation //most_likely_lang// and can thus be queried with e.g. ''most_likely_lang="gsw"''.
 + 
 +Available languages:
 +  * fra: French 
 +  * ita: Italian 
 +  * roh: Any variety of Romansh 
 +  * gsw: dialectal German as used in Switzerland 
 +  * deu: non-dialectal German 
 +  * eng: English 
 +  * spa: Spanish 
 +  * sla: Any Slavic language
  
-We thus considered the main language (as defined above) that provided the most messages to be the top main language and assigned it to the according subcorpus. If you want to work with all chats that contain a specific language in more than 100 messages, use the query //msg & meta::lang_100_and_more="fra, gsw"// on the whole corpus.
+Romansh varieties:
  
-For an overview over languages in the corpus consult:
+  * roh-ja: Jauer Romansh
-Ueberwasser, Simone; Stark, Elisabeth (2017): "What’s up, Switzerland? A corpus-based research project in a multilingual country". In: Linguistik online, 84/5, 105-126. https://bop.unibe.ch/linguistik-online/article/view/3849/5834
+  * roh-sr: romontsch sursilvan
 +  * roh-st: rumàntsch sutsilvan
 +  * roh-sm: rumantsch surmiran
 +  * roh-pt: rumauntsch puter 
 +  * roh-vl: rumantsch vallader 
 +  * roh-gr: rumantsch grischun 
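
The sub-corpus prioritization stated in the current revision (ROH > GSW > FRA > DEU > ITA > ENG/SPA/SLA) can be illustrated with a minimal Python sketch. It is not the project's actual tooling: the function name `pick_subcorpus_language` and the `RANK` table are hypothetical, and because the page does not say how ties among ENG, SPA and SLA were resolved, the sketch simply assumes that the code listed first in the attribute wins.

```python
# Priority ranks taken from the page: ROH > GSW > FRA > DEU > ITA > ENG/SPA/SLA.
# ENG, SPA and SLA share the lowest rank (tie handling below is an assumption).
RANK = {"roh": 0, "gsw": 1, "fra": 2, "deu": 3, "ita": 4,
        "eng": 5, "spa": 5, "sla": 5}


def pick_subcorpus_language(lang_100_and_more):
    """Return the language code whose sub-corpus a chat would be assigned to.

    `lang_100_and_more` is the attribute value as shown on the page,
    e.g. "fra, gsw" (comma-separated language codes).
    """
    codes = [c.strip().lower() for c in lang_100_and_more.split(",") if c.strip()]
    if not codes:
        raise ValueError("no language codes given")
    unknown = [c for c in codes if c not in RANK]
    if unknown:
        raise ValueError(f"unknown language code(s): {unknown}")
    # min() returns the first element with the smallest rank, so ties among
    # eng/spa/sla fall back to the order in the attribute (assumption).
    return min(codes, key=RANK.__getitem__)


if __name__ == "__main__":
    print(pick_subcorpus_language("eng, deu"))
    print(pick_subcorpus_language("fra, gsw"))
```

The first call mirrors the example from the previous revision: a chat annotated with English and German as lang_100_and_more languages ends up in the German sub-corpus.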
01_corpus/02_preprocessing/04_languages.1586870336.txt.gz · Last modified: 2022/06/27 09:21 (external edit)