1.1 Sub-corpora

Based on the annotation of the languages per chat, different sub-corpora were created.

The following basic considerations were applied when creating the sub-corpora:

Definitions for sub-corpora

  • Each chat is assigned to exactly one language sub-corpus.
  • In addition, we distinguish between chats for which we have demographic information for all participants and those for which we do not. In the former case, the sub-corpus gets the extension _DEMOG.
  • Where additional tasks were performed on individual chats (e.g. normalization or part-of-speech tagging), we created additional sub-corpora per language.

Main sub-corpora

  • WUS: All data, i.e. the whole corpus
  • WUS_DEU: All data where non-dialectal German provides the most messages
  • WUS_DEU_DEMOG: The subset of WUS_DEU for which we have demographic information from all communication partners.
  • WUS_FRA: All data where French provides the most messages
  • WUS_FRA_DEMOG: The subset of WUS_FRA for which we have demographic information from all communication partners.
  • WUS_GSW: All data where dialectal German provides the most messages
  • WUS_GSW_DEMOG: The subset of WUS_GSW for which we have demographic information from all communication partners.
  • WUS_ITA: All data where Italian provides the most messages
  • WUS_ITA_DEMOG: The subset of WUS_ITA for which we have demographic information from all communication partners.
  • WUS_ROH: All data where Romansh provides the most messages
  • WUS_ROH_DEMOG: The subset of WUS_ROH for which we have demographic information from all communication partners.
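
The naming scheme above can be illustrated with a minimal Python sketch. It is not the procedure actually used to build the corpus: the function name, the input format (one language code per message) and the tie-breaking rule are assumptions made for illustration only.

```python
from collections import Counter

# Language codes taken from the sub-corpus names listed above.
LANGUAGES = {"DEU", "FRA", "GSW", "ITA", "ROH"}


def assign_subcorpora(message_languages, all_demographics_known):
    """Return the names of the sub-corpora a single chat would belong to.

    message_languages: one language code per message (hypothetical format).
    all_demographics_known: True if demographic information is available
    for every participant of the chat.
    """
    counts = Counter(lang for lang in message_languages if lang in LANGUAGES)
    if not counts:
        # No message in one of the five languages: only the whole corpus.
        return ["WUS"]
    top = max(counts.values())
    # Assumption: ties are broken alphabetically; the page does not say how.
    majority = min(lang for lang, n in counts.items() if n == top)
    names = ["WUS", f"WUS_{majority}"]
    if all_demographics_known:
        names.append(f"WUS_{majority}_DEMOG")
    return names
```

For example, a chat with two messages in dialectal German and one in non-dialectal German, with demographic information for all participants, would land in WUS, WUS_GSW and WUS_GSW_DEMOG under these assumptions.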

Smaller corpora

Besides these main sub-corpora, there are some smaller sub-corpora:

Other corpora in the browsing tool

In addition to these corpora, the browser also lists corpora with lowercase names (e.g. deu-rftagged, ita-tagged, roh). These corpora contain data from our SMS project.

More information about the sub-corpora

The individual sub-corpora are documented within the browsing tool, including details such as their size. See the corresponding section for more information.

01_corpus/01_subcorpora.txt · Last modified: 2022/06/27 09:21 by 127.0.0.1
