01_corpus:01_subcorpora
Differences
This shows you the differences between two versions of the page.
01_corpus:subcorpora [2019/11/06 09:26] – [More information about the subcorpora] simone | 01_corpus:01_subcorpora [2022/01/05 14:44] – Simone Ueberwasser | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== 1.5 Sub-corpora ====== | + | ====== 1.1 Sub-corpora ====== |
- | Based on the annotation of the languages per chat, different sub-corpora were created. | + | Based on the annotation of the languages per chat, different sub-corpora were created. |
The following basic considerations were applied when creating the sub-corpora: | The following basic considerations were applied when creating the sub-corpora: | ||
Line 6: | Line 6: | ||
===== Definitions for sub-corpora ===== | ===== Definitions for sub-corpora ===== | ||
- | * Each chat was to be assigned to only one language-sub-corpus. As mentioned [[01_corpus: | + | * Each chat was to be assigned to only one language-sub-corpus. |
* Additionally, | * Additionally, | ||
* Where additional tasks were performed on individual chats (e.g. normalization or part-of-speech tagging) we created additional sub-corpora per language. | * Where additional tasks were performed on individual chats (e.g. normalization or part-of-speech tagging) we created additional sub-corpora per language. | ||
Line 13: | Line 13: | ||
===== Main sub-corpora ===== | ===== Main sub-corpora ===== | ||
- | These rules result in the following main corpora: | ||
* WUS: All data, i.e. the whole corpus | * WUS: All data, i.e. the whole corpus | ||
* WUS_DEU: All data where non-dialectal German provides the most messages | * WUS_DEU: All data where non-dialectal German provides the most messages | ||
- | * WUS_DEU_DEMOG: | + | * WUS_DEU_DEMOG: |
* WUS_FRA: All data where French provides the most messages | * WUS_FRA: All data where French provides the most messages | ||
- | * WUS_FRA_DEMOG: | + | * WUS_FRA_DEMOG: |
* WUS_GSW: All data where dialectal German provides the most messages | * WUS_GSW: All data where dialectal German provides the most messages | ||
- | * WUS_GSW_DEMOG: | + | * WUS_GSW_DEMOG: |
* WUS_ITA: All data where Italian provides the most messages | * WUS_ITA: All data where Italian provides the most messages | ||
- | * WUS_ITA_DEMOG: | + | * WUS_ITA_DEMOG: |
- | * *WUS_ITA_TT: | + | |
* WUS_ROH: All data where Romansh provides the most messages | * WUS_ROH: All data where Romansh provides the most messages | ||
- | * WUS_ROH_DEMOG: | + | * WUS_ROH_DEMOG: |
+ | In addition to these corpora, the browser also shows corpora with lowercase names (e.g. deu-rftagged, | ||
===== Smaller corpora ===== | ===== Smaller corpora ===== | ||
- | Next to these main corpora, there are some smaller corpora: | + | Next to these main sub-corpora, there are some smaller |
- | * WUS_SMALL: | + | * WUS_SMALL: |
- | * WUS_SMALL_DEMOG: | + | * WUS_SMALL_DEMOG: |
- | * WUSdemographics: | + | * WUSdemographics: |
- | * WUS_ARGDROP and WUS_ARGDROP_language: | + | * WUS_ARGDROP and WUS_ARGDROP_language: |
+ | ===== Other corpora in the browsing tool ===== | ||
+ | In addition to these corpora, the browser also shows corpora with lowercase names (e.g. deu-rftagged, | ||
===== More information about the sub-corpora ===== | ===== More information about the sub-corpora ===== |
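The assignment rule stated in the definitions (each chat belongs to exactly one language sub-corpus, namely the one whose language provides the most messages) can be sketched in a few lines of Python. This is only an illustration, not the project's actual tooling: the function name `assign_subcorpus`, the input format (one language code per message), the alphabetical tie-break, and the fallback to `WUS` for chats without any message in the five languages are all assumptions that the page does not document.

```python
from collections import Counter

# Language codes taken from the sub-corpus names listed on this page.
LANGUAGES = ("DEU", "FRA", "GSW", "ITA", "ROH")


def assign_subcorpus(message_languages):
    """Return a sub-corpus name for one chat (hypothetical helper).

    `message_languages` holds one language code per message in the chat.
    """
    # Count only messages annotated with one of the five corpus languages.
    counts = Counter(lang for lang in message_languages if lang in LANGUAGES)
    if not counts:
        # Assumption: such a chat is only part of the whole corpus.
        return "WUS"
    top = max(counts.values())
    # Assumption: ties are broken alphabetically; the page does not say how.
    winner = min(lang for lang, n in counts.items() if n == top)
    return f"WUS_{winner}"


if __name__ == "__main__":
    # A chat with two dialectal German messages and one standard German one.
    print(assign_subcorpus(["GSW", "GSW", "DEU"]))
```

Because the function always returns a single name, the "only one language sub-corpus per chat" constraint holds by construction; the tie-break is the one place where a real pipeline would need a documented decision.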
01_corpus/01_subcorpora.txt · Last modified: 2022/06/27 09:21 by 127.0.0.1