corpus:00_corpus

Differences

This shows you the differences between two versions of the page.

corpus:00_corpus [2019/09/23 09:28] simone
corpus:00_corpus [2019/09/23 13:48] (current) – removed simone
Line 1: Line 1:
-====== The corpus ====== 
- 
-The corpus consists of 617 chats that were submitted by members of the Swiss public in 2014 through a [[corpus:01_collection|fixed procedure]], which was publicized in the press to attract participants. Each chat was checked for [[corpus:02_preprocessing|permission]] to use it, and chats that had to be excluded were [[corpus:02_preprocessing:03_removed|removed]]. Furthermore, the [[corpus:demographics|demographic data]] that were provided were linked to the chats. 
- 
-In a first step, the most basic processing of the data was carried out so that the project members could work with the data. This included the [[corpus:02_preprocessing:01_anonymization|anonymization]] and the annotation of a [[corpus:02_preprocessing:02_language_per_chat|main language]] per chat, which in turn allowed the creation of [[corpus:subcorpora|subcorpora]]. 
- 
-In a later step, more [[corpus:03_annotations|annotations]] were applied to the corpus. These included a more detailed annotation of [[corpus:02_preprocessing:02_languages|languages]] (i.e. each message was annotated for its language, as opposed to the chat-level annotation performed in the first step), [[corpus:pos|part-of-speech annotations]], and the [[corpus:normalization|normalization]] of the German dialectal data. 
- 
  
corpus/00_corpus.1569223731.txt.gz · Last modified: 2022/06/27 09:21 (external edit)
