<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="FeedCreator 1.8" -->
<?xml-stylesheet href="https://whatsup.linguistik.uzh.ch/lib/exe/css.php?s=feed" type="text/css"?>
<rdf:RDF
    xmlns="http://purl.org/rss/1.0/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
    <channel rdf:about="https://whatsup.linguistik.uzh.ch/feed.php">
        <title> - 01_corpus:02_preprocessing</title>
        <description></description>
        <link>https://whatsup.linguistik.uzh.ch/</link>
        <image rdf:resource="https://whatsup.linguistik.uzh.ch/_media/wiki/logo.png" />
        <dc:date>2026-04-14T22:01:08+00:00</dc:date>
        <items>
            <rdf:Seq>
                <rdf:li rdf:resource="https://whatsup.linguistik.uzh.ch/01_corpus/02_preprocessing/01_anonymization?rev=1656314505&amp;do=diff"/>
                <rdf:li rdf:resource="https://whatsup.linguistik.uzh.ch/01_corpus/02_preprocessing/02_without_permission?rev=1656314505&amp;do=diff"/>
                <rdf:li rdf:resource="https://whatsup.linguistik.uzh.ch/01_corpus/02_preprocessing/03_emojis?rev=1656314505&amp;do=diff"/>
                <rdf:li rdf:resource="https://whatsup.linguistik.uzh.ch/01_corpus/02_preprocessing/04_languages?rev=1656314505&amp;do=diff"/>
                <rdf:li rdf:resource="https://whatsup.linguistik.uzh.ch/01_corpus/02_preprocessing/05_technical_messages?rev=1656314505&amp;do=diff"/>
                <rdf:li rdf:resource="https://whatsup.linguistik.uzh.ch/01_corpus/02_preprocessing/06_pos?rev=1656314505&amp;do=diff"/>
                <rdf:li rdf:resource="https://whatsup.linguistik.uzh.ch/01_corpus/02_preprocessing/07_normalization?rev=1656314505&amp;do=diff"/>
            </rdf:Seq>
        </items>
    </channel>
    <image rdf:about="https://whatsup.linguistik.uzh.ch/_media/wiki/logo.png">
        <title></title>
        <link>https://whatsup.linguistik.uzh.ch/</link>
        <url>https://whatsup.linguistik.uzh.ch/_media/wiki/logo.png</url>
    </image>
    <item rdf:about="https://whatsup.linguistik.uzh.ch/01_corpus/02_preprocessing/01_anonymization?rev=1656314505&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2022-06-27T07:21:45+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>01_anonymization</title>
        <link>https://whatsup.linguistik.uzh.ch/01_corpus/02_preprocessing/01_anonymization?rev=1656314505&amp;do=diff</link>
        <description>1.2.1 Anonymization

General privacy

People who write WhatsApp chats can be recognized either by the stories they tell or by the names and places they mention. While we had no way to address the former, we addressed the latter by means of computational linguistics. If you happen to recognize informants based on the remaining information, we ask you to comply with common</description>
    </item>
    <item rdf:about="https://whatsup.linguistik.uzh.ch/01_corpus/02_preprocessing/02_without_permission?rev=1656314505&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2022-06-27T07:21:45+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>02_without_permission</title>
        <link>https://whatsup.linguistik.uzh.ch/01_corpus/02_preprocessing/02_without_permission?rev=1656314505&amp;do=diff</link>
        <description>1.2.2 Data without permission

During data collection, not all communication partners in all chats gave permission for their texts to be used. In chats where we did not obtain permission from every participant, we still used the messages for which we had permission and disguised those for which we did not.</description>
    </item>
    <item rdf:about="https://whatsup.linguistik.uzh.ch/01_corpus/02_preprocessing/03_emojis?rev=1656314505&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2022-06-27T07:21:45+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>03_emojis</title>
        <link>https://whatsup.linguistik.uzh.ch/01_corpus/02_preprocessing/03_emojis?rev=1656314505&amp;do=diff</link>
        <description>1.2.3 Emojis

Emojis are Unicode characters. The WhatsApp application uses its own fonts so that emojis look the same on all operating systems. Our corpus browsers can display emojis, but they are rendered in the font used on the viewer's system, so there is no guarantee that an emoji looks on your screen as it did in the original text.</description>
    </item>
    <item rdf:about="https://whatsup.linguistik.uzh.ch/01_corpus/02_preprocessing/04_languages?rev=1656314505&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2022-06-27T07:21:45+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>04_languages</title>
        <link>https://whatsup.linguistik.uzh.ch/01_corpus/02_preprocessing/04_languages?rev=1656314505&amp;do=diff</link>
        <description>1.2.4 Languages and varieties

Languages and varieties per chat

To assign language tags to each chat, we looked at the first 250 messages and assigned two possible attributes per language:

	*  lang_100_and_more: Languages that were found in more than 100 messages</description>
    </item>
    <item rdf:about="https://whatsup.linguistik.uzh.ch/01_corpus/02_preprocessing/05_technical_messages?rev=1656314505&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2022-06-27T07:21:45+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>05_technical_messages</title>
        <link>https://whatsup.linguistik.uzh.ch/01_corpus/02_preprocessing/05_technical_messages?rev=1656314505&amp;do=diff</link>
        <description>1.2.5 Technical messages

WhatsApp sometimes generates messages when something happens in a chat. If, for example, a user leaves a group, the message may read Peter left or Peter hat die Gruppe verlassen. To obtain a single wording for these messages, we encoded them, e.g.</description>
    </item>
    <item rdf:about="https://whatsup.linguistik.uzh.ch/01_corpus/02_preprocessing/06_pos?rev=1656314505&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2022-06-27T07:21:45+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>06_pos</title>
        <link>https://whatsup.linguistik.uzh.ch/01_corpus/02_preprocessing/06_pos?rev=1656314505&amp;do=diff</link>
        <description>1.2.6 Part of Speech Tagging

Some sub-corpora have been annotated with part-of-speech tags. This concerns WUS_DIALOG_GSW, WUS_FRA, WUS_FRA_DEMOG, WUS_ITA, and WUS_ITA_DEMOG.

French

The whole French corpus has been annotated with MElt (Modified French TreeBank) using the tag set</description>
    </item>
    <item rdf:about="https://whatsup.linguistik.uzh.ch/01_corpus/02_preprocessing/07_normalization?rev=1656314505&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2022-06-27T07:21:45+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>07_normalization</title>
        <link>https://whatsup.linguistik.uzh.ch/01_corpus/02_preprocessing/07_normalization?rev=1656314505&amp;do=diff</link>
        <description>1.2.7 Normalization

Normalization is the task of &quot;translating&quot; non-standard language data into standard language. It can be performed manually or automatically with computational linguistics tools.

In the case of our corpus, we have manually normalized some data in the Swiss German dialect, resulting in the corpus WUS_DIALOG_GSW (5 chats, 34,683 tokens).</description>
    </item>
</rdf:RDF>
