01_corpus:02_preprocessing:01_anonymization
Revision: 2020/04/16 13:38 by simone
====== 1.2.1 Anonymization ======

===== General privacy =====
People who write WhatsApp chats can be recognized either by the stories they tell or by the names and places they mention. While we had no way to address the former, we addressed the latter using methods from computational linguistics. If you happen to recognize informants based on the remaining information,

===== First names =====

First names were detected using freely available reference lists for the respective language. It was then decided not to remove first names but to rotate them, meaning that the name Peter in a chat would not get replaced by e.g. [FirstName],

  - The text remains easy to read.
  - Because Peter is always replaced with Ferdinand, every occurrence of a given name receives the same replacement. Conversations can therefore be followed more easily.

Tests showed that more than 95% of all first names were found and rotated in this way.

We tried to give each replacement name the same sex as the original, so as to keep the text readable. Very often, this resulted in good replacements,

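The rotation idea described above can be sketched as follows. This is only an illustration under stated assumptions: the name lists, the fixed shift of two positions, and the function names ''build_rotation'' and ''rotate_first_names'' are hypothetical and not taken from the project. The sketch keeps female and male names in separate pools so that a replacement has the same sex as the original, and it applies one fixed mapping so that the same name always receives the same replacement.

```python
import re

# Hypothetical reference lists; the project used freely available name
# lists for each language, which are not reproduced here.
FEMALE = ["Anna", "Julia", "Laura", "Maria", "Sara"]
MALE = ["Ferdinand", "Jonas", "Lukas", "Peter", "Thomas"]


def build_rotation(names, shift=2):
    """Map each name to the name `shift` positions later in the sorted list.

    The shift value is an assumption made for this sketch.
    """
    ordered = sorted(set(names))
    return {
        name: ordered[(i + shift) % len(ordered)]
        for i, name in enumerate(ordered)
    }


# Separate pools keep the sex of the replacement equal to the original.
ROTATION = {**build_rotation(FEMALE), **build_rotation(MALE)}


def rotate_first_names(text, rotation=ROTATION):
    """Replace every known first name in a single pass over the text."""
    # Longest names first, so a longer name wins over a shorter prefix.
    alternatives = sorted(rotation, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, alternatives)) + r")\b")
    return pattern.sub(lambda m: rotation[m.group(1)], text)


print(rotate_first_names("Peter meets Maria. Peter is late."))
```

Replacing all names in one regular-expression pass avoids chained replacements, where a name that has already been rotated would be rotated a second time.
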
===== Last names =====
Only very few last names occur in the data. It was decided to replace all last names with [LastName].

===== Numbers =====
In order to remove information about phone numbers, bank accounts etc., all numbers with three or more digits were masked: each digit was replaced with one N. Reliability here is 100%.

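A minimal sketch of the digit masking described above, assuming that a "number" means an uninterrupted run of digits; the function name ''mask_numbers'' and its parameter are hypothetical.

```python
import re


def mask_numbers(text, min_digits=3):
    """Replace each digit of every run of at least `min_digits` digits with N."""
    pattern = re.compile(r"\d{%d,}" % min_digits)
    # The replacement has the same length as the matched run of digits.
    return pattern.sub(lambda m: "N" * len(m.group()), text)


print(mask_numbers("Call 0791234567 at 12"))
```

Under this assumption, shorter runs such as the 12 in the example are left untouched.
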
===== E-mail addresses =====
All email addresses were removed and replaced with xxx@yyy.ch, while keeping the number of characters. info@uzh.ch would therefore become xxxx@yyy.ch,

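The length-preserving replacement can be sketched as below. The regular expression, the function name ''mask_emails'', and the handling of the domain are assumptions made for illustration: the part before the @ becomes a run of x, the domain labels become runs of y, and the top-level domain is kept unchanged. The source only gives the example info@uzh.ch, so the treatment of other top-level domains or of subdomains is a guess.

```python
import re

# Simplified address pattern for this sketch; it does not cover every
# valid email address.
EMAIL = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+)\.([A-Za-z]{2,})")


def mask_emails(text):
    """Mask email addresses while keeping the number of characters."""
    def replace(match):
        local, domain, tld = match.groups()
        # Keep dots inside the domain so the address shape stays visible.
        masked_domain = re.sub(r"[^.]", "y", domain)
        return "x" * len(local) + "@" + masked_domain + "." + tld

    return EMAIL.sub(replace, text)


print(mask_emails("Write to info@uzh.ch today"))
```
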
===== Street addresses =====
Street addresses were removed and replaced by [StreetAddress].

===== WWW addresses =====
WWW addresses were kept since the information they contain is publicly available.

===== City names =====
Names of cities were kept because they cannot be considered private information and because they may be important for understanding the text.
01_corpus/02_preprocessing/01_anonymization.txt · Last modified: 2022/06/27 09:21 by 127.0.0.1