--- corpus:02_preprocessing:01_anonymization [2019/09/22 13:30] – simone
+++ 01_corpus:02_preprocessing:01_anonymization [2020/04/16 13:38] – simone
@@ Line 1: / Line 1: @@
-====== Anonymization ======
+====== 1.2.1 Anonymization ======
-The data in the corpus was anonymized my means with the same methodology that we already applied successfully in the [[http://www.sms4science.ch|SMS corpus]].
 ===== General privacy =====
-While the project did not have the intention of collecting private information about the informants (other than what they provided in the questionnaire), it could still not be assumed that the informants would not sent personal information in their SMS, so it was the team's task to remove specific pieces of information again. These steps were performed by means of computational linguistics. The stories told, however might still allow you to recognize individual informants. If that is the case, we asked to comply with common [[https://en.wikipedia.org/wiki/Research#Research_ethics|research ethics]] and keep that knowledge to yourself.
+People who write WhatsApp chats can be recognized either by the stories they tell or by the names and places they mention. While we had no way to address the former, we addressed the latter by means of computational linguistics. If you happen to recognize informants based on the remaining information, we ask you to comply with common [[https://en.wikipedia.org/wiki/Research#Research_ethics|research ethics]] and keep that knowledge to yourself.
 ===== First names =====
-A reference list of first names in different languages was used to remove all first names. As always with such a task, it was a balance act between precision and recall. On the one hand, all first names should be removed from the data, on the other hand no information that is homograph to a first name should get lost.
+First names were identified using freely available reference lists for the respective languages. It was then decided not to remove first names but to rotate them: the name Peter in a chat would not be replaced by a placeholder such as [FirstName], but by another name such as Ferdinand. This procedure has several advantages:
-To get the best possible result, it was decided to not actually remove first names, but to rotate them, meaning that the name Peter in an SMS would not get replaced by e.g. [FirstName], but by e.g. Ferdinand. This procedure has several advantages:
-The text remains easy to read.
-Because Peter is always replaced with Ferdinand, all occurrences of the same name remain the same. Conversations can therefor easier be recognized as such.
-Names that did not get replaced because of homography are not recognizable as such, i.e. if the name Minna appears in an SMS, nobody can know, whether this is a replaced name or whether it is a name that was not replaced because it is a homograph to some rare word in Romansh. The scientist working with the data will therefor always assume that first names he comes across are actually not the real names used in the SMS.
-Tests show, that more than 95% of all first names were in fact removed.
-===== Lastnames =====
+  - The text remains easy to read.
-Only very few last names can in fact be found in the data. Because of this limitation, the same procedure as with first names could not be applied, because additionally some of the last names used are very rare if not unique. It was therefor decided to replace all last names with [LastName] instead. In a combined effort of manually analyzing and means of computer linguistics, more than 95% of all last names were removed.
+  - Because Peter is always replaced with Ferdinand, all occurrences of the same name remain the same. Conversations can therefore be followed more easily.
-Numbers
-In an effort to remove information about phone numbers, bank accounts etc., all numbers with three and more digits where removed and each digit was replaced with one N. The phone number 079 987 65 43 would thus become NNN NNN 65 43, while 0799876543 would be NNNNNNNNNN. Reliability here lies with 100%.
+Tests showed that more than 95% of all first names were found and rotated in this way.
+We tried to assign the same sex to the rotated name as to the original one, so as to keep the text readable. Very often, this resulted in good replacements, but some names are gender-neutral. //Andrea//, for example, is a female name in German but a male one in Italian and Romansh. //Alex// can be Alexander or Alexandra, etc. So you might come across a message where the male name //Andrea// was replaced by //Olivia//. This will result in odd sentences like //I saw Olivia yesterday, he wasn't looking well//. If you keep in mind that the names were rotated, you will have no problem realizing that this is a man with a wrongly assigned replacement name.
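The rotation idea described above can be illustrated with a minimal Python sketch. This is not the project's actual tooling: the name lists, the helper names `build_rotation` and `rotate_names`, and the "shift by one position within a same-sex list" rule are all assumptions made for illustration. The sketch only shows the two properties the text relies on: the same original name always maps to the same replacement, and replacement happens in one pass so that a replaced name is not replaced again.

```python
import re

def build_rotation(names):
    """Map each name to the next one in its list; the last wraps to the first."""
    return {name: names[(i + 1) % len(names)] for i, name in enumerate(names)}

def rotate_names(text, rotation):
    """Replace every whole-word occurrence of a known name in a single pass."""
    if not rotation:
        return text
    # \b keeps the match to whole words, so "Peterson" is left untouched.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, rotation)) + r")\b")
    return pattern.sub(lambda m: rotation[m.group(1)], text)

# Hypothetical reference lists, kept separate so that replacements stay
# within the same list (a stand-in for "same sex as the original name").
MALE = ["Peter", "Ferdinand", "Hans"]
FEMALE = ["Anna", "Olivia", "Minna"]
rotation = {**build_rotation(MALE), **build_rotation(FEMALE)}

print(rotate_names("Peter saw Anna. Peter waved.", rotation))
# Ferdinand saw Olivia. Ferdinand waved.
```

A name such as //Andrea// that belongs in both lists would end up in only one of them in a scheme like this, which is how the mismatches described above can arise.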
+===== Last names =====
+Only very few last names can in fact be found in the data. It was decided to replace all last names with [LastName].
+===== Numbers =====
+In order to remove information about phone numbers, bank accounts etc., all numbers with three or more digits were masked: each digit was replaced with one N. Reliability here is 100%.
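As a rough sketch of this rule, assuming a simple regular-expression approach (the function name `mask_numbers` is hypothetical and not taken from the project), every run of three or more digits is replaced by the same number of N characters. The two phone-number examples come from the earlier revision of this page.

```python
import re

def mask_numbers(text):
    """Replace each run of three or more digits with one N per digit."""
    return re.sub(r"[0-9]{3,}", lambda m: "N" * len(m.group()), text)

print(mask_numbers("079 987 65 43"))  # NNN NNN 65 43
print(mask_numbers("0799876543"))     # NNNNNNNNNN
```

Because the rule looks at uninterrupted digit runs, the two-digit groups of a spaced phone number stay visible, as the first example shows.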
 ===== E-Mail addresses =====
-All email adresses were removed and replaced with xxx@yyy.ch, while keeping the number of characters. info@uzh.ch would therefore become xxxx@yyy.ch, while admin@google.com would become xxxxx@yyyyyy.com.
+All email addresses were removed and replaced with xxx@yyy.ch, while keeping the number of characters. info@uzh.ch would therefore become xxxx@yyy.ch, while admin@google.com would become xxxxx@yyyyyy.com.
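The length-preserving replacement can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the regular expression is a deliberately simple approximation of an email address, the name `mask_email` is hypothetical, and it is assumed that the part after the last dot of the domain is kept unchanged, as in the two examples above.

```python
import re

# Simplified pattern: local part, domain name, and final top-level domain.
EMAIL = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+)\.([A-Za-z]{2,})")

def mask_email(text):
    """Replace the local part with x's and the domain name with y's, one per character."""
    def repl(m):
        local, domain, tld = m.groups()
        return "x" * len(local) + "@" + "y" * len(domain) + "." + tld
    return EMAIL.sub(repl, text)

print(mask_email("info@uzh.ch"))       # xxxx@yyy.ch
print(mask_email("admin@google.com"))  # xxxxx@yyyyyy.com
```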
 ===== Street addresses =====