Emojis
Emojis are Unicode characters. The application WhatsApp uses special fonts so that emojis look the same on all operating systems. In our corpus browsers, emojis can be displayed, but they are rendered in whatever font the user's system provides. It therefore cannot be guaranteed that an emoji in the original text looked the way it does on your screen.
Querying emojis directly is not always easy. We therefore decided to encode them in the texts as plain-letter tokens. For example, the emoji 😺 becomes emojiQsmilingcatfacewithopenmouth. This encoding makes it easy to find individual emojis or groups of emojis using regular expressions, e.g.:
- emojiQ.* finds all emojis
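The encoding and the query above can be illustrated with a minimal Python sketch. It rests on assumptions that are not confirmed by this page: that the encoded token is the prefix emojiQ followed by the lowercased Unicode character name with all non-letters removed, and that a query pattern is matched against whole tokens. The helper names `encode_emoji` and `find_emojis`, the sample tokens, and the second pattern are hypothetical and only serve the illustration.

```python
import re
import unicodedata

PREFIX = "emojiQ"  # prefix taken from the example on this page


def encode_emoji(char):
    """Encode one emoji as PREFIX + its Unicode name, letters only.

    Assumption: this mirrors the corpus encoding; the actual
    preprocessing rules may differ.
    """
    name = unicodedata.name(char)            # formal Unicode character name
    letters = re.sub(r"[^A-Za-z]", "", name)  # drop spaces, hyphens, digits
    return PREFIX + letters.lower()


def find_emojis(tokens, pattern=r"emojiQ.*"):
    """Return the tokens that match the pattern as a whole."""
    rx = re.compile(pattern)
    return [t for t in tokens if rx.fullmatch(t)]


# Hypothetical tokenized message, for illustration only.
tokens = ["see", "you", encode_emoji("😺"), "emojiQheavyblackheart", "tomorrow"]

all_emojis = find_emojis(tokens)                    # every encoded emoji
cat_emojis = find_emojis(tokens, r"emojiQ.*cat.*")  # only emojis with "cat" in the name
```

Matching against whole tokens keeps the greedy `.*` from running past the end of an emoji token, which it would do if the same pattern were applied to a full line of running text.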
01_corpus/02_preprocessing/03_emojis.1572791333.txt.gz · Last modified: 2022/06/27 09:21 (external edit)