1.2.6 Part of Speech Tagging

Some sub-corpora have been annotated with Part of Speech tags: WUS_DIALOG_GSW, WUS_FRA, WUS_FRA_DEMOG, WUS_ITA and WUS_ITA_DEMOG.


The whole French corpus has been annotated with Part of Speech tags and lemmas following the Modified French TreeBank. The available annotations are "mftb_pos" (for the part of speech) and "mftb_lem" (for the lemma). The following tags are used:

  • adjective
  • interrogative adjective
  • adverb
  • interrogative adverb
  • coordinating conjunction
  • object clitic pronoun
  • reflexive clitic pronoun
  • subject clitic pronoun
  • subordinating conjunction
  • determiner
  • interrogative determiner
  • foreign word
  • interjection
  • common noun
  • proper noun
  • preposition
  • preposition+determiner amalgam
  • preposition+pronoun amalgam
  • punctuation mark
  • prefix
  • full pronoun
  • relative pronoun
  • interrogative pronoun
  • indicative or conditional verb form
  • imperative verb form
  • infinitive verb form
  • past participle
  • present participle
  • subjunctive verb form

Swiss German dialect

Five chats of the Swiss German dialectal data (34,683 tokens) have been manually normalized and annotated for Part of Speech. The corresponding corpus is called WUS_DIALOG_GSW. Three annotations have been added to each token:

  • gloss: The manual normalization
  • tt_pos: Part of Speech annotation with TreeTagger, based on the manually normalized tokens
  • tt_lem: The lemma as assigned by TreeTagger

The following tags are used:

  • attributive adjective (including participles used adjectivally)
  • predicate adjective; adjective used adverbially
  • adverb (never used as attributive adjective)
  • preposition; left hand part of double preposition
  • preposition with fused article
  • postposition
  • right hand part of double preposition
  • article (definite or indefinite)
  • cardinal number (words or figures); also declined
  • foreign words (actual part of speech in original language may be appended, e.g. FMADV/ FM-NN)
  • interjection
  • co-ordinating conjunction
  • comparative conjunction or particle
  • preposition used to introduce infinitive clause
  • subordinating conjunction
  • adjective used as noun
  • names and other proper nouns
  • noun (but not adjectives used as nouns)
  • pronominal adverb
  • pronominal adverb used as relative
  • demonstrative determiner
  • demonstrative pronoun
  • indefinite determiner (whether occurring on its own or in conjunction with another determiner)
  • indefinite pronoun
  • personal pronoun
  • reflexive pronoun
  • possessive pronoun
  • possessive determiner
  • relative depending on a noun
  • relative pronoun (i.e. forms of der or welcher)
  • particle with adjective or adverb
  • answer particle
  • negative particle
  • indeclinable relative particle
  • separable prefix
  • infinitive particle zu
  • interrogative pronoun
  • interrogative determiner
  • interrogative adverb
  • interrogative adverb used as relative
  • interrogative pronoun used as relative
  • truncated form of compound
  • finite auxiliary verb
  • imperative of auxiliary
  • infinitive of auxiliary
  • past participle of auxiliary
  • finite modal verb
  • infinitive of modal
  • past participle of modal
  • finite full verb
  • imperative of full verb
  • infinitive of full verb
  • infinitive with incorporated zu
  • past participle of full verb

As in the French corpus, there are also combined tags such as VAFIN+PPER when a personal pronoun is agglutinated to a verb (hätti for 'hätte ich').
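
When processing the annotations, a combined tag may need to be broken up into its component tags. The following is a minimal sketch in Python, assuming that combined tags always use "+" as the separator, as in VAFIN+PPER above. The function names are hypothetical and not part of the corpus tooling.

```python
from typing import List


def split_combined_tag(tag: str) -> List[str]:
    """Split a combined tag such as 'VAFIN+PPER' into its component tags.

    Assumption: components are joined with '+', as in the example above.
    """
    return [part for part in tag.split("+") if part]


def is_combined(tag: str) -> bool:
    """Return True if the tag consists of more than one component."""
    return len(split_combined_tag(tag)) > 1


if __name__ == "__main__":
    # 'hätti' (for 'hätte ich') carries the combined tag VAFIN+PPER
    print(split_combined_tag("VAFIN+PPER"))
    print(is_combined("VAFIN"))
```

A simple tag is returned as a one-element list, so downstream code can treat simple and combined tags uniformly.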


The Italian corpus is also annotated with TreeTagger, but based on the original tokens, i.e. without manual normalization.

  • tt_pos: Part of Speech annotation with TreeTagger
  • tt_lem: The lemma as assigned by TreeTagger

The following PoS tags were used:

  • abbreviation
  • adjective
  • adverb
  • conjunction
  • definite article
  • indefinite article
  • foreign word
  • interjection
  • list symbol
  • noun
  • name
  • numeral
  • punctuation
  • preposition
  • preposition+article
  • pronoun
  • demonstrative pronoun
  • indefinite pronoun
  • interrogative pronoun
  • personal pronoun
  • possessive pronoun
  • reflexive pronoun
  • relative pronoun
  • sentence marker
  • symbol
  • verb conjunctive imperfect
  • verb conditional
  • verb conjunctive present
  • verb future tense
  • verb gerund
  • verb imperative
  • verb imperfect
  • verb infinitive
  • verb participle perfect
  • verb participle present
  • verb present
  • verb reflexive infinitive
  • verb simple past
01_corpus/02_preprocessing/06_pos.txt · Last modified: 2022/06/27 09:21 (external edit)