02_browsing:01_sub_corpora

Differences

This shows you the differences between two versions of the page.

02_browsing:01_sub_corpora [2020/01/06 16:55] simone
02_browsing:01_sub_corpora [2022/06/27 09:21] (current) – external edit 127.0.0.1
Line 1: Line 1:
 ====== 2.1 Sub-corpora ======
-As explained in the [[01_corpus:subcorpora|section]] about the creation of the sub-corpora, you can work with either the full corpus WUS or you can select different sub-corpora that mainly depend on the main language within the chat. You find the list of sub-corpora in the bottom left in ANNIS.
+As explained in [[01_corpus:01_subcorpora|section 1.1]], you can either work with the full corpus WUS or select different sub-corpora. You will find the list of sub-corpora at the bottom left in ANNIS.
  
 The list of sub-corpora is also a good starting point to get information about available fields for your query, to get examples and statistics.
 +
 +Please keep in mind that you also see corpora with lowercase letters in the browser (e.g. deu-rftagged, ita-tagged, roh etc.). These corpora contain data from our [[https://wiki.linguistik.uzh.ch/sms4science|SMS project]].
  
 ===== Tokens and messages per sub-corpus =====
 Next to the name of each sub-corpus, you see the number of messages (marked as "Texts") and tokens. You can use these figures for statistics.
  
-**Please note**: If you work with corpora where not all participants gave their [[01_corpus:02_preprocessing:02_without_permission|permission]] to use their texts, the figure for tokens is off because messages without permission were replaced by texts like //redactedQ12tokens55characters//. These texts count as tokens, too. If you need statistics that depend on the number of tokens in a (sub-)corpus, you are advised to work with corpora with the extension [[01_corpus:subcorpora|_DEMOG]].
+**Please note**: If you work with corpora where not all participants gave their [[01_corpus:02_preprocessing:02_without_permission|permission]] to use their messages, the figure for tokens is off because messages without permission were replaced by placeholder texts like //redactedQ12tokens55characters//. These placeholders count as tokens, too. If you need statistics that depend on the number of tokens in a (sub-)corpus, you are advised to work with corpora with the extension [[01_corpus:01_subcorpora|_DEMOG]].
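
If you have to work with a corpus that contains such placeholders, one way to get a usable token figure is to export the tokens and leave the placeholders out of the count. The following is only a minimal sketch under stated assumptions: the function name ''count_permitted_tokens'' is hypothetical, the tokens are assumed to be available as a plain list of strings, and the placeholder pattern is inferred from the single example //redactedQ12tokens55characters// quoted above, so it may not cover every placeholder in the real data.

```python
import re

# Assumption: every placeholder looks like the one example in the note,
# i.e. "redactedQ<digits>tokens<digits>characters".
PLACEHOLDER = re.compile(r"redactedQ\d+tokens\d+characters")


def count_permitted_tokens(tokens):
    """Count the tokens that are not redaction placeholders.

    `tokens` is assumed to be a list of token strings exported from a
    (sub-)corpus; this helper is hypothetical and not part of ANNIS.
    """
    return sum(1 for token in tokens if not PLACEHOLDER.fullmatch(token))


if __name__ == "__main__":
    sample = ["hoi", "zäme", "redactedQ12tokens55characters", "ok"]
    # Three regular tokens remain once the placeholder is skipped.
    print(count_permitted_tokens(sample))
```

Using the corpora with the extension _DEMOG, as recommended above, remains the simpler option where it is available.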
  
-===== Information about the corpus =====
+===== Information about the (sub-)corpora =====
-When you press on the small <i> for information to the right of each (sub-)corpus name, you receive more information about the corpus. More specifically:
+When you click on the small ''i'' for information to the right of each (sub-)corpus name, you find more information about the corpus. More specifically:
-  * Some statistic information about the corpus including a link at the bottom for reference
+  * Some statistical information about the sub-corpus, including a URL pointing to this sub-corpus at the bottom.
   * Information about the version to be quoted in publications.
-  * If you need specific information about an individual chat, you can select the chat instead of the corpus in the top left to get information such as number of messages, number of speakers, etc. This is also an easy way to see which chats are integrated in this sub-corpus.
+  * If you need specific information about an individual chat, you can select the chat instead of the sub-corpus in the top left to get information such as number of messages, number of speakers, etc. This is also an easy way to see which chats are integrated in this sub-corpus.
  
 {{ :02_browsing:annotations.png?400 |}}
-Figure 1: annotations for a (sub-)corpus
+Figure 1: Information about a (sub-)corpus
  
-On the right side of the information window, you see which annotations are available to be queried for the selected sub-corpus.
+On the right-hand side of the information window, you see which annotations are available to be queried for the selected sub-corpus.
-    * You have two categories of information: Node Annotations are attributes on either token and message level, that we considered to be basic units. Meta Annotations contain information at the chat level; most of the meta annotations indicate sizes (e.g. total number of messages in a given chat) and were automatically computed.
+    * You have two categories of information: Node Annotations are attributes at either token or message level. Meta Annotations contain information at the chat level; most of the meta annotations indicate sizes (e.g. total number of messages in a given chat) and were automatically computed.
-    * To the right of the name of the annotation, there is an example query for that specific annotation. If you click on that text, a sample query is entered into the query field in the main screen. This is the easiest way to generate queries, since you can always modify it in the query field. Example: if you click on Lang_100_and_more, an example like //node & meta::lang_100_and_more="fra, eng"// is entered into the query field. This query would search for messages in chats with more than 100 messages in French and in English. More precisely: "node" will fetch also all tokens that are in such chats; if you want to distinguish between messages and tokens, you should explicitly query for one or the other: tok & … or msg & ….
+    * To the right of the name of the annotation, there is an example query for that specific annotation. If you click on that text, a sample query is entered into the query field in the main screen. This is the easiest way to generate queries, since you can always modify the query in the query field. Example: if you click on Lang_100_and_more, an example like ''node & meta::lang_100_and_more="fra, eng"'' is entered into the query field. This query would search for messages in chats with more than 100 messages in French and in English. More precisely: ''node'' will also fetch all tokens that are in such chats; if you want to distinguish between messages and tokens, you should explicitly query for one or the other: ''tok & …'' or ''msg & …''.
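
The query pattern described in the last list item can also be put together outside the ANNIS interface, for instance when preparing several variants to paste into the query field. The sketch below only assembles the query text; it does not talk to ANNIS. The function name ''build_meta_query'' and the restriction to the three units ''node'', ''tok'' and ''msg'' are assumptions made for illustration, based on the example query and the units named above.

```python
# Units named in the text: "node" matches messages and tokens alike,
# "tok" only tokens, "msg" only messages.
ALLOWED_UNITS = ("node", "tok", "msg")


def build_meta_query(unit, meta_name, value):
    """Return a query string of the form: <unit> & meta::<name>="<value>".

    Hypothetical helper: it only formats text and does not check that
    the meta annotation exists in the selected sub-corpus.
    """
    if unit not in ALLOWED_UNITS:
        raise ValueError(f"unit must be one of {ALLOWED_UNITS}, got {unit!r}")
    return f'{unit} & meta::{meta_name}="{value}"'


if __name__ == "__main__":
    # Reproduces the sample query quoted in the text.
    print(build_meta_query("node", "lang_100_and_more", "fra, eng"))
    # Same meta filter, restricted to messages or to tokens.
    print(build_meta_query("msg", "lang_100_and_more", "fra, eng"))
    print(build_meta_query("tok", "lang_100_and_more", "fra, eng"))
```

Keeping the unit as an explicit argument makes the difference between ''node'', ''tok'' and ''msg'' visible in every query that is generated.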
  
 ===== List of chats in the sub-corpus =====
-By clicking on the little piece of paper next to the information <i> in the list of sub-corpora, you get a list of all chats in this specific sub-corpus.
+By clicking on the little piece of paper next to the information ''i'' in the list of sub-corpora, you get a list of all chats in the respective sub-corpus.
 
-From here, you can click on //complete chat view// to view this whole chat (without any annotations). once in this list of messages, you can alway click on an individual message ID to see that message with annotations.
+From here, you can click on ''complete chat view'' to view the whole chat (without any annotations). Once in this list of messages, you can always click on an individual message ID to see that message with its annotations.
 
-If you click on the little <i> at the very right of the list of chats, you see all the meta information about this chat.
+If you click on the little ''i'' at the far right of the list of chats, you see all the meta information about the respective chat.
  
02_browsing/01_sub_corpora.1578326139.txt.gz · Last modified: 2022/06/27 09:21 (external edit)
