02_browsing:04_queries:03_regex
Differences
This shows you the differences between two versions of the page.
02_browsing:04_queries:02_regex [2019/11/08 09:09] – simone | 02_browsing:04_queries:03_regex [2020/04/21 11:24] – simone | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Regular Expressions ====== | + | ====== |
- | Sometimes, you want to query for different | + | In order to search |
+ | |||
+ | In this section we use the following convention: | ||
+ | * Examples for RegEx are in '' | ||
+ | * Whole queries as used in ANNIS are in '' | ||
+ | * Results of queries are in //italic// | ||
+ | * Individual letters are in pointy brackets, e.g. <a> | ||
- | In order to formulate RegEx expressions in ANNIS, you put your query in between slashes. In the example above, the query might look like /// | ||
===== A (very) short introduction to RegEx ===== | ===== A (very) short introduction to RegEx ===== | ||
Line 9: | Line 14: | ||
As Wikipedia tells us, RegEx takes a pattern of characters you enter into the search field and looks for matches of these characters in the database. Let us assume that the database to be queried is a string of characters like "the man manually attached the tube in Manchester" | As Wikipedia tells us, RegEx takes a pattern of characters you enter into the search field and looks for matches of these characters in the database. Let us assume that the database to be queried is a string of characters like "the man manually attached the tube in Manchester" | ||
- | However, RegEx also allows you to search for such things as alternatives (//man// or //men//), for word boundaries | + | However, RegEx also allows you to search for such things as alternatives (//man// or //men//), for word boundaries |
==== Case sensitivity ==== | ==== Case sensitivity ==== | ||
- | Your search is case sensitive, i.e. the system does not differentiate between upper and lower case. A query for //MAN// or for //man// or for // | + | Your search is case sensitive, i.e. the system does strictly |
==== Characters, letters and digits==== | ==== Characters, letters and digits==== | ||
- | In RegEx, a character is not the same as a letter. | + | In RegEx, a character is not the same as a letter. |
=== Letters=== | === Letters=== | ||
==Simple== | ==Simple== | ||
- | You can search for every letter or combination thereof in the corpus by just typing it into the search field. | + | You can search for every letter or combination thereof in the corpus by just typing it into the search field. |
Example: | Example: | ||
- | <man> will search for a lowercase <m> followed by a lowercase <a> and a lowercase <n>. | + | The simple query ''" |
Line 29: | Line 34: | ||
Example: | Example: | ||
- | m[aei]n | + | ''/ |
will look for occurrences of either | will look for occurrences of either | ||
- | * man | + | * //man// |
- | * men | + | * //men// |
- | * min | + | * //min// |
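If you want to try the bracket expression outside ANNIS, the following is a minimal sketch in Python. It assumes that Python's ''re'' module behaves like the ANNIS engine for this simple case, which this page does not confirm. The token list is made up for illustration, and ''fullmatch'' is used because a query value is compared against the whole token.

```python
import re

# Hypothetical tokens standing in for a corpus; not taken from this page.
tokens = ["man", "men", "min", "mon", "Man", "mean"]

# [aei] accepts exactly one of the listed letters in that position.
pattern = re.compile(r"m[aei]n")

# fullmatch succeeds only if the entire token fits the pattern.
hits = [t for t in tokens if pattern.fullmatch(t)]
print(hits)
```

Note that //Man// is not found because the search is case sensitive, and //mean// is not found because the brackets stand for a single letter only.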
==Variable letters== | ==Variable letters== | ||
- | If you are looking for any letter, you can use <\w> (Remember as: word character.), i.e. a backslash followed by a <w>. | + | If you are looking for any letter, you can use '' |
Example: | Example: | ||
- | m\wn will look for (among others): | + | ''/ |
- | * mAn | + | * //mAn// |
- | * mBn | + | * //mBn// |
- | * mCn | + | * //mCn// |
- | * man | + | * //man// |
- | * mbn | + | * //mbn// |
- | * mcn | + | * //mcn// |
- | Something similar can be achieved with [a-z] and [A-Z] respectively. Here you look for occurrences of any letter as well, but this time case sensitive. | + | Something similar can be achieved with '' |
- | E.g <m[A-Z]n> | + | E.g ''/ |
- | This search string can also be reduced to e.g. <[m-q]>to find any letter between | + | This search string can also be reduced to e.g. '' |
- | N.B.: <\w> covers all letters from A to z, i.e. uppercase and lowercase. In our corpus, it also includes | + | N.B.: '' |
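The difference between ''\w'' and the letter ranges can be checked with a small Python sketch. This is an illustration only: it shows how Python's ''re'' module treats these expressions, and the ANNIS engine may differ in detail. The candidate strings are invented. In Python, ''\w'' also accepts digits and the underscore, which is worth keeping in mind if you expect letters only.

```python
import re

# Invented candidates; "\u00fc" is the letter ü written as an escape.
candidates = ["man", "mAn", "m\u00fcn", "m1n", "m_n", "m&n", "mqn"]

def matching(regex):
    """Return the candidates that match the expression as a whole."""
    return [c for c in candidates if re.fullmatch(regex, c)]

any_word_char = matching(r"m\wn")   # letters, digits and underscore
lower_only = matching(r"m[a-z]n")   # one lowercase letter from a to z
upper_only = matching(r"m[A-Z]n")   # one uppercase letter from A to Z
m_to_q = matching(r"m[m-q]n")       # one lowercase letter from m to q

print(any_word_char, lower_only, upper_only, m_to_q)
```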
Line 66: | Line 71: | ||
Example: | Example: | ||
- | m.n | + | '' |
will look for (among others): | will look for (among others): | ||
- | mAn | + | * //mAn// |
- | mBn | + | * //mBn// |
- | man | + | * //man// |
- | mbn | + | * //mbn// |
- | m&n | + | * //m&n// |
- | m n | + | * //m n// |
- | m_n | + | * //m_n// |
- | m?n | + | * //m?n// |
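The behaviour of the dot can be sketched in Python as follows, again assuming that Python's ''re'' module is close enough to the ANNIS engine for this basic feature. The candidate strings are made up.

```python
import re

# Invented candidates, including two that should not match.
candidates = ["mAn", "man", "m&n", "m n", "m_n", "m?n", "mn", "maan"]

# The dot stands for exactly one character of any kind (except a newline).
dot_hits = [c for c in candidates if re.fullmatch(r"m.n", c)]
print(dot_hits)
```

//mn// and //maan// are not found, because the dot requires exactly one character between <m> and <n>.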
== Diacritica== | == Diacritica== | ||
- | This corpus is set up so as to recognize umlauts and accented letters as letters in their own right (keep in mind that this is not the case in many other uses of RegEx: in implementations that only know the ASCII alphabet, a <ü> is not considered a letter but rather a boundary). | + | This corpus is set up so as to recognize umlauts and accented letters as letters in their own right (keep in mind that this is not the case in many other uses of RegEx: in implementations that only know the ASCII alphabet, a <ü> is not considered a letter but rather a boundary). | ||
=== Digits=== | === Digits=== | ||
- | Just like <\w> above, you can use <\d> (Remember as: digit) | + | Just like '' |
Example: | Example: | ||
- | n\d | + | ''/ |
will look for (among others): | will look for (among others): | ||
- | n0 | + | //n0// |
- | n1 | + | //n1// |
- | n9 | + | //n9// |
Line 96: | Line 101: | ||
==== Separators==== | ==== Separators==== | ||
=== Individual separating characters=== | === Individual separating characters=== | ||
- | Many different characters can occur in between your letters and digits: | + | Many different characters can occur in between your letters and digits: |
* space | * space | ||
  * comma | * comma | ||
Line 107: | Line 112: | ||
* exclamation mark (!) | * exclamation mark (!) | ||
- | NB: most of these characters do have a special function as well if they appear in a specific position. As you will see below, { } is one of the possible ways to search for repeating characters. Thus, the character { can be recognized as a character in its own right or as a syntactic function depending on its position. The same goes for most of these characters. | + | NB: most of these characters do have a special function as well when they appear in a specific position. As you will see below, { } is one of the possible ways to search for repeating characters. Thus, the character | ||
- | Other separators are reserved by the RegEx syntax. To use them by their ordinary value, | + | Other separators are reserved by the RegEx syntax. To use them by their ordinary value, you have to place a backslash in front of them. Thus, you type in ''/ |
* asterisk (*) | * asterisk (*) | ||
Line 119: | Line 124: | ||
* dollar ($) | * dollar ($) | ||
* caret (^) | * caret (^) | ||
- | Did we forget anything? Quite possibly. Just type in the character you are wondering about on its own. If you get an error, you have to escape | ||
+ | In the very probable case this list is not exhaustive, just type in the character you are wondering about. If you get an error, you have to put a backslash in front of it. | ||
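The effect of the backslash can be illustrated with a short Python sketch. It assumes Python's ''re'' module rather than the ANNIS engine, and the strings are invented. It also shows the "type it in and see whether you get an error" test in Python form.

```python
import re

literal = "a*b"

# Unescaped, the asterisk is a quantifier: zero or more <a>, then <b>.
unescaped = re.fullmatch(r"a*b", literal)

# Escaped, the asterisk is an ordinary character.
escaped = re.fullmatch(r"a\*b", literal)

# A reserved character on its own is rejected as an invalid expression.
try:
    re.compile("*")
    bare_asterisk_ok = True
except re.error:
    bare_asterisk_ok = False

# re.escape puts backslashes in front of reserved characters for you.
text = "1+1=2?"
auto_escaped = re.fullmatch(re.escape(text), text)

print(unescaped, escaped, bare_asterisk_ok, auto_escaped)
```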
===Word boundaries=== | ===Word boundaries=== | ||
- | In ANNIS you can query on different layers. | + | In ANNIS you can query on different layers. |
- | Let us look again at the phrase | + | Let us look again at the sentence |
|the|man|manually|attached|the|tube|in|manchester| | |the|man|manually|attached|the|tube|in|manchester| | ||
Line 132: | Line 139: | ||
|the man manually attached the tube in Manchester| | |the man manually attached the tube in Manchester| | ||
- | Accordingly, | + | Accordingly, |
- | If you query for //man// on the message level, you will find nothing, because ANNIS will search for a whole message that contains only these three characters. In order to actually find the word you are looking for, you have to query for "any characters followed by the string //man// followed by any characters" | + | If you query for //man// on the message level, you will find nothing, because ANNIS will search for a whole message that contains only these three characters. In order to actually find the word you are looking for, you have to query for "any characters |
- | msg=/.*?man.*/ | + | '' |
and will find //man// but also // | and will find //man// but also // | ||
- | If you want to find only //man//, you have to query for the three letters surrounded by boundaries (i.e. spaces, tabs, full stops, commas, newlines etc.). The string for a boundary is //\b//. The query for //man// and only //man// within a message would thus look as follows: | + | If you want to find only //man//, you have to query for the three letters surrounded by boundaries (i.e. spaces, tabs, full stops, commas, newlines etc.). The string for a boundary is '' | ||
- | msg=/ | + | '' |
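The difference between the three kinds of message query can be sketched in Python. This is an approximation under stated assumptions: ''fullmatch'' imitates the whole-message comparison described above, the second message is invented, and the boundary pattern in the sketch is our own Python formulation, not a copy of the ANNIS query.

```python
import re

message = "the man manually attached the tube in Manchester"
other = "he manually attached the tube"  # invented second message

# The bare word matches only a message consisting of exactly "man".
bare = re.fullmatch(r"man", message)

# Any characters, then "man", then any characters: also finds "manually".
loose = [m for m in (message, other) if re.fullmatch(r".*?man.*", m)]

# With \b on both sides, "man" must stand alone as a word.
strict = [m for m in (message, other) if re.fullmatch(r".*\bman\b.*", m)]

print(bare, loose, strict)
```

The second message is found by the loose pattern because of //manually//, but not by the strict one.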
Line 149: | Line 156: | ||
====Quantifiers==== | ====Quantifiers==== | ||
- | Sometimes you might be looking for an expression which can be written with or without repeating letters. E.g. you might want to look for //hallo, haaallo, halooooo// | + | Sometimes you might be looking for an expression which can be written with or without repeating letters |
- | | + | |
- | * ***** an asterisk means a repetition of 0 or more times | + | |
- | | + | |
- | | + | |
Example: | Example: | ||
- | /h+a+l+o+/ | + | '' |
will find all variants of hallo | will find all variants of hallo | ||
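A minimal Python sketch of the plus sign and the asterisk, assuming Python's ''re'' module and an invented list of spellings:

```python
import re

# Invented spellings; the last two lack the <a>.
variants = ["hallo", "haaallo", "halooooo", "halo", "hllo", "hello"]

# + means one or more repetitions of the preceding letter.
plus_hits = [v for v in variants if re.fullmatch(r"h+a+l+o+", v)]

# * means zero or more repetitions, so the <a> may be missing entirely.
star_hits = [v for v in variants if re.fullmatch(r"h+a*l+o+", v)]

print(plus_hits)
print(star_hits)
```

With the asterisk, //hllo// is found as well, because zero occurrences of <a> are allowed.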
- | |||
- | |||
- | Quantifiers can do much more than this, and can become correspondingly more demanding. The examples given here are called //greedy//; there are also //non-greedy quantifiers// | ||
- | |||
- | Hint: if you find these options too complicated, | ||
==== Alternatives==== | ==== Alternatives==== | ||
- | Above, you have seen that you can search | + | Above, you have seen that you can query for different letters in one spot, e.g. you can search for //man// and //men// with the expression |
- | < | ||
Example: | Example: | ||
- | n(8|acht|ight|uit) | + | '' |
will look for: | will look for: | ||
- | n8 | + | * //n8// |
- | nacht | + | * //nacht// |
- | night | + | * //night// |
- | nuit | + | * //nuit// |
- | </verbatim> | + | |
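The alternatives example can be tried out with the following Python sketch. It assumes Python's ''re'' module; the token list is invented and contains a few strings that should not be found.

```python
import re

# Invented tokens; the last three should not match.
tokens = ["n8", "nacht", "night", "nuit", "n", "noche", "n8t"]

# The bar separates whole alternatives; the parentheses limit its scope
# so that the leading <n> is shared by all of them.
alt_hits = [t for t in tokens if re.fullmatch(r"n(8|acht|ight|uit)", t)]
print(alt_hits)
```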
==== A final word==== | ==== A final word==== | ||
- | What you have read here, is only a selection of the possibilities RegEx offers. To keep things more or less simple for you, we tried to document all the features you are likely to use while omitting everything you probably will not care about. Also, there are different implementations of RegEx in different programs and they support different features | + | What you have read here is only a selection |
02_browsing/04_queries/03_regex.txt · Last modified: 2022/06/27 09:21 by 127.0.0.1