Achim Stein
under construction, version of March 21, 2012
For more information about the corpus go to the SRCMF homepage.
Case 1: Installation using TIGERRegistry
If you have received the corpus in the TIGER XML format, you must use TIGERRegistry to install it. The TIGERSearch manual explains how to do this. If the XML file is in compressed gzip-format (*.xml.gz) you do not have to uncompress it.
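Before registering a large file, it can be useful to check that it is readable. The following is a minimal Python sketch, not part of TIGERSearch: it assumes that sentences are stored as <s> elements without an XML namespace, and the function name count_sentences is hypothetical. It reads the gzip-compressed file directly, without unpacking it to disk.

```python
import gzip
import xml.etree.ElementTree as ET


def count_sentences(path):
    """Count <s> elements in a TIGER XML file, plain or gzip-compressed."""
    # Assumption: a name ending in .gz means the file is gzip-compressed
    opener = gzip.open if path.endswith(".gz") else open
    count = 0
    with opener(path, "rb") as handle:
        # iterparse streams the file, so the corpus is not loaded all at once
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == "s":
                count += 1
                elem.clear()  # drop the content of sentences already counted
    return count
```

Calling count_sentences on the *.xml.gz file should return the number of sentences; a parse error at this stage suggests that the file is damaged or incomplete.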
Case 2: Copy a registered corpus
If you receive the corpus in a pre-registered format (normally as a
compressed archive, e.g. a *.zip file), unpack this archive and move
the resulting directory into the directory TIGERCorpora of your
TIGERSearch installation (on Windows systems, the default is
C:\TIGERSearch\TIGERCorpora).
If TIGERSearch is running, you may have to refresh the corpus tree in TIGERSearch to see the new corpus.
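The copy step can also be scripted. The sketch below is only an illustration under stated assumptions: the archive is a *.zip file, and the function name install_registered_corpus and its arguments are hypothetical. It unpacks the archive into an existing TIGERCorpora directory and reports the top-level names it contained.

```python
import zipfile
from pathlib import Path


def install_registered_corpus(archive, tiger_corpora_dir):
    """Unpack a pre-registered corpus archive into the TIGERCorpora directory."""
    target = Path(tiger_corpora_dir)
    if not target.is_dir():
        raise FileNotFoundError(f"TIGERCorpora directory not found: {target}")
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
        # Top-level names in the archive, normally the single corpus directory
        return sorted({name.split("/")[0] for name in zf.namelist()})
```

On a default Windows installation the second argument would be the TIGERCorpora path mentioned above.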
HINT: You may be able to infer much of the TS query language from the examples given in this document. If not, you should study chapter IV of the TS manual in order to get acquainted with the basics of the query language. The TS manual is included in the tool (click on the Help icon); PDF and HTML versions can be found in the doc subfolder in the TIGERSearch installation folder.
You should also consult the description of the SRCMF dependency model in order to get acquainted with the syntactic categories. This document will focus on queries for a limited number of linguistic questions, without however providing detailed explanations of the categories as such.
Please keep in mind that the general strategy here is to use the syntactic annotation, i.e. the tree structure, as much as possible, and to rely as little as possible on the word-level annotation, i.e. part of speech categories and lemmata. This is because the syntactic annotation has been introduced and verified manually, whereas the word-level annotation has not.
Therefore, in some cases you may find that you can formulate more straightforward queries using word-level information, but if you do so be aware of the risk of retrieving erroneous occurrences or, which is more serious, of overlooking occurrences due to annotation errors.
The following figure is a typical example of a dependency structure: the dependency between words is expressed by a labelled edge, where the label expresses the type of the relation (fig. 1):
The dependency annotation that can be represented in TIGER XML does not reflect all the details of the original corpus annotation. The TIGER XML format we refer to has been produced by the export function of the Notabene annotation tool (Mazziotta, 2010a,b). Since the current specification of TIGER XML requires words to be terminal nodes, the dependency structure is represented as follows (see fig. 2):
Each node of the graph is an expression between square brackets. This
expression contains one or more attribute-value pairs, e.g.
cat=Obj for the category Obj or, at the word level,
pos=NOMcom for the part of speech tag. Two nodes can be
concatenated by an operator, e.g. > for dominance in the
syntactic structure, or by . (a dot) for precedence at word
level. TS will ignore comments introduced by double slashes.
In order to find out about the attributes which are available in your corpus, you may consult the corpus information in the left part of the window, or use the preview function, symbolized by the magnifying glass in the top icons, and hold the mouse over a node, or over a terminal symbol. This will display the information available at that node for a few seconds.
The values of the attributes have to be enclosed either by quotes (for fully specified values) or by slashes (for regular expressions). The following query (at word level) shows how to use attribute value structures, the precedence operator and comments:
// forms starting with "gran" followed by a noun
[word=/gran.*/] . [pos=/NOM.*/]
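To make the logic of this query explicit, here is a small Python sketch that mimics it on a toy sentence. It is not how TS works internally. The sample tokens and tags are illustrative only, and the sketch assumes that a regular expression between slashes has to cover the entire attribute value, which is why the pattern ends in .* rather than being just gran.

```python
import re

# Toy sentence as (word, pos) pairs; the tags are illustrative only
tokens = [("une", "DETndf"), ("grant", "ADJqua"), ("joie", "NOMcom"), ("out", "VERcjg")]


def value_matches(pattern, value):
    # Assumption: a /.../ pattern must match the whole attribute value
    return re.fullmatch(pattern, value) is not None


def adjacent_pairs(tokens, word_pattern, pos_pattern):
    """Pairs (word, next word) where the word and the next token's pos match."""
    hits = []
    for (word, _pos), (next_word, next_pos) in zip(tokens, tokens[1:]):
        if value_matches(word_pattern, word) and value_matches(pos_pattern, next_pos):
            hits.append((word, next_word))
    return hits


print(adjacent_pairs(tokens, r"gran.*", r"NOM.*"))  # [('grant', 'joie')]
```

The loop over neighbouring tokens corresponds to the precedence operator (the dot): only a word immediately followed by a matching token counts as a hit.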
Since an operator combines exactly two node specifications, a series
of three nodes A, B, C has to be split into two separate
expressions using the 'and' symbol &, e.g. for the precedence
relation: A . B & B . C
Furthermore, variables have to be introduced if both nodes B
are meant to refer to the same word (which is normally intended).
Variables are introduced by attaching their name to a node specification:
#name:[ ]
They can then be referred to by their name alone:
#name
The following query finds forms starting with dist,
followed by a determiner (DET:...) and a noun (NOM).
The second expression re-uses the node labelled #det:
[word=/dist.*/] . #det:[pos=/DET.*/] & #det . [pos=/NOM.*/]
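Why the variable matters can be shown with a toy example in Python. This is a sketch for illustration, with made-up tokens and tags: the first list treats the determiner as one shared position (like #det), the second treats the two determiner specifications as independent nodes.

```python
# Toy sentence as (word, pos) pairs; the tags are illustrative only
tokens = [("dist", "VERcjg"), ("li", "DETdef"), ("rois", "NOMcom"),
          ("a", "PRE"), ("la", "DETdef"), ("reine", "NOMcom")]

verbs = [i for i, t in enumerate(tokens) if t[0].startswith("dist")]
dets = [i for i, t in enumerate(tokens) if t[1].startswith("DET")]
nouns = [i for i, t in enumerate(tokens) if t[1].startswith("NOM")]

# With a variable: one determiner position d takes part in both relations
bound = [(v, d, n) for v in verbs for d in dets for n in nouns
         if d == v + 1 and n == d + 1]

# Without a variable: d1 and d2 may be two different determiners
unbound = [(v, d1, d2, n) for v in verbs for d1 in dets for d2 in dets for n in nouns
           if d1 == v + 1 and n == d2 + 1]

print(bound)    # [(0, 1, 2)]
print(unbound)  # [(0, 1, 1, 2), (0, 1, 4, 5)]
```

The second result of the unbound version pairs the verb with a determiner-noun sequence further to the right, which is not the intended verb-determiner-noun sequence.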
Apart from referring to previous expressions, the variables are also necessary to address the node information in the statistics window.
The following query retrieves nominal phrases Nom governing a modifier ModA. The precedence relation (line 4 or 5, comment out as needed) has to bear on the terminal nodes, which are attached by L-edges (lines 2-3):
#nom:[type="nV"] > #moda:[cat="ModA"]
& #moda >L #adj:[]
& #nom >L #n:[]
// & #adj . #n // modifier precedes noun
& #n . #adj // modifier follows noun
HINT: A node specification may remain empty. This is faster than
specifying an arbitrary feature value, e.g. [word=/.*/]. Just
note that in the statistics window, the feature of an unspecified node
will not appear automatically when you click on the 'Default' button.
#nom:[cat="Nom"] > #moda:[cat="ModA"]
& #moda >L #adj:[word=/.*/]
& #nom >L #n:[word=/.*/]
// & #adj . #n // modifier precedes noun
& #n . #adj // modifier follows noun
Verb-second (V2): The following query attempts to retrieve
verb-second contexts. It is probably not complete, but it can show
some important strategies for TS queries. First of all, precedence is
hard to specify at phrase level: something like
[cat="Circ"].[cat="VFin"] will not produce the intended result.
A safe solution to this problem is to specify precedence only at
terminal level (for words). We therefore have to determine the
boundaries of a category at word level. To do this, we use the
>@l and >@r operators, which link a node to its leftmost
or rightmost terminal nodes.
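What the >@l and >@r operators return can be pictured with a short Python sketch. The tree below is a toy structure with simplified, illustrative node names, not an actual SRCMF analysis: each node has the position of its own word and a list of dependent nodes.

```python
words = ["en", "la", "cité", "entra", "li", "rois"]

# node -> (position of its own word, dependent nodes); a toy structure only
tree = {
    "VFin": (3, ["Circ", "SjPer"]),
    "Circ": (2, ["rel", "det1"]),
    "rel": (0, []),
    "det1": (1, []),
    "SjPer": (5, ["det2"]),
    "det2": (4, []),
}


def terminals(node):
    """All word positions dominated by a node, in sentence order."""
    own, deps = tree[node]
    positions = [own]
    for dep in deps:
        positions.extend(terminals(dep))
    return sorted(positions)


def left_corner(node):   # counterpart of >@l
    return terminals(node)[0]


def right_corner(node):  # counterpart of >@r
    return terminals(node)[-1]


verb = tree["VFin"][0]
print(right_corner("Circ") + 1 == verb)            # True: Circ ends right before the verb
print(left_corner("VFin") == left_corner("Circ"))  # True: nothing precedes Circ
```

Both conditions are stated at word level, which is the strategy followed in the queries of this section.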
Solution 1: top-down. In the following query, we search
for a sentence-initial category node #init_cat [line 4] and
define its boundaries [lines 5-6]: #init_cat directly precedes
the verb if its rightmost terminal node (its last word) directly
precedes the verb [line 7]. We use the >@l operator again to
ensure that the leftmost node of #init_cat is also the leftmost
node of VFin, and thus exclude further preceding categories:
// V2 with verb initial phrases except SjPer, Apst, Ng
#vfin:[cat="Snt" & type="VFin"] // In a main clause..
& #vfin >L #v:[] // ...get the verb node...
& #vfin > #init_cat:[cat!=/(SjPer|Apst|Ng)/] //... exclude unwanted initial categories
& #init_cat >@l #init_cat_init:[] // define the border nodes of #init_cat
& #init_cat >@r #init_cat_end:[]
& #init_cat_end . #v // The last word of #init_cat precedes the verb
& #vfin >@l #init_cat_init // The first node of VFin is the left border of #init_cat
The previous query is an example of a top-down approach: it specifies the nodes under VFin and goes down to the word level. However, it misses occurrences where the initial category is discontinuous, i.e. only part of the category precedes the verb, as in:
Mais Dex plevis ma loiauté Qui sor mon cors mete flaele S' onques fors cil qui m' ot pucele Out m' amistié encor nul jor.
To include these cases, it is safer to proceed bottom-up, i.e. determine the word which precedes the verb (here: Dex), and go up to its topmost dominating category under VFin.
Solution 2: bottom-up.
In the following query, we do the following: (a) specify the path from
Snt over VFin to the verb, lines 1-2; (b) determine
the word which precedes the verb, line 3, and (c) specify the path up
to the highest category #init_cat which dominates it under
VFin, excluding the categories for subjects, parenthesis and
negation which are not relevant for V2 contexts, lines 4-5. Finally we
ensure that there is no preceding category by stating that the
leftmost node of the VFin, i.e. its first word, is also the
leftmost node of #init_cat, lines 6-7. Note that this does not
exclude sentence-initial conjunctions, like Et..., since they
are governed by Snt.
// V2 with verb initial phrases except SjPer, Apst, Ng
#vfin:[cat="Snt" & type="VFin"] // In a main clause..
& #vfin >L #v:[] // ...get the verb node...
& #before_v:[] . #v // ...then the word preceding the verb...
& #init_cat:[cat!=/(SjPer|Apst|Ng)/] >* #before_v //...and its category...
& #vfin > #init_cat // ...directly under VFin (#init_cat).
& #vfin >@l #vfin_init:[] // VFin-initial, if the leftmost node of VFin...
& #init_cat >@l #vfin_init // ...is also the leftmost node of #init_cat
Note that in a sample of 3,300 sentences, the bottom-up solution gives us 90 more occurrences of V2 than the top-down solution.
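The difference between the two solutions can be reproduced on a toy structure. The following Python sketch is an illustration only: the node names are made up, the category filter is left out, and the initial category X is discontinuous because one of its dependents follows the verb.

```python
# node -> (position of its own word, dependent nodes); a toy structure only
tree = {
    "VFin": (1, ["X", "Obj"]),  # the verb is at position 1
    "X": (0, ["RelClause"]),    # the initial word is at position 0 ...
    "RelClause": (4, []),       # ... but one of its dependents follows the verb
    "Obj": (3, ["det"]),
    "det": (2, []),
}


def terminals(node):
    """All word positions dominated by a node, in sentence order."""
    own, deps = tree[node]
    positions = [own]
    for dep in deps:
        positions.extend(terminals(dep))
    return sorted(positions)


def top_down(root):
    # Solution 1: the LAST word of the category directly precedes the verb
    verb = tree[root][0]
    return [c for c in tree[root][1]
            if terminals(c)[-1] + 1 == verb and terminals(c)[0] == terminals(root)[0]]


def bottom_up(root):
    # Solution 2: the category dominates the word directly preceding the verb
    verb = tree[root][0]
    return [c for c in tree[root][1]
            if verb - 1 in terminals(c) and terminals(c)[0] == terminals(root)[0]]


print(top_down("VFin"))   # []
print(bottom_up("VFin"))  # ['X']
```

The top-down test fails because the rightmost word of X lies after the verb, whereas the bottom-up test only requires that X dominates the word before the verb and starts the clause.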
V3 structures are often said to be counter-evidence for the V2
hypothesis. The following query searches for any kind of
sentence-initial phrase (XP: anything except Apst), followed
by the Subject in second position and the verb in third
position. First, we specify the phrases which depend on the verb node
VFin. Then the >@l and >@r operators are used
to determine phrase boundaries, and these boundaries are used to
define the linear precedence of the phrases. Again, the last
statement ensures that XP is the first phrase:
// V3 XP-Subject-Verb
#vfin:[cat="Snt" & type="VFin"] // In a main clause..
& #vfin >L #v:[]
& #vfin > #subject:[cat="SjPer"]
& #vfin > #init_cat:[cat!="Apst"] // exclude Apst
// determine the left and right phrase boundaries
& #subject >@l #subject_init:[] // first word of Subject
& #subject >@r #subject_end:[] // last word of Subject...
& #init_cat >@l #init_cat_init:[]
& #init_cat >@r #init_cat_end:[]
// define the precedence using the boundaries
& #subject_end . #v
& #init_cat_end . #subject_init
& #vfin >@l #init_cat_init
A very similar query can be used to search for verb-final structures. The following query retrieves sentences with the order SOV. Here, category boundaries are specified for both the subject and the object:
// Subject-Object-Verb final
#vfin:[cat="Snt" & type="VFin"] // In a main clause..
& #vfin >@r #v:[]
& #vfin > #subject:[cat="SjPer"]
& #vfin > #object:[cat="Obj"]
// determine the left and right phrase boundaries
& #subject >@l #subject_init:[pos!=/PROper.*/] // first word of Subject
& #subject >@r #subject_end:[] // last word of Subject...
& #object >@l #object_init:[pos!=/PROper.*/]
& #object >@r #object_end:[]
// define the precedence using the boundaries
& #subject_end . #object_init
& #object_end . #v
Compare with an MCVF query:
// V2 after 1-word or constituent
#ip:[cat="IP"]
& ( #ip > #pos1:[pos!=/(PON|NEG)/]
| #ip > #xp1:[cat="PP"] & #xp1 >@r #pos1:[word!=/[\*\,].*/] )
& #ip > #v:[pos=/(V.*)/]
& #ip >SBJ #sbj:[cat="NP"]
& #pos1 .* #v // leave room for clitics
& #v .* #sbj
The translation was initiated by Achim Stein on 2012-03-21