

Querying the Syntactic Reference Corpus of Medieval French (SRCMF) with TIGERSearch

Achim Stein

under construction, version of March 21, 2012

Abstract:

TIGERSearch is a software tool for querying syntactically annotated text corpora (tree graphs). It was developed at the Institut für Maschinelle Sprachverarbeitung (IMS) at the University of Stuttgart (Lezius, 2002). It includes a complete manual, which, however, focuses mostly on phrase structure grammars. This document is a guide to querying corpora annotated with the SRCMF dependency model.

For more information about the corpus go to the SRCMF homepage.


1 Introduction

1.1 Installation of TIGERSearch corpora

Case 1: Installation using TIGERRegistry

If you have received the corpus in the TIGER XML format, you must use TIGERRegistry to install it. The TIGERSearch manual explains how to do this. If the XML file is gzip-compressed (*.xml.gz), you do not have to uncompress it.

Case 2: Copy a registered corpus

If you receive the corpus in a pre-registered format (normally as a compressed archive, e.g. a *.zip file), unpack this archive and move the resulting directory into the directory TIGERCorpora of your TIGERSearch installation (on Windows systems, the default is C:\TIGERSearch\TIGERCorpora).

If TIGERSearch is running, you may have to refresh the corpus tree in TIGERSearch to see the new corpus.

HINT: You may be able to infer much of the TIGERSearch (TS) query language from the examples given in this document. If not, you should study chapter IV of the TS manual in order to get acquainted with the basics of the query language. The TS manual is included in the tool (click on the Help icon); PDF and HTML versions can be found in the doc subfolder of the TIGERSearch installation folder.

1.2 The SRCMF dependency model

You should also consult the description of the SRCMF dependency model in order to get acquainted with the syntactic categories. This document will focus on queries for a limited number of linguistic questions, without however providing detailed explanations of the categories as such.

Please keep in mind that the general strategy here is to use the syntactic annotation, i.e. the tree structure, as much as possible, and to rely as little as possible on the word-level annotation, i.e. part of speech categories and lemmata. This is because the syntactic annotation has been introduced and verified manually, whereas the word-level annotation is less reliable.

Therefore, in some cases you may find that you can formulate more straightforward queries using word-level information, but if you do so be aware of the risk of retrieving erroneous occurrences or, which is more serious, of overlooking occurrences due to annotation errors.

1.3 Dependency graphs vs TIGER graphs

The following figure shows a typical example of a dependency structure: the dependency between words is expressed by a labelled edge, whose label expresses the type of the relation (fig. 1):

Figure 1: A dependency structure

The dependency annotation that can be represented in TIGER XML does not reflect all the details of the original corpus annotation. The TIGER XML format we refer to has been produced by the export function of the NotaBene annotation tool (Mazziotta, 2010a,b). Since the current specification of TIGER XML requires words to be terminal nodes, the dependency structure is represented as follows (see fig. 2):

Figure 2: SRCMF dependency structures in TIGER


2 Basics of the TIGERSearch query language

Each node of the graph is an expression between square brackets. This expression contains one or more attribute-value pairs, e.g. cat=Obj for the category Obj or, at the word level, pos=NOMcom for the part of speech tag. Two nodes can be combined by an operator, e.g. > for dominance in the syntactic structure, or . (a dot) for precedence at word level. TS ignores comments introduced by double slashes (//).

To find out which attributes are available in your corpus, you may consult the corpus information in the left part of the window, or use the preview function, symbolized by the magnifying glass among the top icons, and hold the mouse over a node or over a terminal symbol. This will display the information available at that node for a few seconds.

The values of the attributes have to be enclosed either by quotes (for fully specified values) or by slashes (for regular expressions). The following query (at word level) shows how to use attribute value structures, the precedence operator and comments:

// forms starting with "gran" followed by a noun
[word=/gran.*/] . [pos=/NOM.*/]
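
The dominance operator is used in the same way. The following minimal sketch combines two fully specified (quoted) values, using only the attribute values mentioned above; it assumes that the noun is attached directly under the Obj node in your corpus:

```
// object nodes directly dominating a common noun
[cat="Obj"] > [pos="NOMcom"]
```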

Since an operator combines exactly two node specifications, a series of three nodes A, B, C has to be split into two separate expressions using the `and' symbol &, e.g. for the precedence relation: A . B & B . C

Furthermore, variables have to be introduced if both nodes B are meant to refer to the same word (which is normally intended). A variable is introduced by attaching its name to a node specification, as in #name:[ ], and is referred to later by its name alone, as in #name. The following query finds word forms starting with dist, followed by a determiner (DET...) and a noun (NOM...). The second expression re-uses the node labelled #det:

[word=/dist.*/] . #det:[pos=/DET.*/]
& #det . [pos=/NOM.*/]
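
For contrast, here is a sketch of the same query without the variable. It is given only to illustrate the difference and is probably not what you want: the two determiner specifications are independent of each other, so they may match two different words of the same sentence.

```
// no shared variable: the determiner following the "dist" form and
// the determiner preceding the noun need not be the same word
[word=/dist.*/] . [pos=/DET.*/]
& [pos=/DET.*/] . [pos=/NOM.*/]
```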

Apart from referring to previous expressions, the variables are also necessary to address the node information in the statistics window.


3 TIGERSearch queries for SRCMF annotation


3.1 The position of adjectives in the noun phrase

The following query retrieves nominal phrases Nom governing a modifier ModA. The precedence relation (line 4 or 5, comment out as needed) has to bear on the terminal nodes, which are attached by L-edges (lines 2-3):

#nom:[type="nV"] > #moda:[cat="ModA"]
& #moda >L #adj:[]
& #nom >L #n:[]
// & #adj . #n  // modifier precedes noun
& #n . #adj     // modifier follows noun

HINT: A node specification may remain empty. This is faster than specifying an arbitrary feature value, e.g. [word=/.*/]. Note, however, that in the statistics window the feature of an unspecified node will not appear automatically when you click on the 'Default' button. The following variant specifies word=/.*/ on both terminal nodes, so that the word feature is available there:

#nom:[cat="Nom"] > #moda:[cat="ModA"]
& #moda >L #adj:[word=/.*/]
& #nom >L #n:[word=/.*/]
// & #adj . #n  // modifier precedes noun
& #n . #adj     // modifier follows noun
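
The same pattern can be narrowed down to particular word forms. The following sketch re-uses the regular expression from section 2 to restrict the modifier to forms starting with "gran"; the restriction is purely illustrative and can be replaced by any other expression:

```
// forms starting with "gran" directly preceding the noun
#nom:[cat="Nom"] > #moda:[cat="ModA"]
& #moda >L #adj:[word=/gran.*/]
& #nom >L #n:[]
& #adj . #n     // modifier precedes noun
```

Since the restriction bears on the word form itself and not on the part of speech or the lemma, it does not depend on the word-level annotation.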


3.2 Order of categories

Verb-second (V2): The following query attempts to retrieve verb-second contexts. It is probably not complete, but it can show some important strategies for TS queries. First of all, precedence is hard to specify at phrase level: something like [cat="Circ"].[cat="VFin"] will not produce the intended result.

A safe solution to this problem is to specify precedence only at terminal level (for words). We therefore have to determine the boundaries of a category at word level. To do this, we use the >@l and >@r operators, which link a node to its leftmost or rightmost terminal nodes.
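
As a minimal illustration of these two operators, the following sketch retrieves the first and the last word of each object. The variable names #obj_init and #obj_end are arbitrary:

```
// boundaries of the category Obj at word level
#obj:[cat="Obj"] >@l #obj_init:[]   // leftmost terminal node (first word)
& #obj >@r #obj_end:[]              // rightmost terminal node (last word)
```

If the object consists of a single word, both variables refer to the same terminal node.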

Solution 1: top-down. In the following query, we search for a sentence-initial category node #init_cat [line 4] and define its boundaries [lines 5-6]: #init_cat directly precedes the verb if its rightmost terminal node (its last word) directly precedes the verb [line 7]. We use the >@l operator again to ensure that the leftmost node of #init_cat is also the leftmost node of VFin, and thus exclude further preceding categories [line 8]:

// V2 with verb initial phrases except SjPer, Apst, Ng
#vfin:[cat="Snt" & type="VFin"]  // In a main clause..
& #vfin >L #v:[] // ...get the verb node...
& #vfin > #init_cat:[cat!=/(SjPer|Apst|Ng)/]  //... exclude unwanted initial categories
& #init_cat >@l  #init_cat_init:[]  // define the border nodes of #init_cat
& #init_cat >@r  #init_cat_end:[]
& #init_cat_end . #v  // The last word of #init_cat precedes the verb
& #vfin >@l  #init_cat_init  // The first node of VFin is the left border of #init_cat

The previous query is an example of a top-down approach: it specifies the nodes under VFin and goes down to the word level. However, it misses occurrences where the initial category is discontinuous, i.e. only part of the category precedes the verb, as in:

Mais Dex plevis ma loiauté Qui sor mon cors mete flaele S' onques fors cil qui m' ot pucele Out m' amistié encor nul jor.

To include these cases, it is safer to proceed bottom-up, i.e. determine the word which precedes the verb (here: Dex), and go up to its topmost dominating category under VFin.
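
To get an idea of how many categories are affected, discontinuous categories can also be searched for directly. The following sketch assumes that the graph predicate discontinuous(), as described in the TS manual, is available in your version of TS; it has not been tested on the SRCMF corpus:

```
// categories directly under VFin whose words do not form a continuous string
#vfin:[cat="Snt" & type="VFin"] > #init_cat:[cat!=/(SjPer|Apst|Ng)/]
& discontinuous(#init_cat)
```

Note that this query does not say anything about the position of the category relative to the verb.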

Solution 2: bottom-up. The following query proceeds in three steps: (a) specify the path from Snt over VFin to the verb, lines 2-3; (b) determine the word which precedes the verb, line 4; and (c) specify the path up to the highest category #init_cat which dominates it under VFin, excluding the categories for subjects, parenthesis and negation, which are not relevant for V2 contexts, lines 5-6. Finally, we ensure that there is no preceding category by stating that the leftmost node of VFin, i.e. its first word, is also the leftmost node of #init_cat, lines 7-8. (Line numbers include the initial comment line.) Note that this does not exclude sentence-initial conjunctions, like Et..., since they are governed by Snt.

// V2 with verb initial phrases except SjPer, Apst, Ng
#vfin:[cat="Snt" & type="VFin"]  // In a main clause..
& #vfin >L #v:[] // ...get the verb node...
& #before_v:[] . #v  // ...then the word preceding the verb...
& #init_cat:[cat!=/(SjPer|Apst|Ng)/] >* #before_v  //...and its category...
& #vfin > #init_cat  // ...directly under VFin (#init_cat).
& #vfin >@l  #vfin_init:[]  // VFin-initial, if the leftmost node of VFin...
& #init_cat >@l  #vfin_init   // ...is also the leftmost node of #init_cat

Note that in a sample of 3,300 sentences, the bottom-up solution gives us 90 more occurrences of V2 than the top-down solution.
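
The same boundary technique also covers the simplest case, in which nothing precedes the verb. The following sketch for verb-initial clauses is an illustration built by analogy with the queries above, not a tested query: it only requires that the verb itself is the leftmost terminal node of VFin.

```
// V1: the verb is the first word under VFin
#vfin:[cat="Snt" & type="VFin"]  // In a main clause..
& #vfin >L #v:[]                 // ...get the verb node...
& #vfin >@l #v                   // ...which is also the leftmost node of VFin
```

Clauses in which a negation or a clitic pronoun precedes the verb are not retrieved by this sketch.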

V3 structures are often said to be counter-evidence for the V2 hypothesis. The following query searches for any kind of sentence-initial phrase (XP: anything except Apst), followed by the subject in second position and the verb in third position. First, we specify the phrases which depend on the verb node VFin. Then the >@l and >@r operators are used to determine phrase boundaries, and these boundaries are used to define the linear precedence of the phrases. Again, the last statement ensures that XP is the first phrase:

// V3  XP-Subject-Verb
#vfin:[cat="Snt" & type="VFin"]  // In a main clause..
& #vfin >L #v:[] 
& #vfin > #subject:[cat="SjPer"]
& #vfin > #init_cat:[cat!="Apst"]  // exclude Apst
// determine the left and right phrase boundaries
& #subject >@l  #subject_init:[]  // first word of Subject
& #subject >@r  #subject_end:[]  // last word of Subject...
& #init_cat >@l  #init_cat_init:[]
& #init_cat >@r  #init_cat_end:[]
// define the precedence using the boundaries
& #subject_end . #v
& #init_cat_end . #subject_init 
& #vfin >@l #init_cat_init

A very similar query can be used to search for verb-final structures. The following query retrieves sentences with the order SOV. Here, category boundaries are specified for both the subject and the object:

// Subject-Object-Verb final
#vfin:[cat="Snt" & type="VFin"]  // In a main clause..
& #vfin >@r #v:[] 
& #vfin > #subject:[cat="SjPer"]
& #vfin > #object:[cat="Obj"]
// determine the left and right phrase boundaries
& #subject >@l  #subject_init:[pos!=/PROper.*/]  // first word of Subject
& #subject >@r  #subject_end:[]  // last word of Subject...
& #object >@l  #object_init:[pos!=/PROper.*/]
& #object >@r  #object_end:[]
// define the precedence using the boundaries
& #subject_end . #object_init 
& #object_end . #v

Compare with a query for the MCVF corpus, which uses phrase structure categories (IP, PP, NP):

// V2 after 1-word or constituent
#ip:[cat="IP"]
& ( #ip > #pos1:[pos!=/(PON|NEG)/]
     | #ip > #xp1:[cat="PP"] & #xp1 >@r #pos1:[word!=/[\*\,].*/] )
& #ip > #v:[pos=/(V.*)/]
& #ip >SBJ #sbj:[cat="NP"] 
& #pos1 .* #v // leave room for clitics
& #v .* #sbj

Bibliography

Lezius 2002 LEZIUS, Wolfgang:
Ein Suchwerkzeug für syntaktisch annotierte Textkorpora (German).
Stuttgart : Institut für Maschinelle Sprachverarbeitung (IMS), 2002
(University of Stuttgart Arbeitspapiere des Instituts für Maschinelle Sprachverarbeitung (AIMS), vol. 8, no. 4)

Mazziotta 2010a MAZZIOTTA, Nicolas:
Building the 'Syntactic Reference Corpus of Medieval French' Using NotaBene RDF Annotation Tool.
In: Proceedings of the 4th Linguistic Annotation Workshop (LAW IV), URL www.aclweb.org/anthology/W/W10/W10-1820.pdf, 2010

Mazziotta 2010b MAZZIOTTA, Nicolas:
Logiciel NotaBene pour l'annotation linguistique. Annotations et conceptualisations multiples.
In: Recherches qualitatives. Hors-série 9 (2010), pp. 83-94
