This chapter gives tips on a number of common problems and errors that arise when using CorpusSearch. The reader is assumed to have a general familiarity with the rest of the CorpusSearch manual. Many of the example queries assume a standard definition file containing definitions for at least finite_verb and non_finite_verb.
The following are useful definitions to include in a definition file for the Middle and Early Modern English corpora:
finite_verb: *MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD
non_finite_verb: *VB|V*N|*HV|H*N|*DO|D*N|*BE|BEN
non-pronominal_NP: *N*|D*|Q*|ADJ*|CONJ*|*ONE*|*OTHER*|CP*
A common error is to forget to use the define command to specify the definition file when using definitions. No error message will be generated, but the search will result in no output.
Be liberal in using *. Using NP-SBJ as a search term will only find a subset of subjects. Some subjects are resumptive (NP-SBJ-RSP); some are coindexed with a clause, or with a trace in a lower clause (NP-SBJ-1); some may carry other additional labels. Using NP-SBJ* will find all the subjects labelled in this way, no matter what might be added on to the end of the label. In general, only leave off the * if you are sure you don't want it.
When you want to refer to all the labels referred to by, for instance, ADVP*, except one, you have to use a list and list all the options you are interested in, as for instance ADVP|ADVP-LOC|ADVP-TMP (this omits ADVP-DIR which would be included in ADVP*). This is what definition files are for; you only have to write the complex disjunction once.
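The difference between a wildcard and an explicit list can be sketched in Python. This is only an illustration: Python's fnmatch module stands in for CorpusSearch's own label matching, and the helper name label_matches is hypothetical.

```python
from fnmatch import fnmatchcase


def label_matches(label, term):
    """Return True if label matches any |-separated alternative in term.

    fnmatchcase is used as a stand-in for CorpusSearch's wildcard matching.
    """
    return any(fnmatchcase(label, alternative) for alternative in term.split("|"))


labels = ["ADVP", "ADVP-LOC", "ADVP-TMP", "ADVP-DIR"]

# The wildcard picks up every label that starts with ADVP.
wildcard_hits = [label for label in labels if label_matches(label, "ADVP*")]

# The explicit list picks up only the labels that are spelled out.
list_hits = [label for label in labels if label_matches(label, "ADVP|ADVP-LOC|ADVP-TMP")]

print(wildcard_hits)
print(list_hits)
```

In this sketch the explicit list leaves out ADVP-DIR, while the wildcard includes it.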
Note that if you want to refer to an actual * in a search (all traces start with *), escape it with a backslash \ . The following query finds subjects which dominate traces. The first * in \** is escaped and thus refers to an actual *, while the second is not and thus matches anything that follows the *; this will match, for instance, *con*, *exp*, *T-1* and others.
query: (NP-SBJ* iDoms \**)
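The behaviour of an escaped and an unescaped * can be illustrated with a small Python sketch. It translates a search term into a regular expression; this is an assumed model of the matching rules described above, not CorpusSearch's actual implementation, and the function names are hypothetical.

```python
import re


def term_to_regex(term):
    """Translate a search term into a compiled regex (illustrative model only).

    A backslash makes the next character literal; a bare * matches anything.
    """
    parts = []
    i = 0
    while i < len(term):
        ch = term[i]
        if ch == "\\" and i + 1 < len(term):
            parts.append(re.escape(term[i + 1]))  # escaped character is literal
            i += 2
        elif ch == "*":
            parts.append(".*")  # unescaped star matches any string
            i += 1
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


def matches(text, term):
    """Return True if the whole of text matches the search term."""
    return term_to_regex(term).fullmatch(text) is not None


trace_term = r"\**"  # a literal * followed by anything
for text in ["*con*", "*exp*", "*T-1*", "Arthur"]:
    print(text, matches(text, trace_term))
```

Under this model the three traces match and the ordinary word does not.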
A common error is to overuse the exists function. Using a search term forces that term to exist in any hit; it is not necessary to specify this separately. Thus the following is an inefficient query, although it is not ill-formed.
query: ((NP-SBJ* exists) AND (IP* iDoms NP-SBJ*))
The second clause of the query alone will accomplish the same thing more quickly and efficiently.
Same instance works by literal string match of arguments to functions. Thus NP-SBJ does not match NP-SBJ*, and MD|VBD does not match VBD|MD; that is, in neither case would same instance be invoked between the two terms.
When two search arguments do match, they are forced to apply to the same node. Thus two uses of NP-OB* will require that, if for instance, NP-OB2 is found as an instance of the first NP-OB*, then the next use of NP-OB* will also apply to the same NP-OB2 (not, for instance, an NP-OB1 which may also be in the vicinity).
When two search terms do not match but might refer to the same node, as for instance, NP-SBJ and NP-SBJ*, or MD|VBD and VBD|MD, same instance is not forced, but neither is it ruled out; that is, the two label strings in the query may or may not wind up referring to the same node in the corpus.
In order to force non-same instance, use index numbers. [1]NP-SBJ* and [2]NP-SBJ* cannot apply to the same NP-SBJ* node.
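The three cases just described (forced same instance, forced non-same instance, and neither) can be summarized in a short Python sketch. It is a model of the rules as stated in this section, not of CorpusSearch's internals; the function name relation is hypothetical, and the treatment of an indexed argument paired with an unindexed one is an assumption.

```python
import re

# An indexed argument looks like [1]NP-SBJ*: an index number, then the label string.
INDEXED = re.compile(r"\[(\d+)\](.*)")


def relation(arg_a, arg_b):
    """Classify how two query arguments are tied together (illustrative model)."""
    if arg_a == arg_b:
        # Identical strings: same instance is forced.
        return "same node"
    match_a, match_b = INDEXED.fullmatch(arg_a), INDEXED.fullmatch(arg_b)
    if (match_a and match_b
            and match_a.group(2) == match_b.group(2)
            and match_a.group(1) != match_b.group(1)):
        # Same label string with different index numbers: non-same instance is forced.
        return "different nodes"
    # Anything else (assumed here to include indexed vs. unindexed): no constraint.
    return "unconstrained"


for pair in [("NP-SBJ*", "NP-SBJ*"), ("NP-SBJ", "NP-SBJ*"),
             ("MD|VBD", "VBD|MD"), ("[1]NP-SBJ*", "[2]NP-SBJ*")]:
    print(pair, relation(*pair))
```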
A common error is to forget that impossible (to the linguist) cases of same instance will nonetheless be interpreted this way by CorpusSearch. Thus, for instance, a query such as the following will produce no results:
query: ((NP-SBJ* iDoms PRO) AND (NP-OB1* iDoms PRO))
Although it is impossible for these PROs to refer to the same node, since they are dominated by different nodes, CorpusSearch will assume they do, and consequently will find no matches. To allow a match, distinguish the two with index numbers, as [1]PRO and [2]PRO. Traces and zeros also need to be differentiated, as in the following:
query: ((MD iDoms [1]!\**) AND (VB iDoms [2]!\**))
or
query: ((WNP iDoms [1]0) AND (C iDoms [2]0))
An easier way to accomplish the former is to add traces to the ignore list.
A default "ignore list" is supplied with CorpusSearch. It contains such things as punctuation and various meta labels that are not part of the text. If you want to search for punctuation, for instance, or line breaks, then you must provide your own ignore list which does not include the items you want to be able to access.
Although the ignore list is primarily a way to avoid non-text annotations, linguistic labels can also be added to the ignore list, in which case CorpusSearch will simply act as if they are not there. Thus for instance, if you add NEG to the ignore list, you can find cases in which nothing but possibly negation intervenes between the subject and the finite verb.
add_to_ignore: NEG
query: (NP-SBJ* iPrecedes finite_verb)
This will find the following two sentences:
Arthur loves Guinevere
Arthur ne loves Guinevere
but not:
Arthur madly loves Guinevere
Using the ignore list is also helpful in looking for V2. In many cases, the verb is not technically the second node in the IP because of initial conjunction. Adding CONJ (and possibly some other things, such as INTJ*, and NP-VOC) to the ignore list will solve this problem (or at least reduce it). The query below will find all the following:
The sword desired Lancelot
And the sword desired Lancelot
Gramercy, Arthur, the sword desired Lancelot
add_to_ignore: INTJ*|NP-VOC|CONJ
query: ((IP* iDomsNumber 1 NP-OB*) AND (IP* iDomsNumber 2 finite_verb))
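The effect of the ignore list on position counting can be sketched in Python. Everything here is a simplified assumption for illustration: each clause is a flat list of (label, text) pairs, the labels assigned to the example sentences are guesses at a plausible annotation, the single "," entry stands in for the default ignore list, and fnmatch stands in for CorpusSearch's label matching.

```python
from fnmatch import fnmatchcase

FINITE = "*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD".split("|")
DEFAULT_IGNORE = ","  # assumed stand-in for the default ignore list (punctuation)


def visible_children(children, ignore_term):
    """Drop children whose label matches any |-separated ignore pattern."""
    patterns = ignore_term.split("|")
    return [(label, text) for label, text in children
            if not any(fnmatchcase(label, p) for p in patterns)]


def object_first_verb_second(children, ignore_term):
    """Model of (IP* iDomsNumber 1 NP-OB*) AND (IP* iDomsNumber 2 finite_verb)."""
    kept = visible_children(children, ignore_term)
    if len(kept) < 2:
        return False
    first, second = kept[0][0], kept[1][0]
    return fnmatchcase(first, "NP-OB*") and any(fnmatchcase(second, p) for p in FINITE)


plain = [("NP-OB1", "The sword"), ("VBD", "desired"), ("NP-SBJ", "Lancelot")]
with_conj = [("CONJ", "And")] + plain
with_voc = [("INTJ", "Gramercy"), (",", ","), ("NP-VOC", "Arthur"), (",", ",")] + plain

extended = DEFAULT_IGNORE + "|INTJ*|NP-VOC|CONJ"
for clause in (plain, with_conj, with_voc):
    print(object_first_verb_second(clause, DEFAULT_IGNORE),
          object_first_verb_second(clause, extended))
```

With only the stand-in default list, just the first clause counts as object-first and verb-second; with the extended list all three do.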
Traces (which all start with * in the PPCME2) are treated as text by CorpusSearch, and thus can be searched for. In order to differentiate the * which means "match anything" from the * that is part of the text of a trace, use \* to refer to the latter. The string \** will match any trace.
In the more common case, in which you want to simply ignore traces, add them to the ignore list as follows:
add_to_ignore: \**
This means that any node that contains a trace will not be found. Thus a query such as (NP* exists) will not find any NPs which contain only traces.
Do not search for non-pronominal NPs with the following query:
(NP* iDoms !PRO)
This will also eliminate cases like Robin and me and he and I, since these contain a PRO. Instead use the non-pronominal_NP definition.
restricting searches to a single IP
CorpusSearch requires that you specify a node boundary within which to search. The node boundary includes everything dominated by the node, no matter how deeply embedded. Thus, if IP* is specified as the node boundary and an IP contains a subordinate clause IP, the contents of the embedded subordinate clause are also within the node. A common error is to write a query such as
query: ((IP* iDomsNumber 1 NP-OB*) AND (finite_verb iPrecedes NP-SBJ*))
with the intent of finding V2 clauses with a topicalized object. The first function looks for IPs which have an object as the first element; the second for a finite verb immediately preceding the subject. This query will, in fact, find V2 clauses with a topicalized object, but it may find some other clauses as well. It will find (if there are any) IPs which contain one clause in which the first element is an object, and another, different clause within the same node boundary in which the finite verb precedes the subject. Either one of these clauses may be the main clause and the other an embedded clause, or they may both be embedded IPs within a dominating IP.
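There are two ways to avoid this error and force all parts of the query to apply within the same IP: one relies on same instance, the other on remove_nodes. To solve the problem the "same instance" way, repeat a search term so that each function is tied to the same node: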
query: (((IP* iDomsNumber 1 NP-OB*) AND (NP-OB* iPrecedes finite_verb)) AND (finite_verb iPrecedes NP-SBJ*))
or alternatively:
query: ((((IP* iDomsNumber 1 NP-OB*) AND (IP* iDoms finite_verb)) AND (IP* iDoms NP-SBJ*)) AND (finite_verb iPrecedes NP-SBJ*))
The repeated instances of NP-OB* in the first example and IP* in the second refer to the same instance of NP-OB* and IP* respectively, thus forcing all parts of the query to apply within the same clause.
To solve our problem the "remove nodes" way, we would first create a file with only single clauses with all embedded nodes removed, by a query such as
remove_nodes: t
query: (IP* iDoms finite_verb)
This query will produce a file in which every token is an IP containing a finite verb with all embedded IPs removed. The following query:
query: ((IP* iDomsNumber 1 NP-OB*) AND (finite_verb iPrecedes NP-SBJ*))
can then be used on the output of the first query and will yield only the cases intended. (But note that this query will not actually produce all V2 clauses with a topicalized object, since many such clauses begin with a conjunction or other introductory word, so that the object is the second element in the IP*; for a solution to this problem, see ignoring certain nodes.)
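The node-boundary problem, and the way removing embedded IPs avoids it, can be sketched in Python on a toy tree. The tree, its labels, the placeholder string RMV:placeholder, and all function names are invented for illustration, and precedence is simplified to adjacency between sister nodes; this is not how CorpusSearch itself is implemented.

```python
from fnmatch import fnmatchcase

FINITE = "*MD|*HVP|*HVD|*DOP|*DOD|*BEP|*BED|*VBP|*VBD".split("|")


def is_finite(label):
    return any(fnmatchcase(label, p) for p in FINITE)


def children(node):
    """A node is (label, [children]); a leaf is (label, "text")."""
    return node[1] if isinstance(node[1], list) else []


def nodes_within(node):
    """Yield the node and everything it dominates, however deeply embedded."""
    yield node
    for child in children(node):
        yield from nodes_within(child)


def verb_before_subject(boundary):
    """True if, anywhere inside the boundary, a finite verb is directly followed by a subject."""
    for node in nodes_within(boundary):
        kids = children(node)
        for left, right in zip(kids, kids[1:]):
            if is_finite(left[0]) and fnmatchcase(right[0], "NP-SBJ*"):
                return True
    return False


def loose_query(ip):
    """Model of ((IP* iDomsNumber 1 NP-OB*) AND (finite_verb iPrecedes NP-SBJ*))."""
    kids = children(ip)
    return bool(kids) and fnmatchcase(kids[0][0], "NP-OB*") and verb_before_subject(ip)


def strip_embedded_ips(node):
    """Copy the node, replacing each embedded IP with a placeholder leaf."""
    new_kids = []
    for child in children(node):
        if fnmatchcase(child[0], "IP*"):
            new_kids.append((child[0], "RMV:placeholder"))  # hypothetical placeholder
        elif isinstance(child[1], list):
            new_kids.append(strip_embedded_ips(child))
        else:
            new_kids.append(child)
    return (node[0], new_kids)


# Object-initial main clause that is NOT verb-second, with a verb-subject embedded clause.
mixed = ("IP-MAT", [("NP-OB1", "the-sword"), ("NP-SBJ", "Arthur"), ("VBD", "gave"),
                    ("CP-ADV", [("IP-SUB", [("VBD", "came"), ("NP-SBJ", "Lancelot")])])])
# A genuine object-initial V2 clause.
true_v2 = ("IP-MAT", [("NP-OB1", "the-sword"), ("VBD", "desired"), ("NP-SBJ", "Lancelot")])

print(loose_query(mixed), loose_query(strip_embedded_ips(mixed)))
print(loose_query(true_v2), loose_query(strip_embedded_ips(true_v2)))
```

In the toy data the first tree is a spurious hit until its embedded IP is removed, while the genuine V2 clause is found either way.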
counting words and remove_nodes
Note that if you have remove_nodes turned on, the string RMV:<rmv_string> counts as text, so you can search for it. It will not, however, be counted as a word when doing word counts (like traces, which are likewise not counted). But if you count the number of words in a node that contains RMV:<rmv_string>, you will, of course, get the wrong answer, since RMV:<rmv_string> replaces a clause full of words. To avoid this result, either don't use remove_nodes when counting, or use a query like the following, which won't count any node containing RMV:<rmv_string>. Nodes containing RMV:<rmv_string> can then be counted separately.
query: (((IP* iDoms NP-OB*) AND (NP-OB* domsWords 3)) AND (NP-OB* doms !RMV:*))
Another way to do this is to add RMV:* to the ignore list and then, as before, count the nodes containing RMV:* separately.
add_to_ignore: RMV:*
query: ((IP* iDoms NP-OB*) AND (NP-OB* domsWords 3))
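Why a word count is unreliable for a node containing a removed clause can be shown with a small Python sketch. The tree representation, the placeholder text RMV:placeholder, and the function names are assumptions made for illustration; the only rules taken from the text above are that traces and RMV strings are not counted as words.

```python
def leaves(node):
    """Yield the text of every leaf below a node; a leaf is (label, "text")."""
    label, content = node
    if isinstance(content, str):
        yield content
    else:
        for child in content:
            yield from leaves(child)


def is_trace(text):
    return text.startswith("*")


def is_removed(text):
    return text.startswith("RMV:")


def count_words(node):
    """Count leaves that are words, skipping traces and removed-node placeholders."""
    return sum(1 for text in leaves(node) if not is_trace(text) and not is_removed(text))


def contains_removed(node):
    """True if a clause was removed somewhere below the node, so the count is too low."""
    return any(is_removed(text) for text in leaves(node))


simple_np = ("NP-OB1", [("D", "the"), ("ADJ", "good"), ("N", "sword")])
np_with_removed_clause = ("NP-OB1", [("D", "the"), ("N", "sword"),
                                     ("CP-REL", [("IP-SUB", "RMV:placeholder")])])

for np in (simple_np, np_with_removed_clause):
    print(count_words(np), contains_removed(np))
```

The second NP reports two words even though the removed relative clause originally contributed more, which is why such nodes are better counted separately.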