The QWICK Query Language

To specify what you want to retrieve from the corpus, you use a query language. The query language used by QWICK has been designed to be both easy to use and powerful; if you have any comments on it, please email them to qwick@clg.bham.ac.uk.

Unlike some other corpus query tools, QWICK has no graphical interface for building up a query expression. Instead, the query is entered into a text field in the Query Dialogue window. This generally makes entering a query faster, as you simply type it in without excessive use of the mouse, but it also means that the extended functionality has to be expressed in the query text itself.

Query Objects

The thing you will most frequently want to look for is a single word. To do this you just enter the word itself into the query text field. There are a few pitfalls to look out for:

Problem 1: Special Characters

If you have a way to enter special characters (i.e. non-ASCII characters, mostly letters with accents and diacritics) from your keyboard, this should work fine for queries. If you want to look for Phänomen and your keyboard has no `ä' key, you have to enter its ISO name, &auml;, instead. This holds for all characters that are not part of the standard ASCII character set. They all start with an ampersand and end with a semicolon. Some common names you might want to use are `agrave' (à), `auml' (ä), `bquo' (an opening quotation mark) and `equo' (a closing quotation mark).
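
A couple of illustrative queries (the actual words will of course depend on the corpus you are searching, and on which entity names its encoding provides):

Ph&auml;nomen        finds the word Phänomen
Fr&uuml;hst&uuml;ck  finds the word Frühstück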

Problem 2: Quoted Words

If a word you enter contains characters other than `a' to `z' and digits, you will have to put it in quotes. You can use either single quotes (') or double quotes ("), as long as they match. So if you want to look for don't you will need to type "don't", as the apostrophe is a non-alphanumeric character. (Actually, with some corpora (e.g. the BNC) you will have to type do "n't", as the word don't is split into two tokens.)
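
The same applies to other non-alphanumeric characters: hyphenated or apostrophised forms such as "full-time" or "o'clock" need to be quoted (always assuming that your corpus stores them as single tokens; as with don't above, some corpora split such forms into several tokens).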

Wildcards

You can also specify only part of a word, replacing the rest of it with a so-called wildcard. You will have to quote words containing a wildcard, as the wildcard character is not a letter. The character to use is the asterisk (*), and it can appear at any position in the word. Please note that wildcard matching is not very fast in the current version, but its speed will improve in later releases.

Some Examples
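
A few illustrative wildcard queries (the exact matches will of course depend on the corpus you are searching):

"hous*"    finds houses, housing, household and so on
"*ness"    finds words ending in -ness, such as darkness or happiness
"*light*"  finds words with light inside them, such as delightful, enlightened or flashlights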

Compound Queries

Looking at single words might not be enough for what you want to do. There are several other features that allow you to be more specific about what you are asking for:

Word Sequences

If you are looking for a set phrase, you can just enter all the words next to each other. Typing the White House will give you all instances of the American presidential residence which are preceded by the definite article. All queries are case sensitive, so you would not get any other houses that are white. See `alternatives' below for case insensitivity.

If a word sequence contains quoted words, you have to quote each word individually: everything within one pair of quotes is treated as an atomic entity, and a whole quoted phrase will most likely not be found in the index. Also, be careful not to include spaces within quoted words, as they prevent the word from being located in the index.
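
For example, to find the phrase don't know in a corpus like the BNC, where don't is split into two tokens, you would type do "n't" know: only the token containing the apostrophe is quoted, and there are no spaces inside the quotes.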

In word sequences you can also use a wildcard: the * house finds all instances of `the' and `house' separated by exactly one word. More on this can be found below under `Scope'.

Alternatives

Alternatives are specified as a list of items separated by the vertical bar (|). You can search for a list of buildings with house|cottage|hut|castle; there is no limit on the number of alternatives. This can also be used for case or spelling variation: colour|color and House|house|HOUSE might be needed to find everything you want to get.

Scope

Unfortunately, words are not usually well-behaved enough to always occur right next to each other; often other modifiers interfere with the patterns you are looking for. It is therefore often useful to be able to search within more variable spans. This can be done using the span operator: it combines two words (either of which can be a list of alternatives as above) and returns all instances (with the first word being the node word) where the second word is within the specified distance of it. The distance is given separately for the left and the right side, as in house 4:2 red, which allows `red' up to four words to the left or up to two words to the right of `house'. This will find `the red house', `red was the house' and `the house is red', but not `the house has bright red windows', as there `red' is three words to the right of `house' and only two are permitted.

The span can also be zero on one side, if you want the second word to occur on one side of the node word only, as in house 3:0 red, where `red' has to occur before `house'. This expression is equivalent to red 0:3 house, except that in the latter form the node word would be `red', whereas in the former it is `house'.
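
As noted above, either side of the span operator can be a list of alternatives, so something like house|cottage 3:3 red|white|blue should find any of these three colour words within three words to either side of `house' or `cottage', with the building word as the node word.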

Another possibility for expressing scope is to use tags (if the corpus is marked up accordingly). Note: in the current release there are some problems with the processing of tags, so you might sometimes not get the results you expected; it is hoped that this will be fixed in a later version.

For working with tags you use the within operator. It also combines two words, and finds all instances where the two words (or lists of words) occur within the scope of the same tag, as in house within(s) fire. Here the `s' tag marks sentences, which are commonly tagged in this way; the query should get you all occurrences of `house' that have the word `fire' in the same sentence. Please note that the other word might not be visible in the concordance display if it is too far away from the node word; you might have to look at the extended context to see it.

This also works for phrases. If you find it confusing how a query will be interpreted, you can use brackets to make explicit what you are looking for; the computer is rather simple-minded and will not always understand the query the same way you do. For example, house within(s) on fire will combine `house' with the phrase `on fire', as phrases have a higher evaluation precedence than scope operators.
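
If in doubt, bracket explicitly: a query such as (the house) within(s) (on fire) makes it clear that the two phrases `the house' and `on fire' are what has to occur within the same sentence, rather than relying on the precedence rules to group the words for you.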

Tags

Note: This section only applies to corpora which are marked up using SGML or XML codes.

If your corpus is marked up, you can also look for tags. Tags are not visible in the concordance display, and the node word will usually be the lexical item following the tag. Tags are expressed the way they would appear in an SGML marked-up text, namely enclosed in angle brackets. <sic> would find you all instances where the person marking up the data wanted to point out that something does look odd but is not actually a transcription error. You will only be able to see the first word that has been marked, so if more than one word was marked you cannot see which ones they were. This limitation lies solely in the user interface component and will probably be removed in a later release.

Attributes can also be specified when searching. They have to be enclosed in quotes (single or double), so to look for all coordinating conjunctions you would type <w type="CC">. This can be combined with a word form, as in light & <w type='JJ'>, where the logical `and' operator filters out all instances of light which are not adjectives. For a description of the tagset you need to look at the documentation of the corpus you are working with.

In order to make retrieval of tags more flexible, a simple wildcard facility has been implemented: if an attribute value ends with an asterisk, only the part before it has to match. This is particularly useful for part-of-speech tags, where <w type='N*'> will find all nouns, no matter whether they are singular, plural or proper names.
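
The attribute wildcard can be combined with the other features described above. For instance, assuming a tagset in which verb tags begin with `V' (just as the noun tags above begin with `N'), "burn*" & <w type='V*'> should find verbal forms such as burns, burning and burnt, while filtering out nouns like burner that the word wildcard alone would also match.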

Unlike lexical words, tags are currently not indexed. This means that retrieval will be slightly slower than for words, especially if you are looking for tags which occur very frequently. This might change in a future release in order to speed up retrieval.

Default Scope

If you ever look for the word `British' in the BNC, you will get an awful lot of hits which are references to the project that set up the corpus itself. This is because text header information is treated in the same way as the text proper. The advantage of this is that you can easily include header information in your queries to specify which texts you want to look at: if spoken data carries information about the place where it was recorded, you can look at how the word `innit' is used in spoken English in the Midlands using the query (innit within(stext)) within(bncdoc) (<setting place="Midlands"> within(header)).

This rather complex-looking query illustrates several points: within operators can be nested, tag queries with attributes can be combined with ordinary word queries, and header information can be used to restrict a search to particular texts.

Going back to the `British' problem at the beginning of this section, it would be a bit of a nuisance always to have to specify that you don't want the headers included in your queries. For that reason each corpus has a default scope, which is `stext' for a spoken and `text' for a written corpus. This default scope is applied to all queries where no other scope is specified. All single-word queries thus ignore occurrences in the corpus header, unless you specifically ask for them. You can do that by specifying either the header or the whole document as the scope, as in British within(bncdoc) or British within(header). The reason for providing a default scope is to make the system behave in a more intuitive way, but it does not prevent you from looking at the header information if you know what you're doing.

