CC vs IN
As well as was originally tagged CC CC" CC". But it is not necessarily an idiom. Compare:
A second problem is that the idiom as well as appears in positions where coordinating conjunctions are excluded, as in:
Such occurrences were tagged as complex prepositions (IN IN" IN"); cf the related preposition besides. An alternative would have been to treat all occurrences of the idiom as well as as IN.
CS vs IN
Some problematic words are: as, but (see the end of 7.14), except, like, till, until, than. The general rule has been to tag a form as IN before a following complement consisting of a noun, pronoun, or noun phrase (there should be no need to illustrate this). IN can also precede numerals and -ing clauses,35 as in:
The tag IN was also used before certain adverbs of time and place (RN) which can clearly be preceded by prepositions: until now (cf up to now), except here (cf from here), etc. Prepositions can precede other prepositions in some clear cases: from under the clouds (F11:153), came to nearer home (G22:7), etc. But with forms which are either CS or IN, we generally opted for CS in this position.
CS introduces finite clauses (again there should be no need for illustration) and is also used before adjectives, adverbs, prepositional phrases, and non-finite verb constructions (except -ing forms). In these cases it is generally possible to paraphrase with a finite clause. Examples:
Idiom tagging was used in some cases where neither CS nor IN applies; see 7.13, the end.
The tag CS was used somewhat more widely with as and than than with the other words of the same problem group. There are some inconsistencies of tagging, in particular before -ing forms, adverbs, and clauses introduced by conjunctions and WH-words.
As
CS was assigned when as introduces finite clauses and when it occurs before adjectives, adverbs, prepositional phrases, and non-finite verb constructions:
The use of CS before -ing forms is questionable, since these can occur as complements of prepositions.
IN was chosen in positions characteristic of prepositions, i.e. before nouns, pronouns, and noun phrases (provided that they are not part of clause-like structures):
IN also applies in examples like:
The principles outlined above lead to different taggings in examples like:36
Neither CS nor IN seemed applicable in:
Here we resorted to idiom tagging (RB RB" RB"); see also 7.13, example 26.
As can further qualify a following adjective or adverb and is then tagged QL. This tag is unproblematic (for a minor problem, see 7.10, example 76); typical occurrences of QL are found in some of the examples above (20, 21, 22, 23, 25).
Finally, as occurs in many idioms:
CC CC" CC" | as well as (see the beginning of 7.15)
CS CS" (CS" CS") | in as much as, inasmuch as, in so far as, insofar as, so as, such as (see 7.12, ABL), as if, as though
IN IN" (IN") | as against, as between, as for, as from, as of, as opposed to, as regards, as to, as versus, as well as (see the beginning of 7.15), such as (see 7.12, ABL)
RB RB" (RB") | as good as (see 7.13, example 25), as well (=also), as it were, as yet, so much as (see above, example 29)
TO TO" TO" | so as to
Note that as far as, as long as, etc were not idiom tagged; see 7.14 under 'idioms'.
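The idiom-tag notation in the list above follows a regular pattern: the first word of the idiom carries the plain tag, and every following word carries the same tag plus a ditto mark ("). The pattern can be sketched in Python as below. The function name ditto_tags is hypothetical and is not part of the tagging programmes; the idioms and tags in the usage lines are taken from the list above.

```python
def ditto_tags(idiom, tag):
    """Pair each word of an idiom with its tag.

    The first word gets the plain tag; every later word gets the
    tag followed by a ditto mark, e.g. CC CC" CC" for 'as well as'.
    """
    words = idiom.split()
    return [(word, tag if i == 0 else tag + '"') for i, word in enumerate(words)]


# Usage: idioms and tags as listed in the text.
print(ditto_tags("as well as", "CC"))
print(ditto_tags("so as to", "TO"))
```

The sketch only formats the tags. Deciding whether a given occurrence of as well as is CC or IN still depends on the syntactic criteria discussed at the beginning of this section.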
Than
The two major tags are CS and IN. They were distinguished according to the same principles as with as. Examples:
CS | Bertie was much more intelligent than most people supposed. K13:60
IN | It was obvious that he had come to say more than good-bye. K18:201
The sequence more than was idiom-tagged (RB RB") before prepositional phrases, adjectives, adverbs, and verb forms in examples like:37
Cf further 7.13, the end.
7.16 WH-words
Originally WH-words were tagged according to their syntactic function in the clause:
WDT | determiner (including pronoun; cf 7.12)
WP | pronoun (with subdivisions according to case)
WRB | adverb
No distinction was made between interrogative and relative uses. To make the classification somewhat more refined, we added an R to the tag for relative uses of the following words: which, that, who, whom, whose.
Note the tagging of the following words:
when | WRB (all uses, including the conjunction)
where | WRB (all uses)
whereas | CS
whereupon | CS
whether | CS
As regards the distinction between W-tags and CS, see further 7.14.
But and as, some uses of which were traditionally analysed as relative, are never given W-tags. See the treatment of these words in Sections 7.14 and 7.15 respectively.
7.17 Numerals
Cardinals are tagged CD, whether they are written as digits or are spelled with ordinary letters. CD includes dozen and zero. The tagging does not vary with syntactic position. There is a special tag for one: CD1. This applies to all uses of one, except when it is part of a sequence with idiom tagging: one another PPLS PPLS", no one PN PN", some one PN PN".
Inflected forms keep their CD tags, with the addition of the usual inflection markers: $ for genitive and S for plural. Examples: one's CD1$, millions CDS, 1930's CDS.
A CD form may include punctuation marks: 2.1, 1,000, 2/3, etc. But note the special tag for a hyphenated pair of cardinals: 1959-60 CD-CD, 2-1 (football score) CD-CD, pp 3-7 CD-CD, etc. The hyphenated tag is only used where the hyphen is equivalent to a preposition. Thus the regular CD tag is assigned in an example like: from opening at eight till closing at one-thirty (K07:209).
CD also applies to sequences like: .5, +7, -7. But + and - when occurring on their own are tagged IN; see 7.22. Formulas and more complex mathematical expressions are tagged &FO. The same applies to forms with subscripts and superscripts. See 7.22.
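The form-based part of the cardinal rules above can be sketched in Python as follows. This is an illustration only: the function name tag_cardinal and the regular expressions are assumptions, spelled-out numerals other than one are left out, and the sketch cannot check whether a hyphen is really equivalent to a preposition or whether one belongs to an idiom-tagged sequence.

```python
import re

# A run of digits, optionally continued by . , or / plus more digits (2.1, 1,000).
_NUM = r"\d+(?:[.,/]\d+)*"


def tag_cardinal(word):
    """Return a CD-type tag for a written form, or None if no rule here applies."""
    lower = word.lower()
    if lower == "one":
        return "CD1"                      # special tag for 'one'
    if lower == "one's":
        return "CD1$"                     # genitive marker added
    if re.fullmatch(rf"{_NUM}-{_NUM}", word):
        return "CD-CD"                    # hyphenated pair, e.g. 1959-60
    if re.fullmatch(rf"[+-]?\.?{_NUM}", word):
        return "CD"                       # plain or signed numeral, e.g. +7
    if re.fullmatch(rf"{_NUM}'?s", word):
        return "CDS"                      # plural, e.g. 1930's
    return None


for form in ["1959-60", "1,000", "+7", "1930's", "one's", "one-thirty"]:
    print(form, tag_cardinal(form))
```

A form such as one-thirty falls outside the sketch and returns None; in the corpus it receives the regular CD tag, as noted above.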
Combinations of numbers and ordinary words are frequently restricted to attributive position and are then tagged JJB (cf 7.8): 14-year-old (adj),38 18-bedroom, post-1918, pre-1960, etc. Examples of other types of tagging are: end-1960 CD, niobium-95 NN, 14-year-old (noun) NN.
Numerals frequently combine with units of measurement. Sequences of $ or *+ (£) plus numeral (without an intervening space) are tagged NNU; see 7.19.
Simple combinations of letters and digits (except ordinals; see below) are generally assigned the tag appropriate to the initial character: A20 ZZ, 10a CD, etc.
Ordinals are OD, whether they are spelled with ordinary letters or written as digits plus st, nd, rd, or th. Provision was made in the tagging scheme for ordinal plus genitive, but no examples of the following kind occurred in the text: George the Third's reign, the third's hat was brown. CD1$ was assigned in: in Charles I's day (F42:60).
The tag OD is assigned when the form modifies a noun or is used in a nominal position and is immediately preceded by the definite article (in which case a head noun is usually recoverable from the context). Examples:
The tag NN is assigned elsewhere in nominal positions, as in:
Plural forms are NNS (except in hyphenated fractions; see below): two thirds, nine tenths, etc.
When ordinals occur in adverbial positions, they are tagged RB. Examples:
OD was kept in cases like: second best, third best. Note the idiom tagging of: at first RB RB". But from the first is tagged IN ATI OD.
Fractions are given CD tags, whether they are written with digits (as regards the representation of fractions in the original Lob Corpus, see Johansson et al 1978:32) or as hyphenated words: 1/2 CD, one-half CD, one-and-a-half CD, one-quarter CD, two fifths CDS, etc. Note the distinction between examples like: one-tenth CD, one tenth CD1 NN. For the distinction between ABN and NN with half, see 7.12.
7.18 Interjections
Most of the words tagged as interjections (UH) can be readily identified as such through their phonological/orthographic features and/or their isolation from the surrounding syntactic constructions (often marked by punctuation). Typical examples are response words (ah, aye, h'm, m, mm, no, oh, OK, okay, sure, ummm, yeah, yes), exclamations (ach, alas, boo, boy, bravo, cor, damn, gee whiz, gosh, ha, hell, hurrah, oh dear, wow), and greetings (good-bye, good-morning, goodnight, hallo, hello, hullo, hi, welcome). There are also hesitation signals (er, well), appeals to the addressee (eh, please), directives (hush, lo), and onomatopoeic expressions (um-chink).
Phrases are not tagged UH; thus, for example, good morning is JJ NN and thank you VB PP2. But there may be sequences of interjections (often separated by punctuation marks), e.g.: oh dear, oh hell, oh no, oh please, oh sure, oh well, oh yes.
Individual words which can easily be parts of larger constructions (with the same meaning) are not tagged UH: (that's) good, (I'm) sorry, (many) thanks, (my) God.
Problems sometimes arise with words which can either be interjections or belong to some other class. The interjections please and well are, for example, not always marked by features of punctuation:
Please and well can nevertheless easily be identified as interjections through their position in the sentence.
Some words for greetings can be nouns as well as interjections. Good-bye was tagged as a noun in examples like: a sad goodbye, it's good-bye, her goodbyes, bid/kiss/say/wave good-bye. The tagging of words for greetings sometimes varies with punctuation:
Adieu and farewell only occur as nouns in the Lob Corpus. There is a problem with welcome, which can be JJ (you are welcome), NN (a royal welcome), and VB (we welcome you) as well as UH. Note the following examples:
Although welcome is here followed by an adverbial of direction, it was tagged UH.
Boy is normally a noun and sure an adjective, but they can also be interjections:
Boy in (7) is an interjection; the utterance is not addressed to a boy. In (8), however, boy is a form of address. Sure in (9) is equivalent to yes. Note the contrast between the two uses of the word in (10).
OK/O.K./okay can either be JJ or UH (there are no instances in the text of the verb). Compare:
In (11) and (12) the forms are integrated syntactically into a larger construction, and the correct tag is therefore JJ. (13) is a clear example of the interjection; there is syntactic and orthographic separation, and the form cannot be part of a larger construction with the same meaning (cf above). (14) is less clear, as we could conceivably expand the two examples as 'Is that OK' and 'That is OK', respectively. Nevertheless, the tag chosen was UH; this is a reflection of our normalcy principle (cf Section 6).
Swear-words are only tagged UH when they are not integrated syntactically with the surrounding constructions. Compare:
Bearing in mind our definition of interjections, these examples are quite clear. But note that punctuation gives misleading clues in (16) and (19); these are examples where the automatic tagging programmes invariably fail. Another swear-word which can have more than one tag is damn. Examples:
The following related words were never tagged UH: damned (JJ, RB, VBN), darn (JJ, RB), darned (JJ, RB), darndest (JJT), durned (RB, VBN).39
7.19 Abbreviations
An abbreviation is marked in the original Lob Corpus by the prefix \0 (as in \0Mr), unless it is part of a sequence of abbreviations or an abbreviated expression, in which case the whole sequence is bracketed: {0B.A.}, {0U.S.}, etc. See Johansson et al (1978:30f). In the tagged text all abbreviations are preceded by \0. There is also a marker under 'special information' in the vertical version of the text; see 2.6. Another change is that abbreviation points are deleted at the end of words (so \0Mr. becomes \0Mr). In the examples below we will omit abbreviation points and the abbreviation marker \0.
The most common tags used with abbreviations are NN, NP... and NNU(S); see below. In other words, abbreviations usually function as nouns. Some examples of the various types of tags are:
CD | m (=million)
IN | nr, v, vs
JJ | jun, mod, sep
JNP | Inc, Ltd
NN | ch, Cmnd, Cons, fig, IQ, Lib, LP, MS, no, para, pl, PT (=Physical Training), TV, VIP, vol, WC
NNS | figs, mod cons
NNU | cm, ft, in, mph, sec, yd, DM (currency)
NNUS | galls, gns, hrs, ins, lbs, mins, pts, yds
NNP | Co (=Company), Lat (=Latin)
NP | CP Snow, USA, YMCA, TUC, London SW1, Ps (=Psalm)
NPS | Pss (=Psalms)
NPL | Co (=County), Is, Rd, Sq, St (=Street)
NPT | Mr, Mrs, Rev, Rt Hon, Sec, Sgt, MP, BA, PhD, CBE, St (=Saint)
NPTS | MPs, Cllrs
NR | Nov, 20 *@ W (*@=degree symbol, W=West)
RB | AD, BC, am, pm, pa (=per annum), eg, ie, viz, c/o, approx, c (=circa), etc, &c, f, ff
UH | OK
VB | v below, cf
The same form may correspond to more than one tag. Examples:
BC | NP | Victoria, BC
C, c | NNU | a 100 c Panax scaler
M, m | CD | about 40 m composition bricks
Rev | NP | (=Revelations)
St | NN | (=stitch)
V, v | IN | centigrade v fahrenheit
An abbreviation may be part of a hyphenated form, as in: KANU-tribes NNS (G68:81), a 10-lb (JJB) bird (E20:65), 11-in (NNU) or 12-in (NNU) is about the maximum permissible (J75:102).40 A tag is assigned to the whole sequence, according to the same principles as with hyphenated forms in general (cf 7.2).
NP-tags
NP is assigned to initials in people's names and to abbreviations naming countries or organisations. Note also abbreviations of books in the Bible. NPL is used with abbreviated locative nouns with a word-initial capital. NPT is found with abbreviated titular nouns with a word-initial capital, including the letters added after a person's name to indicate degrees, honours, qualifications, etc. For more examples, see 7.7.
NNU(S)
The tag NNU is used with abbreviated noun-like measurement units unmarked for number (singular/plural). NNU applies to many abbreviations; see the examples above. But NNUS is used when the form is marked for plural; see the examples above.
Apart from abbreviations, NNU applies to: $, *+ (£), %, *@ (degree symbol), /- (see 7.22 and 7.24). Note also the tagging of expressions for sums of money: $10 NNU, *+10 NNU. Idiom tagging is used with: per cent NNU NNU".
In the original Lob Corpus there are many combinations of a numeral and an NNU form (without an intervening space): 18ft, 14pts, 5%, etc. A space was inserted after the numeral in the tagged text and the tagging is therefore: 18 (CD) ft (NNU), 14 (CD) pts (NNUS), 5 (CD) % (NNU), etc. An NNU-form may be hyphenated; see above.
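The splitting of numeral-plus-unit combinations can be sketched in Python as below. The unit sets contain only forms quoted in this section, so they are far from a complete inventory, and the function name split_numeral_unit is hypothetical. Forms that start with a currency symbol, such as $10, are deliberately left unsplit, since they receive a single NNU tag.

```python
import re

# Units quoted in the text; a real inventory would be much larger.
NNU_UNITS = {"cm", "ft", "in", "mph", "sec", "yd", "DM", "%"}
NNUS_UNITS = {"galls", "gns", "hrs", "ins", "lbs", "mins", "pts", "yds"}

# A numeral (digits with optional . or , groups) followed directly by non-digits.
_NUM_UNIT = re.compile(r"(\d+(?:[.,]\d+)*)(\D+)")


def split_numeral_unit(word):
    """Split e.g. '18ft' into [('18', 'CD'), ('ft', 'NNU')], or return None."""
    match = _NUM_UNIT.fullmatch(word)
    if match is None:
        return None
    numeral, unit = match.groups()
    if unit in NNUS_UNITS:
        return [(numeral, "CD"), (unit, "NNUS")]   # unit marked for plural
    if unit in NNU_UNITS:
        return [(numeral, "CD"), (unit, "NNU")]    # unit unmarked for number
    return None


for form in ["18ft", "14pts", "5%", "$10"]:
    print(form, split_numeral_unit(form))
```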
7.20 Non-standard forms
In the original Lob Corpus there were special codes for non-standard forms; see Johansson et al (1978:29f). These codes have been removed from the words in the tagged text, but a marker is inserted under 'special information' in the vertical version (see 2.2 and 2.6). In the horizontal version and the concordance there are no indications of non-standard forms. This may be confusing, particularly with nonce-forms like vicilisation (=civilisation) and bunkrapt (=bankrupt); most of these are found in a single text (R07).
Non-standard forms are tagged in the same way as the corresponding standard forms. Examples: yer (=your) PP$, ze (=the) ATI, t'ing (=thing) NN.
7.21 Foreign words and expressions
In the original Lob Corpus there were special codes for foreign-language material; see Johansson et al (1978:28f). These codes have been removed from the words in the tagged text, but markers are entered under 'special information' in the vertical version (see 2.2 and 2.6).
Words with codes for 'foreign word or expression' (\), 'Cyrillic alphabet' (\11), and 'Greek alphabet' (\15) in the original Lob Corpus are tagged &FW. The tagging of 'foreign word or expression widely used' (\6) varies. Single words are generally tagged in the same way as ordinary English words, e.g.: post-mortem NN or JJB,41 adagio (JJ) tempo, pace IN, qua IN, sic RB. Phrases are either &FW (word by word: in extenso, mirabile dictu, sine qua non, etc) or idiom-tagged;42 for some examples of idiom-tagging, see 7.2. Note that the tagging may vary depending upon syntactic position. Examples:
ad lib | JJ JJ" | changing from ad lib to restricted feeding
a priori | JJ JJ" | a priori reasoning (J44:13)
Foreign abbreviations are dealt with in the same way as English abbreviations. For some examples, see 7.19.
7.22 Formulas and scientific symbols
A 'formula' in the tagged Lob Corpus is any word containing non-alphabetical characters which cannot be tagged as anything more specific. The tag &FO applies to:
Among symbols tagged &FO we find: <, =, >, #, */ (asterisk), *?16 (arrow pointing down), *?23 (arrow pointing to the right). Cf the coding scheme in Johansson et al (1978).
Note that the tag &FO is not used in the following cases:
See also the treatment of numerals in 7.17. As regards /, see 7.24.
7.23 Cited forms
The tag NC applies to quoted single words or short phrases embedded in sentences, as in:
There is also a marker under 'special information' in the vertical version of the text; see 2.2 and 2.6. NC is not used to tag direct speech in the dialogue passages. Nor is it used with titles of books and works of art; these receive the regular word tags, but a marker is inserted under 'special information' in the vertical version of the text (see 2.2 and 2.6).
7.24 Punctuation marks
The following characters are included in a word if they immediately precede an alphanumeric character:
- . , : ; / (as regards /, see further below)
Examples: 2-1 .5 2.1 1,000 E:2 2/3. In other contexts punctuation marks are treated as separate 'words' and are given their own tags. The tags are usually identical to the marks themselves as represented in the Lob Corpus (see Appendix 4). An apostrophe is never treated as a separate word. Note the following points.
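The attachment rule stated above can be sketched as a small tokenizer in Python. This is a minimal illustration under stated assumptions: the function name tokenize is hypothetical, only the six characters listed above are treated as attaching, and the special handling of abbreviation points, quotation marks, and inserted spaces described elsewhere in this manual is not modelled.

```python
# Characters that stay inside a word when an alphanumeric character follows.
ATTACHING = set("-.,:;/")


def tokenize(text):
    """Split text into 'words', separating punctuation except where it attaches."""
    tokens = []
    for chunk in text.split():
        current = ""
        for i, ch in enumerate(chunk):
            following = chunk[i + 1] if i + 1 < len(chunk) else ""
            if ch.isalnum() or ch == "'":
                current += ch              # an apostrophe is never a separate word
            elif ch in ATTACHING and following.isalnum():
                current += ch              # e.g. the comma in 1,000
            else:
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(ch)          # punctuation as a word of its own
        if current:
            tokens.append(current)
    return tokens


print(tokenize("2-1 .5 2.1 1,000 E:2 2/3"))
print(tokenize("home, John's hat."))
```

With the example forms from the text, each of the six items stays a single word, while the comma and full stop in the second line are split off.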
Full stop
Full stops were inserted in the text of the tagged Lob Corpus under certain circumstances (see 5.2). They were deleted at the end of abbreviations. A full stop is only treated as a 'word' when marking the end of sentences (tagged .) and occasionally when used in a mathematical expression (tagged IN; cf 7.22). Ellipsis is marked by three full stops.
Colon
A colon may occasionally be IN: ... the survival ratio was high, at around 1: 100 (E07:77).
Slash (/)
Like some other punctuation marks, / is multi-functional. In the original Lob Corpus a space was inserted after /, except in the case of and/or and fractions. The last two types are still treated as single words in the tagged text (cf 7.14 and 7.17, respectively). In all other cases / was separated out and treated as a word by itself. The usual tags are either IN (=per) or CC. Examples: 90,000 gall / hour (E27:95), an explosive mixture of methane / air (J72:29). NNU applies in cases like (=shilling, shillings): 1 /- per share (H28:47), 3 /- per share (H27:86).
Hyphen
Sequences containing a hyphen are treated as one word. For examples of differences in tagging depending upon hyphenation, see 7.1. As regards the treatment of particular types of hyphenated forms, see 7.7 (proper nouns), 7.8 (attributive forms), 7.17 (numerals), and 7.19 (abbreviations). There is a clear distinction between hyphen (-) and dash (*-) in the Lob Corpus.
When - is surrounded by spaces, it is treated as a separate word. The tag IN applies when it is equivalent to minus (cf 7.22) and also in examples like (=to): from 4 1/2 - 7 years (J23:133).
Apostrophe
An apostrophe may open or close a word: a' (=all), 'em (=them), no' (=not), 'n (=and, in), an' (=and), o' (=of), wi' (=with), etc. Many such forms are written as separate words and are then tagged in the same way as the full forms. Contractions forming regular patterns are split up in the tagged Lob Corpus and each part is treated as a separate word. Examples:
I'll | I 'll
haven't | have n't
There is no splitting up when the apostrophe marks the genitive inflection: John's, workers', somebody's, other's, others', one's, else's, etc. Nor are words split up in the less common cases where the apostrophe + s mark the plural: MP's, the 1930's, etc. Note further: a'right RB, o'clock RB, rock'n'roll NN, d'Alba NP, O'Hara NP, etc. The following colloquial forms (not marked by an apostrophe) are tagged as verbs: gonna VBG, wanna VB (=going to, want to).
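The splitting of regular contractions can be sketched in Python as follows. Only the two patterns exemplified above ('ll and n't) are covered; whether other contracted endings were split in the same way is not stated here, so they are left out. The function name split_contraction is hypothetical.

```python
import re

# A non-empty stem followed by one of the two endings exemplified in the text.
_CONTRACTION = re.compile(r"(.+?)(n't|'ll)", re.IGNORECASE)


def split_contraction(word):
    """Split I'll into ['I', "'ll"] and haven't into ['have', "n't"].

    Genitives (John's) and fixed forms (o'clock) do not match the
    pattern and are returned unsplit.
    """
    match = _CONTRACTION.fullmatch(word)
    if match is None:
        return [word]
    return [match.group(1), match.group(2)]


for form in ["I'll", "haven't", "John's", "o'clock"]:
    print(form, split_contraction(form))
```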
There is a clear contrast between apostrophe (') and quotation marks (*' or **') in the tagged Lob Corpus.
7.25 Letters
Even the seemingly simple matter of tagging letters is not always straightforward. In the first place, we must cope with different alphabets (p, pi, etc). Secondly, a single letter may have a variety of uses. As an example, consider the letter a (excluding abbreviations, which were given special codes in the original Lob Corpus; cf 7.19):
&FW | e.g. vis a vis
AP | a few, a little (AP AP")
AT | e.g. a baby
IN | e.g. a la (IN IN") Dietrich
JJ | e.g. a priori (JJ JJ") reasoning
RB | a little (RB RB")
ZZ | e.g.
The tag ZZ is only used when it is not possible to assign a more specific tag. Other examples of letters assigned different tags are (again excluding abbreviations): I PP1A or ZZ, m UH or ZZ, x IN (cf 7.22) or ZZ.
When a letter has noun inflection, it receives an NN-tag, as in: substituting for the D's (J76:8), B can partake in A's light-signalling experiment (J51:53).
As regards the tagging of combinations of letters and numerals, see 7.17. For the tagging of letter sequences in a scientific context, see 7.22.