5.1 Capitalisation
Initial capitals in the original corpus were changed to lower case at the beginning of sentences, in headings, book titles, quotations, and descriptive naming expressions. The reasoning is that in these cases the capital is not a permanent characteristic of the word and does not signal its grammatical behaviour. The only words which keep their word-initial capital are:
Examples of changes in capitalisation are:
the | BECOMES | the
British | | British
Broadcasting | | broadcasting
Corporation | | corporation

Look | | look
Back | | back
in | | in
Anger | | anger
In the first example British is a JNP (adjective habitually written with a word-initial capital) and therefore retains its initial capital.
Since there is no clear borderline between 'true' proper names and descriptive naming expressions and since the capitalisation of a word (e.g. Catholic) is not always consistent, it was not possible to arrive at complete consistency in the use of upper- vs lower-case initials in the tagged corpus. Some guidance to changes in capitalisation is given in the 'special information' columns of the vertical version of the tagged corpus (see 2.2, 2.6). For exact information on capitalisation, the user must consult the original corpus text.
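The lower-casing rule in this section can be illustrated with a minimal Python sketch. It assumes each word arrives paired with its tag, and that a known set of tags marks the words that keep their capital. Only JNP is confirmed by the text of this section; the function name, the set name and the placeholder tag OTHER are hypothetical and not part of the corpus software.

```python
# Tags whose words retain their word-initial capital. Only JNP is confirmed
# by this section; the rest of the list is an assumption left to the user.
KEEP_CAPITAL_TAGS = {"JNP"}


def normalise_initial(word, tag, keep_tags=KEEP_CAPITAL_TAGS):
    """Lower-case the first letter unless the tag marks a retained capital."""
    if tag in keep_tags or not word[:1].isupper():
        return word
    return word[0].lower() + word[1:]


# "OTHER" is a placeholder tag used for illustration, not a real corpus tag.
example = [
    ("the", "OTHER"),
    ("British", "JNP"),
    ("Broadcasting", "OTHER"),
    ("Corporation", "OTHER"),
]
normalised = [normalise_initial(word, tag) for word, tag in example]
```

Because capitalisation in the tagged corpus is not fully consistent, a sketch like this can only approximate the actual changes; the 'special information' columns remain the place to check individual cases.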
5.2 Punctuation marks and sentence/paragraph division
Note further the following changes in the marking of sentences and paragraphs (cf Johansson et al 1978:26f):
As a result of these changes, we find the following marking of sentence/paragraph division in the tagged versions of the corpus (the marking in the original corpus is given within parentheses):
| Vertical version | Horizontal version | KWIC concordance
Beginning of sentences (^) | ---- | ^ | ^
Included sentences (~) | ----- | ~ | ~
Beginning of heading (*<) | ---- | ^ | ^
Beginning of list (_) | ----- | ^ | ^
Beginning of paragraph (|) | P ('special information') | indentation | 3 spaces
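For anyone processing the corpus programmatically, the table above can be held as a small lookup structure. The following Python sketch copies the cell values as they appear in the table; the names Marking, MARKING and horizontal_marker are hypothetical, and the entries for paragraphs are descriptions rather than literal marker strings.

```python
from typing import NamedTuple


class Marking(NamedTuple):
    vertical: str
    horizontal: str
    kwic: str


# Keys are the markers used in the original corpus; values are copied from
# the table, one field per tagged version.
MARKING = {
    "^": Marking("----", "^", "^"),    # beginning of sentence
    "~": Marking("-----", "~", "~"),   # included sentence
    "*<": Marking("----", "^", "^"),   # beginning of heading
    "_": Marking("-----", "^", "^"),   # beginning of list
    # Paragraph entries describe the marking rather than giving a literal code.
    "|": Marking("P ('special information')", "indentation", "3 spaces"),
}


def horizontal_marker(original):
    """Return the horizontal-version marking for an original-corpus marker."""
    return MARKING[original].horizontal
```

The lookup makes one consequence of the table explicit: headings and lists are no longer distinguishable from ordinary sentence beginnings in the horizontal version and the concordance.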
5.3 Contractions
Contracted words are split up in the tagged corpus (indicated under 'special information' in the vertical version of the corpus; cf 2.6). See further the end of 7.24 (under 'apostrophe').
5.4 Codes for abbreviations and 'non-English' words
In the original corpus an abbreviation is marked in one of two ways: it is either preceded by \0, as in \0Mr, or enclosed in curly brackets, as in {0U.S.A.}. In the tagged corpus all abbreviations are prefixed by \0. The similar 'non-English' codes (cf Johansson et al 1978:28-30) are stripped away. Markers are inserted under 'special information' in the vertical version; see 2.6. In the horizontal version and the concordance there are no 'non-English' codes.
For information on the tagging of abbreviations and 'non-English' words, see 7.19-21.
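The unification of the two abbreviation markings can be sketched in Python as follows. This is an illustration only: it assumes the code character is the digit zero, that the bracketed form contains no nested brackets, and that the abbreviation itself is carried over unchanged. The function and pattern names are hypothetical.

```python
import re

# Matches the bracketed form, e.g. {0U.S.A.}; nested brackets are assumed
# not to occur.
BRACED_ABBREVIATION = re.compile(r"\{0([^{}]*)\}")


def normalise_abbreviation_codes(text):
    # Rewrite the bracketed form as the prefixed form; abbreviations that
    # already carry the backslash-zero prefix are left untouched.
    return BRACED_ABBREVIATION.sub(lambda m: "\\0" + m.group(1), text)


converted = normalise_abbreviation_codes("{0U.S.A.} \\0Mr")
```

A function is used as the replacement argument so that the backslash in the prefix is inserted literally rather than being interpreted as a group reference.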
5.5 Other differences
The following features have been deleted in the tagged versions of the corpus: