2 Tagged versions
In the tagged corpus each word is accompanied by a word-class tag, assigned through a combination of automatic tagging programs and manual pre- and post-editing. There is no syntactic bracketing. The LOB tagging suite will be described in some detail in Section 4 below. There are two versions of the tagged corpus:
I: a horizontal format, with a running text where each word is immediately followed by its associated tag;
II: a vertical format, where each word is on a separate line together with its associated tag, some 'special information' (see 2.6), and a reference number.
All versions of the LOB Corpus text (tagged and untagged) are distributed only on tape, and only for use by academic researchers. Concordances based on the corpus are available on tape and microfiche; see Section 8.
Users of the material are asked to notify the compilers of errors and inconsistencies in tagging.
2.1 Description of tape and files - vertical version
Code: | ASCII |
Tracks: | 9 |
Density: | 1600 or 6250 |
Label: | none |
Parity: | odd |
Files: | 54 |
EOF marks: | 1 after each file, 2 at end of tape |
Record size: | 60 |
Blocking factor: | 100 |
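The record size and blocking factor together imply blocks of 60 x 100 = 6000 characters. The following is a minimal sketch of cutting one block into records. It assumes the tape has been copied to a plain byte string with no record separators, which this manual does not specify. The names RECORD_SIZE, BLOCKING_FACTOR and split_records are illustrative only.

```python
RECORD_SIZE = 60        # characters per record (vertical version)
BLOCKING_FACTOR = 100   # records per block
BLOCK_SIZE = RECORD_SIZE * BLOCKING_FACTOR  # 6000 characters per full block


def split_records(block: bytes, record_size: int = RECORD_SIZE) -> list[str]:
    """Cut one raw block into fixed-length ASCII records.

    Assumption: the block is a bare byte string with no line terminators.
    """
    if len(block) % record_size != 0:
        raise ValueError("block length is not a multiple of the record size")
    return [
        block[i:i + record_size].decode("ascii")
        for i in range(0, len(block), record_size)
    ]
```

The same function can be used for the horizontal version by passing record_size=80.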
2.2 Description of records - vertical version
Column | Contents |
1-11 | Reference number |
13-17 | Tag |
19-43 | Word |
45-49 | These columns may contain the following details: |
50 | H heading |
51 | N descriptive name |
52 | T title |
53 | C cited word |
54 | F foreign word |
55 | @ PREEDIT1 query (see below) |
57 | P paragraph marker |
58 | I included-sentence marker |
Columns which have not been specified are blank. For a description of the codes, see 2.6.
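As an illustration of the column layout above, here is a minimal sketch of reading one vertical record by position. It assumes records arrive as 60-character fixed-width strings, and that columns 45-49 hold the backslash codes listed in 2.6. The sample extract suggests this, but the table does not state it outright. All names (FLAG_COLUMNS, VerticalRecord, field, parse_vertical) are hypothetical.

```python
from dataclasses import dataclass

# Single-column markers from the table in 2.2, keyed by 1-based column number.
FLAG_COLUMNS = {50: "H", 51: "N", 52: "T", 53: "C", 54: "F",
                55: "@", 57: "P", 58: "I"}


@dataclass
class VerticalRecord:
    reference: str
    tag: str
    word: str
    code: str    # columns 45-49 (assumed: abbreviation / 'non-English' code)
    flags: set   # names of the marker columns that are not blank


def field(line: str, first: int, last: int) -> str:
    """Return 1-based inclusive columns first..last with blanks stripped."""
    return line[first - 1:last].strip()


def parse_vertical(line: str) -> VerticalRecord:
    line = line.rstrip("\n").ljust(60)  # pad short lines to the record size
    flags = {name for col, name in FLAG_COLUMNS.items() if line[col - 1] != " "}
    return VerticalRecord(
        reference=field(line, 1, 11),
        tag=field(line, 13, 17),
        word=field(line, 19, 43),
        code=field(line, 45, 49),
        flags=flags,
    )
```

A record is treated as carrying a marker whenever the corresponding column is non-blank. Whether the tape ever stores a different character in those columns is not something this sketch checks.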
2.3 Description of tape and files - horizontal version
Code: | ASCII |
Tracks: | 9 |
Density: | 1600 or 6250 |
Label: | none |
Parity: | odd |
Files: | 54 |
EOF marks: | 1 after each file, 2 at end of tape |
Record size: | 80 |
Blocking factor: | 100 |
2.4 Description of records - horizontal version
Column | Contents |
1-7 | Reference |
9-80 | Tagged text (sequences of WORD-TAG) |
One original text line may be divided into two lines in the tagged corpus.
The reference consists of the LOB Corpus line identification, i.e. text sample code (letter A to R and two digits) and line number (1-3 digits). The line identification is the same as in the original, untagged corpus. See also 2.8 and 2.9.
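The record layout and reference format described above can be sketched as a small parser. This is an illustration only. It assumes true fixed-width input (reference in columns 1-7, text from column 9) and that each word is joined to its tag by an underscore, as in the sample extract in 2.9. Items without an underscore, such as the `^` seen in that extract, are kept with no tag, since their meaning is not defined in this section. The names REFERENCE and parse_horizontal are hypothetical.

```python
import re

# Text category (A-R), sample no. (2 digits), line no. (1-3 digits).
REFERENCE = re.compile(r"^([A-R])(\d{2})\s*(\d{1,3})$")


def parse_horizontal(line: str):
    """Split one record into ((category, sample, line_no), [(word, tag), ...])."""
    ref_field = line[0:7].strip()   # columns 1-7
    text = line[8:80]               # columns 9-80
    m = REFERENCE.match(ref_field)
    if m is None:
        raise ValueError(f"unrecognised reference: {ref_field!r}")
    category, sample, line_no = m.group(1), m.group(2), int(m.group(3))
    tokens = []
    for item in text.split():
        # The tag follows the last underscore, so words that themselves
        # contain an underscore are still split correctly.
        word, sep, tag = item.rpartition("_")
        tokens.append((word, tag) if sep else (item, None))
    return (category, sample, line_no), tokens
```

Splitting at the last underscore rather than the first is a deliberate choice: tags in the sample never contain an underscore, whereas a word might.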
2.6 Special information in the vertical version
The vertical version of the corpus may contain 'special information' in the columns after the word (cf 2.2):
H, N, and T are useful where there has been a change in capitalisation; see 5.1. C (like NC, see 7.23) occurs rather sparingly and only with short cited words and phrases. The 'special information' was put in at the pre-editing stage and may contain inconsistencies, due to changes at later stages of the project. The 'special information' in the vertical version also includes a paragraph marker (P) and a marker for included sentences (I), both put in at a late stage. The markers are given after the first word of the sentence/paragraph.
Codes for abbreviations and 'non-English' words:
\ | foreign word |
\0 | abbreviation |
\1 | non-current English |
\2 | non-standard English |
\3 | foreigner English |
\4 | science fiction |
\5 | miscellaneous |
\6 | foreign word or expression widely used |
\11 | Cyrillic alphabet |
\15 | Greek alphabet |
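The code table can be held as a simple lookup. The sketch below also separates a code prefixed to a word (as in `\0Mr` in the sample extracts) from the word itself. It is an assumption that codes always appear as a prefix in this way. Note also that `\1` followed by a word beginning with the digit 1 or 5 cannot be told apart from `\11` or `\15`. The sketch simply prefers the longer code. CODES and split_code are illustrative names.

```python
# Codes for abbreviations and 'non-English' words, as listed in the table.
CODES = {
    "\\": "foreign word",
    "\\0": "abbreviation",
    "\\1": "non-current English",
    "\\2": "non-standard English",
    "\\3": "foreigner English",
    "\\4": "science fiction",
    "\\5": "miscellaneous",
    "\\6": "foreign word or expression widely used",
    "\\11": "Cyrillic alphabet",
    "\\15": "Greek alphabet",
}


def split_code(word: str):
    """Return (code, rest); code is None if the word carries no known code.

    Longer codes are tried first so that "\\15" is not read as "\\1" + "5".
    """
    for code in sorted(CODES, key=len, reverse=True):
        if word.startswith(code):
            return code, word[len(code):]
    return None, word
```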
@ flag
The query marker @ was inserted by the automatic pre-edit program PREEDIT1 in a column of the verticalised text when certain changes or additions had been made to the horizontal text and in certain other circumstances when verticalisation was problematic:
1. Full-stop was inserted because one of the following characters had been encountered without preceding terminal punctuation:
2. One of the following characters or strings had been changed:
3. One of the following characters had been treated as a single word though this might be problematic:
Because of the differences in word division, the number of words is somewhat higher than in the original, untagged corpus:
A | 89,139 | J | 161,907 |
B | 54,447 | K | 59,205 |
C | 34,321 | L | 49,145 |
D | 34,388 | M | 12,120 |
E | 76,916 | N | 59,390 |
F | 89,094 | P | 59,382 |
G | 155,342 | R | 18,203 |
H | 60,769 | Tot. | 1,013,768 |
These figures do not include the punctuation marks.
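As a quick consistency check, the per-category figures can be summed and compared with the stated total. The variable names are illustrative. The figures are copied from the table above.

```python
# Word counts per text category in the tagged corpus (punctuation excluded).
WORDS_PER_CATEGORY = {
    "A": 89_139, "B": 54_447, "C": 34_321, "D": 34_388, "E": 76_916,
    "F": 89_094, "G": 155_342, "H": 60_769, "J": 161_907, "K": 59_205,
    "L": 49_145, "M": 12_120, "N": 59_390, "P": 59_382, "R": 18_203,
}
STATED_TOTAL = 1_013_768

computed_total = sum(WORDS_PER_CATEGORY.values())
print(computed_total, computed_total == STATED_TOTAL)
```

The fifteen category figures add up to the stated total of 1,013,768.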
2.8 Sample text extract - vertical version
A012001  -----  ------------------------------
A012002  *'     *'            H
A012010  VB     stop          H
A012020  VBG    electing      H
A012030  NN     life          H
A012040  NNS    peers         H
A012041  **'    **'           H
A012042  .      .             H @
A013001  -----  -------------------------------
A013010  IN     by            H
A013020  NP     Trevor        H
A013030  NP     Williams      H
A013031  .      .             H @
A014001  -----  -------------------------------
A014010  AT     a             P
A014020  NN     move
A014030  TO     to
A014040  VB     stop
A014050  NPT    \0Mr          \0
A014060  NP     Gaitskell
A014070  IN     from
A014080  VBG    nominating
A014090  DTI    any
A014100  AP     more
A014110  NN     labour        N
A015010  NN     life
A015020  NNS    peers         N
A015030  BEZ    is
A015040  TO     to
A015050  BE     be
A015060  VBN    made
A015070  IN     at
A015080  AT     a
A015090  NN     meeting
A015100  IN     of
A015110  NN     labour        N
A015120  NPTS   \0MPs         \0
A015140  NR     tomorrow

Reference number A015141, read from left to right:
A   Text category (letter)
01  Sample no. (2 digits)
5   Line no. (1-3 digits)
14  Word no. (2 digits)
1   Separator (1 digit)
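The reference-number layout given with the extract (text category, sample no., line no., word no., separator) can be decoded mechanically, because the last three digits always belong to the word number and separator. The following is a minimal sketch. It assumes that any blanks inside the reference field are padding and can be dropped. Reference and parse_reference are hypothetical names.

```python
import re
from typing import NamedTuple


class Reference(NamedTuple):
    category: str    # text category (letter A to R)
    sample: int      # sample no. (2 digits)
    line: int        # line no. (1-3 digits)
    word: int        # word no. (2 digits)
    separator: int   # separator (1 digit)


_REF = re.compile(r"^([A-R])(\d{2})(\d{1,3})(\d{2})(\d)$")


def parse_reference(field: str) -> Reference:
    compact = "".join(field.split())  # assumption: embedded blanks are padding
    m = _REF.match(compact)
    if m is None:
        raise ValueError(f"unrecognised reference number: {field!r}")
    category, sample, line, word, separator = m.groups()
    return Reference(category, int(sample), int(line), int(word), int(separator))
```

For example, parse_reference("A015141") yields category A, sample 1, line 5, word 14, separator 1.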
2.9 Sample text extract - horizontal version
A012 ^ *'_*' stop_VB electing_VBG life_NN peers_NNS **'_**' ._.
A013 ^ by_IN Trevor_NP Williams_NP ._.
A014 ^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN
A014 nominating_VBG any_DTI more_AP labour_NN
A015 life_NN peers_NNS is_BEZ to_TO be_BE made_VBN at_IN a_AT meeting_NN
A015 of_IN labour_NN \0MPs_NPTS tomorrow_NR
A016 ^ \0Mr_NPT Michael_NP Foot_NP has_HVZ put_VBN down_RP a_AT
A016 resolution_NN on_IN the_ATI subject_NN and_CC
A017 he_PP3A is_BEZ to_TO be_BE backed_VBN by_IN \0Mr_NPT Will_NP
A017 Griffiths_NP ,_, \0MP_NPT for_IN Manchester_NP
A018 Exchange_NP ._.
A019 ^ though_CS they_PP3AS may_MD gather_VB some_DTI left-wing_JJB
A019 support_NN ,_, a_AT large_JJ majority_NN
A010 of_IN labour_NN \0MPs_NPTS are_BER likely_JJ to_TO turn_VB down_RP
A010 the_ATI Foot-Griffiths_NP
A011 resolution_NN ._.
A012 ^ *'_*' abolish_VB Lords_NPTS **'_**' ._.
A013 ^ \0Mr_NPT Foot's_NP$ line_NN will_MD be_BE that_CS as_CS labour_NN
A013 \0MPs_NPTS opposed_VBD the_ATI
A014 government_NN bill_NN which_WDT brought_VBD life_NN peers_NNS into_IN
A014 existence_NN ,_, they_PP3AS should_MD
A015 not_XNOT now_RN put_VB forward_RB nominees_NNS ._.
A016 ^ he_PP3A believes_VBZ that_CS the_ATI House_NPL of_IN Lords_NPT
A016 should_MD be_BE abolished_VBN and_CC that_CS
A017 labour_NN should_MD not_XNOT take_VB any_DTI steps_NNS which_WDT
A017 would_MD appear_VB to_TO *'_*' prop_VB up_RP **'_**' an_AT
A018 out-dated_JJ institution_NN

The reference reads, from left to right: text category (letter), sample no. (2 digits), line no. (1-3 digits).
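Since one original text line may be divided into two records carrying the same reference (see 2.4), it can be convenient to rejoin them. The sketch below merges consecutive records that share a reference. It assumes the records have already been split into (reference, text) pairs and are in corpus order. join_split_lines is a hypothetical helper.

```python
from itertools import groupby


def join_split_lines(records):
    """Merge consecutive (reference, text) pairs that share a reference.

    Yields (reference, joined_text) in the original order.
    """
    for ref, group in groupby(records, key=lambda rec: rec[0]):
        yield ref, " ".join(text for _, text in group)


records = [
    ("A014", "^ a_AT move_NN to_TO stop_VB \\0Mr_NPT Gaitskell_NP from_IN"),
    ("A014", "nominating_VBG any_DTI more_AP labour_NN"),
    ("A015", "life_NN peers_NNS is_BEZ to_TO be_BE made_VBN at_IN a_AT meeting_NN"),
]
for ref, text in join_split_lines(records):
    print(ref, text)
```

Only adjacent records are merged, so a reference that recurs later in the file is kept as a separate line.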