3 The LOB tag set

Each word and punctuation mark is accompanied by a tag. The tags consist of one to five characters, usually mnemonic labels forming groups with a transparent, partially hierarchical structure. The LOB tag set is based on that used for the tagged Brown Corpus (cf Francis and Kucera 1979, 1982) but is considerably larger. The goal has been to achieve greater delicacy, while preserving comparability with the Brown Corpus. The tags are listed in Appendix 4.

The LOB tag set, though larger and more delicate than that used for the Brown Corpus, represents a fairly broad classification. Greater delicacy had to be sacrificed in order to achieve the goal of tagging the entire LOB Corpus.

3.1 An overview of the LOB tag set

The tags consist of a base, which is very often followed by 'suffixes' marking subclass and/or inflection (usually in this order). At the end, there may also be a ditto mark; cf 3.3. The main base tags and suffixes are listed below.

Base tags   Description                  Reference

A...        determiner/pronoun           7.12
BE...       be (lexical verb or aux)     7.5
CC          coordinating conjunction     7.14
CD...       cardinal numeral             7.17
CS          subordinating conjunction    7.14
DO...       do (lexical verb or aux)     7.5
DT...       determiner/pronoun           7.12
EX          existential there            7.10
HV...       have (lexical verb or aux)   7.5
IN          preposition                  7.13
J...        adjective                    7.8
MD          modal auxiliary              7.5
N...        noun                         7.6 - 7.7
OD...       ordinal numeral              7.17
P...        pronoun                      7.12
QL...       qualifier                    7.10
R...        adverb                       7.10
TO          infinitival to               7.13
UH          interjection                 7.18
VB...       lexical verb                 7.3 - 7.5
W...        WH-word                      7.16
XNOT        not
ZZ          letter                       7.25

Suffixes    Description                  May occur with

A           nominative                   determiners, pronouns,
O           accusative                   nouns, numerals
I           singular or plural
S           plural
$           possessive
R           relative

D           past tense                   verbs
G           present participle, gerund
N           past participle
Z           3rd person singular

R           comparative                  adjectives, adverbs
T           superlative

See further the list of tags in Appendix 4.
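The base-plus-suffix structure described in this section can be illustrated with a minimal sketch. It assumes that a tag is handled as a plain string, that the ditto mark is a trailing double quote (as in the examples in 3.3), and that the base is the longest prefix found among the base tags listed in the overview above. The function name split_tag and the table BASE_TAGS are hypothetical and are not part of the corpus distribution; punctuation tags and any other tags listed only in Appendix 4 are not covered, and the suffix string is returned uninterpreted.

```python
# Base tags copied from the overview table in 3.1 (not the full list in Appendix 4).
BASE_TAGS = {
    "A": "determiner/pronoun", "BE": "be", "CC": "coordinating conjunction",
    "CD": "cardinal numeral", "CS": "subordinating conjunction", "DO": "do",
    "DT": "determiner/pronoun", "EX": "existential there", "HV": "have",
    "IN": "preposition", "J": "adjective", "MD": "modal auxiliary",
    "N": "noun", "OD": "ordinal numeral", "P": "pronoun", "QL": "qualifier",
    "R": "adverb", "TO": "infinitival to", "UH": "interjection",
    "VB": "lexical verb", "W": "WH-word", "XNOT": "not", "ZZ": "letter",
}

DITTO_MARK = '"'  # assumption: the ditto mark is a trailing double quote


def split_tag(tag):
    """Split a tag into (base, suffixes, is_ditto).

    The base is the longest entry of BASE_TAGS that is a prefix of the tag;
    whatever follows is returned as one uninterpreted suffix string.
    If no base matches (e.g. a punctuation tag), base is None.
    """
    is_ditto = tag.endswith(DITTO_MARK)
    core = tag[:-1] if is_ditto else tag
    # Try longer bases first so that "VB" wins over a shorter candidate.
    for base in sorted(BASE_TAGS, key=len, reverse=True):
        if core.startswith(base):
            return base, core[len(base):], is_ditto
    return None, core, is_ditto


if __name__ == "__main__":
    for example in ["VBD", "NNS", "JJR", 'IN"', "XNOT"]:
        print(example, split_tag(example))
```

Because the same suffix letter can mean different things for different word classes (R is "relative" with pronouns but "comparative" with adjectives and adverbs), any interpretation of the suffix string would have to take the base into account; the sketch deliberately stops short of that.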

3.2 Differences with respect to the Brown tag set

The principal differences between the Brown and LOB tag sets are:

In addition, contractions are split up in the LOB Corpus (see 5.3 and the end of 2.4), while the Brown Corpus uses compound tags (see Francis and Kucera 1979). These differences in the tag set, together with some variation in how the tags are applied (especially a somewhat wider use of non-participle tags for -ed and -ing forms; cf 7.3 and 7.4), mean that comparisons between the tagged LOB Corpus and the tagged Brown Corpus must be made with caution.

3.3 Ditto tags

Ditto tags were introduced to deal with sequences of words whose syntactic role in combination differs from the role of the same words in other contexts. Examples:

as to        IN IN"         (complex preposition)
each other   PPLS PPLS"     (complex reciprocal pronoun)
so as to     TO TO" TO"     (complex infinitive marker)

Such idiom tagging makes it possible to avoid an arbitrary or counter-intuitive tagging at the level of the single word. See further 7.2.

The ditto tags occurring in the tagged corpus are included in the list in Appendix 4. The total number of words assigned such tags is fairly low (approx 5,500).
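As an illustration of how ditto tags might be used when processing the tagged text, the following sketch merges each ditto-tagged word with the word or words before it, so that a sequence such as "so as to" is treated as one unit. It assumes that the text is already available as a list of (word, tag) pairs and that the ditto mark is a trailing double quote; neither assumption reflects the actual file format of the corpus. The function name group_ditto is hypothetical, and the tag VB on "go" in the usage example is there for illustration only.

```python
DITTO_MARK = '"'  # assumption: the ditto mark is a trailing double quote


def group_ditto(tagged_words):
    """Merge ditto-tagged words into the preceding unit.

    tagged_words: list of (word, tag) pairs.
    Returns a list of (words, tag) pairs, where words is a list of one or
    more word forms and tag is the tag without any ditto mark.
    """
    units = []
    for word, tag in tagged_words:
        bare = tag.rstrip(DITTO_MARK)
        continues_previous = (
            tag.endswith(DITTO_MARK)
            and units
            and units[-1][1] == bare  # same tag as the unit being continued
        )
        if continues_previous:
            units[-1][0].append(word)
        else:
            # A stray ditto tag with nothing to attach to starts a new unit.
            units.append(([word], bare))
    return units


if __name__ == "__main__":
    sample = [("so", "TO"), ("as", 'TO"'), ("to", 'TO"'), ("go", "VB")]
    for words, tag in group_ditto(sample):
        print(" ".join(words), tag)
```

Grouping in this way keeps the idiom as a single item in frequency counts, while the individual word forms remain available in the words list.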