Corpus description

General information

The Penn-Helsinki Parsed Corpus of Early Modern English, consisting of nearly 1.8 million words, is part of an ongoing larger project at the University of Pennsylvania and the University of York to produce syntactically annotated corpora for all stages of the history of English. Each of the texts in the corpus is available in parsed, POS-tagged, and unannotated form, as explained in detail in the annotation guidelines. In addition, the corpus is divided into three subcorpora.

  1. The Helsinki directories, consisting of roughly 573,000 words, contain the Helsinki Corpus in parsed, POS-tagged, and unannotated form.

  2. The Penn1 directories, consisting of roughly 615,000 words, contain a first supplement to the Helsinki Corpus. As far as possible, we have used material by the same authors and from the same editions as the material in the Helsinki Corpus. Where necessary (where the Helsinki Corpus contains an exhaustive sample of a text), we have added new material as summarized below.

  3. The Penn2 directories, consisting of roughly 606,000 words, contain a second supplement to the Helsinki Corpus. Again, we have tried to use material by the same authors and from the same editions as the material in the Helsinki Corpus. However, the Penn2 directories contain more new material than the Penn1 directories.

Wordcount information

As just explained, the PPCEME consists of texts from three time periods in three subdirectories. In the following three tables, we give the number of words in the corpus broken down (1) by time period and subcorpus, (2) by genre, and (3) by individual text. Table 1 contains a wordcount summary by time period and subcorpus.

Table 1: Wordcount summary by time period and subcorpus
                 Helsinki      Penn1      Penn2       Total
E1 (1500-1569)    196,754    194,018    185,423     576,195
E2 (1570-1639)    196,742    223,064    232,993     652,799
E3 (1640-1710)    179,477    197,908    187,631     565,016
Total             572,973    614,990    606,047   1,794,010
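
The row and column totals in Table 1 can be rechecked with a few lines of Python. This is only a sketch for readers who want to verify the arithmetic; the variable names are illustrative and are not part of the corpus distribution.

```python
# Word counts copied from Table 1: period -> (Helsinki, Penn1, Penn2).
TABLE_1 = {
    "E1": (196_754, 194_018, 185_423),
    "E2": (196_742, 223_064, 232_993),
    "E3": (179_477, 197_908, 187_631),
}

# Row totals: one sum per time period.
period_totals = {period: sum(counts) for period, counts in TABLE_1.items()}

# Column totals: one sum per subcorpus, in the order Helsinki, Penn1, Penn2.
subcorpus_totals = tuple(sum(column) for column in zip(*TABLE_1.values()))

# Grand total for the whole corpus.
grand_total = sum(period_totals.values())

print(period_totals)
print(subcorpus_totals)
print(grand_total)
```

The three results should agree with the Total column, the Total row, and the bottom-right cell of Table 1, respectively.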

Table 2 contains a wordcount summary by text genre.

Table 2: Wordcount summary by text genre
Text genre                Number of words  Percentage
Bible                             134,275        7.5%
Biography, autobiography           41,379        2.3%
Biography, other                   52,755        2.9%
Diary, private                    123,106        6.9%
Drama, comedy                     120,428        6.7%
Educational treatise              113,032        6.3%
Fiction                           116,494        6.5%
Handbook, other                   112,419        6.3%
History                           108,706        6.1%
Law                               115,863        6.5%
Letters, non-private               59,868        3.3%
Letters, private                  116,915        6.5%
Philosophy                         85,107        4.7%
Proceedings, trials               150,090        8.4%
Science, medicine                  41,786        2.3%
Science, other                     79,050        4.4%
Sermon                             97,400        5.4%
Travelogue                        125,337        7.0%
Total                           1,794,010        100%
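
The percentages in Table 2 appear to be each genre's share of the 1,794,010-word total, rounded to one decimal place. The rounding rule is an assumption inferred from the figures, and the function name below is purely illustrative.

```python
TOTAL_WORDS = 1_794_010  # corpus total from Table 2


def genre_share(words: int, total: int = TOTAL_WORDS) -> float:
    """Return a genre's share of the corpus as a percentage, to one decimal.

    Rounding to one decimal place is an assumption, not a documented rule.
    """
    return round(100 * words / total, 1)


print(genre_share(134_275))  # Bible
print(genre_share(41_379))   # Biography, autobiography
print(genre_share(125_337))  # Travelogue
```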

Finally, Table 3 gives wordcounts by individual text, broken down by time period and subcorpus. The information in the table is also contained in the file WORDCOUNT-PPCEME in the info directory for the PPCEME. That file contains only ASCII characters and is suitable for importing into a spreadsheet program; the field separator is the space character.

Table 3: Wordcount information by individual text, broken down by time period and subcorpus
               Helsinki    Penn1    Penn2

abott-e1              0      363        0
ambass-e1             0        0     1410
aplumpt-e1         1863      319        0
apoole-e1             0        0      211
asch-e1            5050     5339     5209
bedyll-e1           448        0        0
boethco-e1        10144    11772    10514
chaplain-e1           0        0     5171
cromwell-e1         434      512      870
dacre-e1              0        0      991
delapole-e1           0      260        0
dplumpt-e1          397        0        0
ebeaum-e1           401        0        0
ecumberl-e1         190        0        0
edward-e1          6207     6566     6655
eliz-1560-e1          0        0      313
elyot-e1           5603     5752     6059
epoole-e1             0        0      919
fabyan-e1          5732     5907     5805
fisher-e1          4857     5382     5209
fitzh-e1           5264     5439     5613
friar-e1              0      907        0
gascoigne-1500-e1     0      536        0
gascoigne-1510-e1     0      305        0
gcromw-e1           500      526      242
gpoole-1500-e1        0        0     2033
gpoole-1510-e1        0        0     1119
grey-e1               0       88        0
harman-e1          5409     5910     5621
henry-1510-e1         0        0      206
henry-1520-e1       917     1117        0
henry-1530-e1         0        0      758
interview-e1          0        0      380
iplumpt-e1          356        0        0
kscrope-1530-e1     257        0        0
latimer-e1         5099     5430     5121
leland-e1          6772     7132     6742
lords-e1            433        0        0
machyn-e1          6908     7424     7394
manners-e1            0        0      284
marches-e1            0      493        0
merrytal-e1        6656     6950     6846
mhoward-e1            0        0      322
morelet1-e1         507        0        0
morelet2-e1        4207     4295        0
moreric-e1         5709     6398     6458
morewol-e1         1440     1589     1396
mowntayne-e1       6055     6243     1660
mroper-e1           858     4291     3368
mtudor-1510-e1        0      910      470
mtudor-1520-e1        0        0      197
nevill-e1             0      158        0
record-e1          6768     7206     6583
roper-e1           5674     5742     5766
rplumpt-e1          352      459        0
rplumpt2-e1           0      561      335
russell-e1            0        0      371
savill-e1             0     1734        0
stat-1500-e1       7848        0        0
stat-1510-e1          0        0     2318
stat-1530-e1          0     9155        0
stat-1540-e1       3979     3733        0
stat-1550-e1          0        0     5788
stat-1560-e1          0        0     4199
stevenso-e1        6501     6724     3309
surety-e1             0        0      146
thoward-e1          391      297        0
throckm-e1        17364        0        0
torkingt-e1        7446     7675     2764
tunstall-e1        1123      906        0
turner-e1          5142     5307     5215
turnerherb-e1         0        0      819
tyndnew-e1        11150    12365    10848
tyndold-e1        10079    10759    10292
udall-e1           4807     5072     8034
underhill-e1          0        0     5609
vicary-e1          6280     6779     6322
wcecil-1560-e1     1208        0        0
wolsey-e1          1064     1231     1139
wplumpt-1500-e1     336        0        0
wplumpt-1510-e1     227        0        0
wplumpt-1530-e1     342        0        0
-----------------------------------------
armin-e2           5267     5416     5219
authnew-e2        11625    12883    11290
authold-e2        10862    11348    10774
bacon-e2           5948     6139     6169
blundev-e2         6573     7190     6983
boethel-e2         7206     7942     7445
brinsley-e2        5669     5930     5900
clowes-e2          7330     7654     5330
clowesobs-e2          0        0     2091
conway-e2           720        0        0
coverte-e2         6206     6467     6304
deloney-e2         7729     8599     8211
dering-e2             0        0     1763
edmondes-e2        1264     1808     1087
eliz-1570-e2          0        0     1061
eliz-1580-e2          0     1660        0
eliz-1590-e2       1576        0      300
essex-e2           5912     3215        0
essexstate-e2         0     3604     7095
everard-e2          557        0        0
forman-diary-e2       0     4125     3173
forman-e2          4142        0        0
gawdy-e2            944     1027     1295
gifford-e2         6332     6632     6232
harley-e2          1737      461        0
harleyedw-e2          0     1569     1804
hatcher-e2            0      281        0
hayward-e2         5305     5706     5671
hoby-e2            6156     7136     6496
hooker-a-e2        2514     2751     2995
hooker-b-e2        2798     2833     2245
jbarring-e2         875      986      905
jotaylor-e2        9475     9679     9419
joxinden-e2           0      636     1540
jubarring-e2          0      777      582
judall-e2             0        0     9195
knyvett-1620-e2    2984     4419     3641
koxinden-e2         170      205      253
kpaston-e2          843     1164      733
kscrope-1580-e2       0      253        0
madox-e2           6438     6790     6744
markham-e2         6243     6443     6416
masham-e2           859     1022      901
middlet-e2         6401     6573     6432
moxinden-e2           0      332      497
nferrar-e2          127      429      498
perrott-e2         5283     5689     5158
pettit-e2           434      467      574
peyton-e2           343        0        0
proud-1620-e2       176        0        0
proud-1630-e2         0      472        0
raleigh-e2         9300     9503      996
rcecil-e2             0     1398     1157
rferrar-e2          199        0        0
rich-e2               0     1697        0
roxinden-1600-e2      0      319        0
roxinden-1620-e2    458       75        0
roxinden2-e2          0      212      439
shakesp-e2         6923     7498     7290
smith-e2           5399     6732     4683
stat-1570-e2          0     2406     2376
stat-1580-e2       1526        0     6618
stat-1590-e2       5193        0        0
stat-1600-e2       5175        0     2246
stat-1620-e2          0        0     4593
stat-1640-e2          0        0     7130
stow-e2            5559     6245     5327
talbot-e2             0        0      266
tbarring-e2         443      582      607
thoward2-e2           0    17685    18179
trincoll-e2         201        0        0
wcecil-1580-e2      905        0      665
wpaston2-e2         438        0        0
-----------------------------------------
alhatton-e3         758        0        0
alhatton2-e3          0      891        0
anhatton-e3         526      712      609
aungier-e3         1065     1010     1302
behn-e3            5742     6137     5680
boethpr-e3         9075    11272     9737
boyle-e3           5423     4340        0
boylecol-e3           0     1349     5790
burnetcha-e3       5813     5928     5849
burnetroc-e3       6432     6557     6454
capel-e3            480      570      558
charles-1650-e3       0      959        0
charles-1670-e3     824        0     1075
chatton-e3          587      747      576
commiss-e3          715        0        0
conway2-e3            0        0     1794
counc-e3            190        0        0
dell-e3               0      591        0
drummond-e3           0     2065        0
ehatton-e3          360        0        0
ehatton2-e3           0        0      856
eoxinden-1650-e3      0      344        0
eoxinden-1660-e3   1868      519        0
eoxinden-1680-e3      0      558        0
evelyn-e3          5945     6268     6325
farquhar-e3        6369     6891     6098
fhatton-e3          469        0        0
fiennes-e3         5171     5444     5640
fox-e3             5540     6118     6012
fryer-e3           5733     6052     6045
hooke-e3           6381     7275     7189
hoole-e3           6345     6587     6700
hoxinden-1640-e3      0     3357        0
hoxinden-1650-e3      0        0     3817
hoxinden-1660-e3   3395        0        0
jackson-e3            0      500        0
jetaylor-e3        6417     5362        0
jetaylormeas-e3       0     1344     6735
jopinney-e3         407      525     2012
jpinney-e3          879      189        0
langf-e3           7123     7750     7811
lisle-e3           6428     7249     6685
locke-e3           5237     5864     5453
memo-e3               0      863      452
mhatton-e3            0      550        0
milton-e3          6661     7320     7313
montague-e3           0        0     1321
nhadd-1700-e3       255      520      456
oates-e3           8689     9758     9233
osborne-e3          988      474        0
penny-e3           6893     8315     5894
pepys-e3           5189     5821     5346
phenry-e3           576      808      585
proposals-e3          0        0      460
rhaddjr-e3          208        0        0
rhaddsr-1650-e3       0     1936        0
rhaddsr-1670-e3     710        0     2501
rhaddsr-1680-e3       0        0      205
rhaddsr-1700-e3     989        0      186
somers-e3           726        0        0
southard-e3           0      405        0
spencer-1680-e3     793        0        0
spencer-1700-e3       0      979        0
spencer-1700-e3       0        0      822
stat-1660-e3          0        0    10615
stat-1670-e3          0        0     3455
stat-1690-e3      13241    14269        0
strype-e3          1016     1354        0
tillots-a-e3       3201     2814        0
tillots-b-e3       3708     5066        0
tillots-c-e3          0        0     6784
vanbr-e3           8061     8659     8786
walton-e3          5876     6673     6089
zouch-e3              0        0      326
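
As an alternative to a spreadsheet, the wordcount information can be summarized with a short script. The sketch below assumes that WORDCOUNT-PPCEME has the same layout as Table 3, that is, one text per line with a text identifier followed by the Helsinki, Penn1, and Penn2 counts, separated by spaces. That layout is an assumption and should be checked against the actual file; the function names are hypothetical.

```python
from collections import defaultdict

SUBCORPORA = ("Helsinki", "Penn1", "Penn2")


def parse_wordcounts(lines):
    """Yield (text_id, counts) for each data line; skip everything else."""
    for line in lines:
        fields = line.split()  # any run of spaces separates fields
        if len(fields) != 4 or not all(f.isdigit() for f in fields[1:]):
            continue  # header, blank, or separator line
        yield fields[0], dict(zip(SUBCORPORA, map(int, fields[1:])))


def totals_by_period(lines):
    """Sum the counts per period suffix (e1, e2, e3) and subcorpus."""
    totals = defaultdict(lambda: dict.fromkeys(SUBCORPORA, 0))
    for text_id, counts in parse_wordcounts(lines):
        period = text_id.rsplit("-", 1)[-1]  # "asch-e1" -> "e1"
        for subcorpus, words in counts.items():
            totals[period][subcorpus] += words
    return {period: dict(sums) for period, sums in totals.items()}


# A few lines in the layout of Table 3, used here instead of the real file.
sample = """\
               Helsinki    Penn1    Penn2

abott-e1              0      363        0
asch-e1            5050     5339     5209
-----------------------------------------
armin-e2           5267     5416     5219
""".splitlines()

print(totals_by_period(sample))
```

To run the same summary over the real file, pass an open file object to totals_by_period in place of the sample lines.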

Conventions governing filenames

General conventions

As in the Helsinki Corpus, the filenames for the texts contain an indication of the time period to which they belong. See Philological information for more details about the individual texts.
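
The period indication can be read off mechanically. The sketch below works on the text identifiers as they are listed in Table 3 (for instance gascoigne-1500-e1 or hooker-a-e2) and maps the period suffix to the date ranges given in Table 1. Treating an optional four-digit component as a date tag is an assumption based on the identifiers themselves, the subcorpus indication carried by actual filenames is not handled, and the function name is hypothetical.

```python
import re

# Date ranges per period, taken from Table 1.
PERIODS = {"e1": (1500, 1569), "e2": (1570, 1639), "e3": (1640, 1710)}

# Base name, optional four-digit tag (assumed to be a date), period suffix.
TEXT_ID = re.compile(
    r"^(?P<name>[a-z0-9-]+?)(?:-(?P<year>\d{4}))?-(?P<period>e[123])$"
)


def parse_text_id(text_id):
    """Split a Table 3 text identifier into its components."""
    match = TEXT_ID.match(text_id)
    if match is None:
        raise ValueError(f"unrecognized text identifier: {text_id!r}")
    year = match["year"]
    return {
        "name": match["name"],
        "year": int(year) if year else None,
        "period": match["period"],
        "date_range": PERIODS[match["period"]],
    }


print(parse_text_id("gascoigne-1500-e1"))
print(parse_text_id("hooker-a-e2"))
```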

In addition, the filenames in the PPCEME contain an indication of which subcorpus they belong to.

A few examples:


In tripling the size of the samples from the Helsinki Corpus, we have sometimes had to include texts by new authors (either because the Helsinki Corpus sample for an author was itself already exhaustive, or because we ran out of text in the course of tripling the sample size). In what follows, we describe the conventions that we have followed in assigning filenames to these new authors. Our general rule has been to leave Helsinki Corpus filenames unchanged, but we have sometimes slightly modified the original Helsinki filenames for clarity and consistency. These modifications, together with the list of PPCEME files that supplement each Helsinki Corpus file, are set out in Table 4 at the end of this section.

Name vs. title

Following the conventions of the Helsinki Corpus, authors are identified by name rather than by title. Sovereigns of England are identified by their given name. For instance, Charles II is identified as charles. Other members of the nobility, including members of the royal family, are identified by their surname. For instance, Thomas Howard, earl of Surrey, 2nd duke of Norfolk, is identified as thoward (not norfolk), and Mary Tudor (Henry VIII's sister, not to be confused with his daughter, Mary I, who is not represented in the corpus) as mtudor.

In one or two cases, the Helsinki Corpus uses a title rather than a surname as the basis for a filename. For instance, Eleanor Clifford, countess of Cumberland, is identified as ecumberl (not clifford). In such cases, we retain the Helsinki filename in order to minimize confusion.

Women's names

As a general rule, women are identified by their surname at the time of writing. Generally (though not always), this is a married name. In order to minimize confusion, we do not change filenames to reflect a later marriage. Two examples:

In the correspondence of important families (such as that of the Barringtons, the Hattons, or the Plumptons), the Helsinki Corpus tends to identify women by their birthname, and we retain those filenames. So Anne Finch, countess of Nottingham, née Hatton, is identified as anhatton (not finch).

In one or two cases, a woman appears in the Helsinki Corpus under her married name despite belonging to one of the important correspondence families. For instance, Joan Everard and Elizabeth Masham, both née Barrington, are identified as everard (not jobarring) and masham (not ebarring). In such cases, we use the Helsinki filenames in order to minimize confusion.

Modifications of Helsinki Corpus filenames

Under certain circumstances, we have modified the filenames in the Helsinki Corpus for clarity and consistency. The conventions governing these modifications are given here, and the correspondence between the old and new filenames is set out in Table 4 at the end of the section.

Table 4: Summary of filename modifications and PPCEME-Helsinki correspondences
Helsinki filename        PPCEME filename (if different from Helsinki)                    Supplemented by
alhatton                 ---                                                             alhatton2, ehatton2
bedyll                   ---                                                             friar, russell
boyle                    ---                                                             boylecol
clowes                   ---                                                             clowesobs
conway                   ---                                                             rich
counc                    ---                                                             dell
ebeaum                   ---                                                             mtudor-1510, mtudor-1520
ecumberl                 ---                                                             manners, delapole
ehatton                  ---                                                             mhatton, montague
eliz1, eliz2             included in eliz-1590                                           eliz-1560, eliz-1570, eliz-1580
eoxinden                 included in eoxinden-1660                                       dering, eoxinden-1650, eoxinden-1680, jackson, zouch
essex                    ---                                                             essexstate
everard                  ---                                                             jubarring
fhatton                  ---                                                             mhatton
harley                   ---                                                             harleyedw
henry1, henry2           included in henry-1520                                          henry-1530
hooker1                  included in hooker-a                                            ---
hooker2                  included in hooker-b                                            ---
hoxinden                 hoxinden-1660                                                   hoxinden-1640, hoxinden-1650
jetaylor                 ---                                                             jetaylormeas
jpinney                  ---                                                             southard, part of jopinney
knyvett                  included in knyvett-1620                                        knyvett-1630
kscrope                  kscrope-1530                                                    grey, kscrope-1580, mhoward
lords                    ---                                                             interview, marches, surety
morelet1, morelet2       ---                                                             part of mroper (see Remarks therein)
mowntayne                ---                                                             underhill
nhadd                    included in nhadd-1700                                          nhadd-1710
osborne                  ---                                                             conway2
pettit                   ---                                                             pettit2
peyton                   ---                                                             moxinden
Plumpton correspondence  ---                                                             abott, apoole, epoole, gascoigne, gpoole, nevill, rplumpt2, savill
proud                    proud-1620                                                      proud-1630
raleigh                  ---                                                             judall
rferrar                  ---                                                             part of nferrar
rhaddsr                  included in rhaddsr-1670 and rhaddsr-1700                       rhaddsr-1650, rhaddsr-1710
roxinden                 included in roxinden-1620                                       roxinden-1600, roxinden2
somers                   ---                                                             drummond
stat3                    stat-1500, included in stat-1540; see info for stat-period1.e1  stat-1510, stat-1530, stat-1550, stat-1560
stat4                    stat-1590, included in stat-1600; see info for stat-period2.e2  stat-1570, stat-1580, stat-1620, stat-1640
stat7                    included in stat-1690; see info for stat-period3.e3             stat-1660
stevenso                 ---                                                             part of udall
strype                   ---                                                             joxinden
thoward                  ---                                                             dacre
throckm                  ---                                                             thoward2
tillots                  divided into tillots-a, tillots-b                               tillots-c
torkingt                 ---                                                             chaplain
trincoll                 ---                                                             hatcher, talbot
tunstall                 ---                                                             ambass
turner                   ---                                                             turnerherb
wcecil                   included in wcecil-1580                                         wcecil-1560
wpaston2                 ---                                                             joxinden
wplumpt1                 wplumpt-1500                                                    ---
wplumpt2                 wplumpt-1510                                                    ---
wplumpt3                 wplumpt-1530                                                    ---
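
For scripts that need to move between Helsinki Corpus and PPCEME filenames, the correspondences in Table 4 can be held in two small lookup tables. The sketch below covers only a handful of rows, chosen from those with a one-to-one rename or a plain list of supplements; the dictionary and function names are hypothetical, and the remaining rows (including the "included in" and "part of" cases) would need to be added by hand.

```python
# Helsinki Corpus filename -> PPCEME filename, for rows of Table 4 where
# the file was simply renamed. Only a subset of the table is shown.
RENAMED = {
    "hoxinden": "hoxinden-1660",
    "kscrope": "kscrope-1530",
    "proud": "proud-1620",
    "wplumpt1": "wplumpt-1500",
    "wplumpt2": "wplumpt-1510",
    "wplumpt3": "wplumpt-1530",
}

# Helsinki Corpus filename -> PPCEME files that supplement it (subset).
SUPPLEMENTS = {
    "boyle": ["boylecol"],
    "hoxinden": ["hoxinden-1640", "hoxinden-1650"],
    "kscrope": ["grey", "kscrope-1580", "mhoward"],
    "proud": ["proud-1630"],
    "turner": ["turnerherb"],
}


def ppceme_name(helsinki_name):
    """Return the PPCEME filename, or the Helsinki name if it is unchanged."""
    return RENAMED.get(helsinki_name, helsinki_name)


def supplements(helsinki_name):
    """Return the supplementing PPCEME files listed here, or an empty list."""
    return SUPPLEMENTS.get(helsinki_name, [])


print(ppceme_name("kscrope"), supplements("kscrope"))
print(ppceme_name("boyle"), supplements("boyle"))
```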