Corpus description

General information

The PPCME2 text samples are based largely on the Middle English section of the Diachronic Part of the Helsinki Corpus of English Texts (available from ICAME), with certain additions and deletions. The PPCME2 samples, however, are considerably larger than the Helsinki samples. For the earliest Helsinki time period, all texts are exhaustively sampled. For later Helsinki time periods, two texts per period were expanded to 50,000 words. The remaining texts are represented by the Helsinki Corpus sample.

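The main Helsinki time periods are M1-M4, each covering approximately one hundred years. In addition, texts originally written in a given period but for which the earliest manuscript is from a later period are given two-digit period designations: the first digit gives the composition period and the second the manuscript period, so that M23 marks a text composed in M2 whose earliest manuscript dates from M3. An X in place of the first digit marks a text whose composition date is unknown. The following is a list of all Helsinki periods as they appear in the corpus file names: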

Period designation   Composition date   Manuscript date
MX1                  unknown            1150-1250
M1                   1150-1250          1150-1250
M2                   1250-1350          1250-1350
M23                  1250-1350          1350-1420
M24                  1250-1350          1420-1500
M3                   1350-1420          1350-1420
M34                  1350-1420          1420-1500
MX4                  unknown            1420-1500
M4                   1420-1500          1420-1500
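
The period designations can be decoded mechanically. The following is a minimal sketch in Python, assuming that the designation is always the final extension of a corpus file name (as in the file names listed in this document). The function names decode_period and period_of are hypothetical and are not part of the corpus distribution; the date ranges are copied from the list above.

```python
# Date ranges of the four main Helsinki periods, copied from the list above.
PERIOD_DATES = {
    "1": "1150-1250",
    "2": "1250-1350",
    "3": "1350-1420",
    "4": "1420-1500",
}


def decode_period(designation):
    """Return (composition date, manuscript date) for a designation like 'M23'.

    Hypothetical helper: one digit means composition and manuscript fall in
    the same period; two characters give composition first, manuscript second;
    'X' in first position means the composition date is unknown.
    """
    code = designation.upper()
    if not code.startswith("M") or len(code) not in (2, 3):
        raise ValueError(f"not a period designation: {designation!r}")
    tail = code[1:]
    comp, ms = (tail, tail) if len(tail) == 1 else (tail[0], tail[1])
    if ms not in PERIOD_DATES or (comp != "X" and comp not in PERIOD_DATES):
        raise ValueError(f"not a period designation: {designation!r}")
    composition = "unknown" if comp == "X" else PERIOD_DATES[comp]
    return composition, PERIOD_DATES[ms]


def period_of(filename):
    """Extract the period designation from a file name such as 'cmorm.po.m1'."""
    # Assumption: the designation is whatever follows the last dot.
    return filename.rsplit(".", 1)[1].upper()


if __name__ == "__main__":
    for name in ("cmaelr3.m23", "cmlambx1.mx1", "cmorm.po.m1"):
        print(name, period_of(name), decode_period(period_of(name)))
```

Splitting on the last dot only is deliberate, since at least one file name (cmorm.po.m1) contains a dot before the period extension.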

The current edition of the PPCME2 includes a total of roughly 1.2 million words of running text. It comprises 55 text samples, each of which is given in three forms: a text file, a part-of-speech tagged file and a parsed file. In addition, there is a file with philological and bibliographical information about each text.

Wordcount information

Below is a table of the number of words by Helsinki time period:

Period     Word count
MX1            62,596
M1            195,494
M2             93,999
M23            17,013
M24            35,591
M3            385,994
M34            99,994
MX4             5,168
M4            260,116
Total       1,155,965

The wordcounts for the individual files are given in the table below. The information in the table is also contained in the file WORDCOUNT-PPCME2 in the info directory for the PPCME2. The file contains only ASCII characters and is suitable for importing into a spreadsheet program; each line holds one record, and the fields within a record are separated by space characters.

cmaelr3.m23          17013
cmaelr4.m4           11181
cmancriw-1.m1        48566
cmancriw-2.m1        15224
cmastro.m3            6897
cmayenbi.m2          45944
cmbenrul.m3          18221
cmboeth.m3           10415
cmbrut3.m3           49099
cmcapchr.m4          52716
cmcapser.m4           1469
cmcloud.m3           15599
cmctmeli.m3          17005
cmctpars.m3          30416
cmearlps.m2          44521
cmedmund.m4           3847
cmedthor.m34         13949
cmedvern.m3          12843
cmequato.m3           6261
cmfitzja.m4           5652
cmgaytry.m34          5238
cmgregor.m4          37326
cmhali.m1             8495
cmhilton.m34          4963
cmhorses.m3           5902
cminnoce.m4           4329
cmjulia.m1            6810
cmjulnor.m34          5004
cmkathe.m1            8699
cmkempe.m4           60212
cmkentho.m1           4048
cmkentse.m2           3534
cmlamb1.m1            6459
cmlambx1.mx1         20752
cmmalory.m4          57775
cmmandev.m3          49690
cmmarga.m1            8069
cmmirk.m34           57548
cmntest.m3           11081
cmorm.po.m1          50579
cmotest.m3           10015
cmpeterb.m1           6757
cmpolych.m3          46444
cmpurvey.m3          39704
cmreynar.m4           8850
cmreynes.m4           9100
cmrollep.m24         17960
cmrolltr.m24         17631
cmroyal.m34           6231
cmsawles.m1           4111
cmsiege.m4            7659
cmthorn.mx4           5168
cmtrinit.mx1         41844
cmvices1.m1          27677
cmvices4.m34          7061
cmwycser.m3          56402
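
The per-file counts can be aggregated by period to cross-check them against the period table. The following is a minimal sketch in Python, assuming that each record consists of a file name and a count separated by one or more spaces, as in the table above. The function names are hypothetical, and the sample embeds only four lines copied from the table rather than reading the actual WORDCOUNT-PPCME2 file, whose exact layout is assumed here to match.

```python
from collections import defaultdict

# Four records copied from the table above (not the complete file).
SAMPLE = """\
cmaelr3.m23          17013
cmrollep.m24         17960
cmrolltr.m24         17631
cmthorn.mx4           5168
"""


def parse_wordcounts(text):
    """Map each file name to its word count; blank lines are skipped."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # split() with no argument collapses runs of spaces between fields.
        name, count = line.split()
        counts[name] = int(count)
    return counts


def totals_by_period(counts):
    """Sum word counts per period, taking the period from the last extension."""
    totals = defaultdict(int)
    for name, n in counts.items():
        period = name.rsplit(".", 1)[1].upper()
        totals[period] += n
    return dict(totals)


if __name__ == "__main__":
    for period, total in sorted(totals_by_period(parse_wordcounts(SAMPLE)).items()):
        print(f"{period:<5}{total:>10,}")
```

For the three periods covered by the sample (M23, M24 and MX4), the sums agree with the figures in the table of word counts by period.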