Penn Parsed Corpora of Historical English

The University of Pennsylvania Linguistics Department is home to a long-running project to create syntactically annotated (parsed) corpora of historical English. The project is directed by Anthony Kroch, Professor of Linguistics; the research associate in charge of corpus annotation is Dr. Beatrice Santorini. The Middle English corpus was constructed by Dr. Ann Taylor, now research associate in charge of corpus annotation at the University of York, England.

The following corpora are available from the Penn project:

  1. The Penn-Helsinki Parsed Corpus of Middle English, second edition. (1.3 million words of running text)
  2. The Penn-Helsinki Parsed Corpus of Early Modern English. (1.8 million words of running text)
  3. The Penn Parsed Corpus of Modern British English, 1700-1914. (under construction)

The two currently released corpora are available together on a single CD-ROM, priced at US $300. Further information is available on our order page.

The Penn corpora are distributed with the search program CorpusSearch 2, written by Beth Randall and released as open-source software under the Mozilla Public License 1.1. In addition to being included on the Penn corpus CD-ROMs, CorpusSearch is freely downloadable from sourceforge.net.


Anthony Kroch
Department of Linguistics
University of Pennsylvania
Last modified: Wed Feb 9 15:22:00 EST 2005