Creating the Penn-Helsinki Parsed Corpora of Historical English

The Penn-Helsinki parsed corpora are created by interleaving standard automated natural language processing techniques with human correction. Similar techniques were used in the production of the Penn Treebank (Marcus et al. 1993).

First, a sample of text is tagged for part of speech (POS) by a human annotator. This tagged text is used to train an updated version of the Brill tagger (Brill 1993, Florian and Ngai 2001). The trained tagger is then run on the whole corpus to produce POS-tagged output, which annotators correct using an Emacs-based interface tool developed in connection with the Penn Treebank, primarily by Robert MacIntyre. Error rates for the automatic tagging of historical material vary between 10% and 20%.
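As a rough illustration of what a tagging error rate measures, the sketch below compares automatically tagged text with its hand-corrected version, token by token. It assumes a simple word/TAG format with whitespace-separated tokens; the function names, the sample sentence, and the tags are illustrative only and are not taken from the project's own tools.

```python
def parse_tagged(line):
    """Split a line of word/TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last slash so that a word containing "/" stays intact.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def tagging_error_rate(auto_line, gold_line):
    """Fraction of tokens whose automatic tag differs from the corrected tag."""
    auto = parse_tagged(auto_line)
    gold = parse_tagged(gold_line)
    if len(auto) != len(gold):
        raise ValueError("tokenization differs between the two versions")
    if not gold:
        return 0.0
    errors = sum(1 for (_, a), (_, g) in zip(auto, gold) if a != g)
    return errors / len(gold)


# Hypothetical example: the tagger mislabels one of five tokens.
auto = "The/D kyng/N was/BED wroth/N ./."
gold = "The/D kyng/N was/BED wroth/ADJ ./."
print(tagging_error_rate(auto, gold))  # 0.2
```

Under this measure, an error rate of 10% to 20% means that roughly one token in every five to ten needs its tag changed during correction.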

Once the POS-tagged files are corrected, a training sample of parsed text is likewise created by hand and used to train a version of the Collins/Bikel statistical parser (Collins 1999, Bikel 2004), modified for treebank construction by Seth Kulick. The trained parser is then run on the POS-tagged corpus, and its output is corrected using MacIntyre's Emacs-based interface. The error rate in the automatically parsed output is approximately 20%. The hand-corrected output is additionally checked for correctness and consistency using CorpusSearch 2 (Randall 2005-2007), a query language for parsed corpora. The release versions of the POS-tagged and untagged files of the corpus are derived from the final version of the parsed files.
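The last step, deriving POS-tagged and untagged text from a parsed file, can be sketched as follows. This is a minimal illustration, not the project's actual derivation script: it assumes Penn Treebank-style labeled bracketing in which each terminal has the form (TAG word), and the function names and sample tree are hypothetical. A real corpus file may also contain terminals that are not words (for example empty categories or metadata), which this sketch does not filter out.

```python
import re


def leaves(bracketed):
    """Return (tag, word) pairs for every terminal in a labeled bracketing."""
    # A terminal is "(TAG word)": a label followed by a word, with no nested brackets.
    return re.findall(r"\(([^\s()]+)\s+([^\s()]+)\)", bracketed)


def to_pos_tagged(bracketed):
    """Render the terminals as word/TAG tokens."""
    return " ".join(f"{word}/{tag}" for tag, word in leaves(bracketed))


def to_untagged(bracketed):
    """Render the terminals as plain words."""
    return " ".join(word for _, word in leaves(bracketed))


# Hypothetical parsed sentence.
tree = "(IP-MAT (NP-SBJ (D The) (N kyng)) (BED was) (ADJP (ADJ wroth)) (. .))"
print(to_pos_tagged(tree))  # The/D kyng/N was/BED wroth/ADJ ./.
print(to_untagged(tree))    # The kyng was wroth .
```

Deriving both outputs from the parsed file keeps the three versions consistent, since a correction made in the parse automatically carries over to the other two.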

In addition to its use in corpus construction, CorpusSearch 2 allows users of parsed or POS-tagged corpora to retrieve syntactic configurations of interest.
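To give a sense of what retrieving a syntactic configuration involves, the sketch below searches a tree for nodes that immediately dominate a given child. It is written in plain Python and does not reproduce CorpusSearch 2 or its query syntax; the tuple-based tree representation, the function names, and the sample sentence are all assumptions made for illustration.

```python
def subtrees(tree):
    """Yield every node of a tree given as (label, child, ...) tuples."""
    _, *children = tree
    yield tree
    for child in children:
        # Words are plain strings; only tuples are nodes.
        if isinstance(child, tuple):
            yield from subtrees(child)


def immediately_dominates(tree, parent_prefix, child_label):
    """Return nodes whose label starts with parent_prefix and that have
    a direct child labeled exactly child_label."""
    hits = []
    for node in subtrees(tree):
        label, *children = node
        if label.startswith(parent_prefix) and any(
            isinstance(c, tuple) and c[0] == child_label for c in children
        ):
            hits.append(node)
    return hits


# Hypothetical parsed sentence.
tree = ("IP-MAT",
        ("NP-SBJ", ("D", "The"), ("N", "kyng")),
        ("BED", "was"),
        ("ADJP", ("ADJ", "wroth")))

hits = immediately_dominates(tree, "IP", "NP-SBJ")
print([node[0] for node in hits])  # ['IP-MAT']
```

Matching the parent by label prefix lets one search cover related clause labels at once, while the child is matched exactly.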

References

  1. Bikel, Daniel. 2004. On the parameter space of generative lexicalized statistical parsing models. Ph.D. dissertation, Department of Computer Science, University of Pennsylvania.
  2. Brill, Eric. 1993. A corpus-based approach to language learning. Ph.D. dissertation, Department of Computer Science, University of Pennsylvania.
  3. Collins, Michael. 1999. Head-driven statistical models for natural language parsing. Ph.D. dissertation, Department of Computer Science, University of Pennsylvania.
  4. Florian, Radu, and Grace Ngai. 2001. Multidimensional transformation-based learning. In Proceedings of CoNLL '01, pp. 1-8. Toulouse.
  5. Marcus, Mitch, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, volume 19, number 2, pp. 313-330.
  6. Penn Treebank Project. http://www.cis.upenn.edu/~treebank
  7. Randall, Beth. 2005-2007. CorpusSearch 2. http://corpussearch.sourceforge.net

 

Citing the Penn-Helsinki Parsed Corpora of Historical English

Please cite the currently released corpora as follows:
  • Kroch, Anthony, and Ann Taylor. 2000. Penn-Helsinki Parsed Corpus of Middle English, second edition. http://www.ling.upenn.edu/hist-corpora/PPCME2-RELEASE-2/
  • Kroch, Anthony, Beatrice Santorini, and Lauren Delfs. 2004. Penn-Helsinki Parsed Corpus of Early Modern English. http://www.ling.upenn.edu/hist-corpora/PPCEME-RELEASE-1/