Creating the Penn-Helsinki Parsed Corpora of Historical English

The Penn-Helsinki parsed corpora are created by interleaving standard automated natural language processing techniques with human correction. Similar techniques were used in the production of the Penn Treebank (Marcus et al. 1993).

First, a sample of text is tagged for part of speech (POS) by a human annotator. This tagged text is used to train an updated version of the Brill tagger (Brill 1993, Florian and Ngai 2001). The trained tagger is then run on the whole corpus to produce POS-tagged output. The automatically tagged output is then corrected by annotators using an Emacs-based interface tool developed in connection with the Penn Treebank, primarily by Robert MacIntyre. Error rates for the tagging of historical material vary between 10% and 20%.

Once the POS-tagged files are corrected, a training sample of parsed text is again created "by hand" and used to train a version of the Collins/Bikel statistical parser (Collins 1999, Bikel 2004), modified for treebank construction by Seth Kulick. The trained parser is then run on the POS-tagged corpus, and its output is corrected using MacIntyre's Emacs-based interface. The error rate in the automatically parsed output is approximately 20%.

The hand-corrected output is additionally checked for correctness and consistency using CorpusSearch 2 (Randall 2005-2007), a query language for parsed corpora. The release versions of the POS-tagged and untagged files of the corpus are derived from the final version of the parsed files. In addition to its use in corpus construction, CorpusSearch 2 allows users of parsed or POS-tagged corpora to retrieve syntactic configurations of interest.
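
The train, tag, and correct cycle described above can be illustrated with a minimal Python sketch. It is not the tooling used for the corpora: a toy most-frequent-tag tagger stands in for the Brill tagger, the tag labels and sample sentences are invented for illustration, and the error rate is assumed to mean the share of tokens whose automatic tag differs from the annotator's corrected tag (the text above does not define the measure).

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sample):
    """Learn each word's most frequent tag from a hand-tagged sample.

    This toy model is a stand-in for the trained Brill tagger.
    """
    counts = defaultdict(Counter)
    all_tags = Counter()
    for word, tag in tagged_sample:
        counts[word.lower()][tag] += 1
        all_tags[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    # Unknown words receive the most frequent tag in the sample.
    default_tag = all_tags.most_common(1)[0][0]
    return lexicon, default_tag


def tag_words(words, model):
    """Tag a list of words with the trained toy model."""
    lexicon, default_tag = model
    return [(w, lexicon.get(w.lower(), default_tag)) for w in words]


def error_rate(auto_tagged, corrected):
    """Share of tokens whose automatic tag differs from the corrected tag."""
    if len(auto_tagged) != len(corrected):
        raise ValueError("both taggings must cover the same tokens")
    if not corrected:
        return 0.0
    wrong = sum(1 for (_, a), (_, c) in zip(auto_tagged, corrected) if a != c)
    return wrong / len(corrected)


# Step 1: a small hand-tagged training sample (hypothetical tags and text).
hand_tagged = [
    ("the", "DET"), ("king", "NOUN"), ("rode", "VERB"), ("to", "PREP"),
    ("the", "DET"), ("town", "NOUN"), ("queen", "NOUN"),
]
model = train_unigram_tagger(hand_tagged)

# Step 2: run the trained tagger on untagged text.
new_words = ["The", "queen", "rode", "home", "to", "the", "town"]
auto = tag_words(new_words, model)

# Step 3: an annotator corrects the output; here only "home" is changed.
corrected = [(w, "ADV") if w == "home" else (w, t) for w, t in auto]

print(f"error rate: {error_rate(auto, corrected):.1%}")
```

In this sketch the corrected output would then become additional training data or the input to the next stage; the same compare-against-correction measure could be applied to parser output, although comparing trees needs a bracket-level rather than token-level comparison.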