CorpusSearch has been used to search Old, Middle, and Modern English corpora, as well as corpora of Chinese, Korean and Yiddish. For
CorpusSearch (CS) to search a corpus, the corpus must meet the following formatting requirements:
- Every sentence in the corpus must be completely parsed; that is, every word must be labeled and must fall inside the outermost parentheses of some sentence.
- Phrasal and part-of-speech labels may not contain a space or any other whitespace character, nor may they begin with a digit.
- Constituents must be bracketed with parentheses -- "(" and ")" -- not with square brackets or other delimiters.
- Every sentence must have a "wrapper", that is, an unlabeled pair of parentheses surrounding the sentence.
Below is an example of a sentence bracketed in accordance with these guidelines, using the labels of the PPCME2 and PPCEME. Note that CorpusSearch is indifferent to the choice of phrasal and part-of-speech labels.
(
  (IP-MAT
    (ADVP-TMP (ADV Then))
    (NP-SBJ (D the) (N child))
    (VBD became)
    (ADJP (ADJR happier) (CONJ and) (ADJR happier))
    (E_S .))
)
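
The requirements listed above can be spot-checked before a corpus is handed to CorpusSearch. The following Python sketch is not part of CorpusSearch and does not reproduce its own checks; the names `parse`, `check_node`, `check_corpus` and `SAMPLE` are hypothetical and chosen only for illustration. It assumes that words and labels themselves contain no parentheses, and it reports unbalanced parentheses, labeled or missing wrappers, labels beginning with a digit, and words without a label of their own.

```python
import re

# Tokens are "(", ")", or any run of characters without whitespace or parentheses.
TOKEN = re.compile(r"\(|\)|[^\s()]+")

SAMPLE = """
(
  (IP-MAT
    (ADVP-TMP (ADV Then))
    (NP-SBJ (D the) (N child))
    (VBD became)
    (ADJP (ADJR happier) (CONJ and) (ADJR happier))
    (E_S .))
)
"""

def parse(text):
    """Turn bracketed text into nested lists; raise ValueError if unbalanced."""
    stack = [[]]
    for tok in TOKEN.findall(text):
        if tok == "(":
            node = []
            stack[-1].append(node)
            stack.append(node)
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unmatched ')'")
            stack.pop()
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unmatched '('")
    return stack[0]

def check_node(node, problems):
    """Check one labeled constituent and, recursively, its children."""
    if not node or not isinstance(node[0], str):
        problems.append("constituent without a label")
        return
    label, rest = node[0], node[1:]
    if label[0].isdigit():
        problems.append(f"label begins with a digit: {label}")
    if not rest:
        problems.append(f"empty constituent: {label}")
    words = [c for c in rest if isinstance(c, str)]
    subtrees = [c for c in rest if isinstance(c, list)]
    # A word needs a label of its own, so it must be the only child of its node.
    # A label containing a space also surfaces here, as an extra bare token.
    if words and (subtrees or len(words) > 1):
        problems.append(f"unlabeled word(s) under {label}: {' '.join(words)}")
    for sub in subtrees:
        check_node(sub, problems)

def check_corpus(text):
    """Return a list of problem descriptions; an empty list means none were found."""
    try:
        top = parse(text)
    except ValueError as err:
        return [str(err)]
    problems = []
    for item in top:
        if isinstance(item, str):
            problems.append(f"word outside any sentence: {item}")
            continue
        # The wrapper is unlabeled, so it may contain only bracketed constituents.
        if any(isinstance(c, str) for c in item):
            problems.append("wrapper is labeled or contains a bare word")
        for child in item:
            if isinstance(child, list):
                check_node(child, problems)
    return problems

print(check_corpus(SAMPLE))
```

For the example sentence the function returns an empty list, meaning that none of the checks fired. Because the sketch recognizes only parentheses as delimiters, a corpus bracketed with square brackets is reported as a series of words outside any sentence.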
For more information on corpus formatting for CorpusSearch see the
CorpusSearch Users Guide.