Information on the KOLHAPUR CORPUS
WORDCRUNCHER VERSION

G. Leitner, Berlin, July 1991

The WordCruncher version is based on the original version, distributed through the Norwegian Computing Centre for the Humanities, Bergen. We have made the following changes:

1. The line-based version (80 characters per line) has been converted to a sentence-based one. Sentences begin, as required by WordCruncher, with |P and represent the third (lowest) hierarchical level of analysis.

2. The original marking of (simple or composite) texts (e.g. **[txt.a01**]) has been deleted and replaced by first and second level hierarchical codes. Thus, **[txt.a01**] has become |EA01 (i.e. indicator of second level).

3. Paragraph markings (and new lines in multi-line headings), marked by $ in the original version, have been changed into ASCII 183, i.e. ·. This allows researchers to look into paragraph structures and textual cohesion.

4. Headlines, marked by **< and **> (for major headings) and by *< and *> (for minor headings) in the original version, have been converted into sentences, but the original markings have been retained. They can be seen in the WordCruncher version but cannot be retrieved through the dictionary (see (10) below).

5. The following commentaries on texts, which were in the original version, have been deleted so as not to falsify the word and sentence counts:

      * #          end of corpus text
      ** #         end of corpus
      *_           marking for original $ (replaced by the $ sign)
      ** [ . **]   begin/end comment tag, including the comment

   Other changes:

      *'  -->  `
      *_  -->  ^
      *+  -->  œ
      *@  -->  ø

6. Changes have been made so as to allow direct access to Indian and foreign language materials, as well as to abbreviations. The original version had compound codes preceding the material in question. We deleted them and added characters directly at the end of the material. Thus,

      ALT 181  µ  stands for Indian word
      ALT 182  ¶  stands for foreign word
      ALT 184  ¸  stands for abbreviations

7. Indian expressions, introduced with *5 and ended with *6, are replaced by ALT 185 (¹) and ALT 186 (º) respectively.

8. Foreign expressions, introduced with *8 and ended with *9, are replaced by ALT 187 (») and ALT 188 (¼) respectively.

9. Phonetic codings inside Indian words (e.g. *?1 for macrons, *?2 for acute accents) have been deleted so as to avoid the splitting up of words into several parts as a result of indexing.

10. As far as indexing is concerned, we suggest the following:

   a) Special codes

      ALT 181  µ  Indian word           Delim lower
      ALT 182  ¶  foreign word          Delim lower
      ALT 183  ·  paragraph marker      Delim lower
      ALT 184  ¸  abbreviation          Lower case
      ALT 185  ¹  begin Indian expr.    Delim lower
      ALT 186  º  end Indian expr.      Delim lower
      ALT 187  »  begin foreign expr.   Delim lower
      ALT 188  ¼  end foreign expr.     Delim lower

   b) All letters keep the default definitions of Index Etc.
   c) Numbers must be ignored.
   d) * must be ignored.
   e) Punctuation marks (except full stop) are to be defined as Delim lower.
   f) Full stop is defined as apostrophe.
   g) Hyphen and apostrophe are to be defined as in the default definition.

Note that decision f) guarantees that complex abbreviations (such as "e.g." or "U.S.A.") appear as one word, with the full stops, in the dictionary. But full stops, unlike all other punctuation marks (of e)), cannot be retrieved.

Note that the special treatment of ¸ in a) is necessary in order to retain the final stop in abbreviations ending in a full stop (stops are treated as apostrophes). As a consequence, however, ¸ remains part of a word if it is preceded by upper or lower case characters (all letters), by a full stop (see f)), by a hyphen or by an apostrophe. It appears as a separate entry only if it is preceded by any of the other special codes listed in a), by characters which are ignored, or by such characters followed by a full stop.
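The code substitutions described in points 2, 3 and 7 to 9 can be illustrated with a short Python sketch. It is not the procedure that was actually used to produce the WordCruncher version. The function name convert, the regular expressions, and the assumption that a phonetic coding is always *? followed by a single digit are hypothetical. The special characters are written as code points 183 and 185 to 188, on the assumption that the file is read in a Latin-1 compatible encoding. Only a subset of the changes is covered; the sentence marker |P and the deleted commentaries are left out.

```python
import re

# Direct one-to-one substitutions taken from points 3, 7 and 8.
SIMPLE_REPLACEMENTS = [
    ("*5", chr(185)),  # begin Indian expression  -> ¹
    ("*6", chr(186)),  # end Indian expression    -> º
    ("*8", chr(187)),  # begin foreign expression -> »
    ("*9", chr(188)),  # end foreign expression   -> ¼
    ("$", chr(183)),   # paragraph marker         -> ·
]

# Text marker such as **[txt.a01**] (point 2); the pattern is an assumption.
TEXT_HEADER = re.compile(r"\*\*\[txt\.([a-z])(\d+)\*\*\]")

# Phonetic codings such as *?1 or *?2 (point 9); a single digit is assumed.
PHONETIC = re.compile(r"\*\?\d")


def convert(text):
    """Apply a subset of the documented code changes to a string."""
    # **[txt.a01**] becomes |EA01 (second-level hierarchical code).
    text = TEXT_HEADER.sub(
        lambda m: "|E" + m.group(1).upper() + m.group(2), text
    )
    # Phonetic codings are deleted so that words are not split up.
    text = PHONETIC.sub("", text)
    for old, new in SIMPLE_REPLACEMENTS:
        text = text.replace(old, new)
    return text
```

For example, convert("**[txt.a01**] $ *5ra*?1ja*6") replaces the text marker, the paragraph sign and the expression codes, and removes the phonetic coding inside the word.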
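The effect of the indexing suggestions in point 10 on the abbreviation code ¸ can be made concrete with a simplified Python model. This is a sketch inferred from the two notes above, not WordCruncher's actual indexing algorithm. The function name index_entries and the class names are hypothetical. The model assumes that a "Delim lower" character ends the current word and becomes an entry of its own, that ignored characters and white space end the current word without producing an entry, and that full stop, hyphen and apostrophe are kept only when they stand inside a word.

```python
# Special codes defined as "Delim lower" in point 10 a).
DELIM_CODES = {chr(c) for c in (181, 182, 183, 185, 186, 187, 188)}
ABBREV = chr(184)          # ¸, defined as "Lower case", i.e. like a letter
JOINERS = ".-'"            # full stop (as apostrophe), hyphen, apostrophe


def index_entries(text):
    """Return the dictionary entries of text under the simplified model."""
    entries = []
    word = ""

    def flush():
        nonlocal word
        # Assumption: joiners are retained only inside a word, so
        # trailing ones are dropped when the word ends.
        cleaned = word.rstrip(JOINERS)
        if cleaned:
            entries.append(cleaned)
        word = ""

    for ch in text:
        if ch in DELIM_CODES:
            flush()
            entries.append(ch)       # special code: separate entry
        elif ch == ABBREV or ch.isalpha():
            word += ch               # letters and ¸ build words
        elif ch in JOINERS:
            if word:                 # kept only if a word is in progress
                word += ch
        elif ch.isdigit() or ch == "*" or ch.isspace():
            flush()                  # ignored: no entry of its own
        else:
            flush()
            entries.append(ch)       # other punctuation: Delim lower
    flush()
    return entries
```

Under these assumptions, "U.S.A." followed by ¸ yields the single entry U.S.A.¸ with its final stop, while the same abbreviation without ¸ loses the final stop. A ¸ that follows ignored characters and a full stop, as in "1991." followed by ¸, comes out as a separate entry, which is the behaviour described in the second note.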