Information on the KOLHAPUR CORPUS WORDCRUNCHER VERSION

G. Leitner, Berlin, July 1991

The basis of the WordCruncher version is the original version,
distributed through the Norwegian Computing Centre for the
Humanities, Bergen. We have made the following changes:

1. The line-based version (80 characters per line) has been
converted to a sentence-based one. Sentences begin, as
requested by WordCruncher, with |P and represent the third
(lowest) hierarchical level of analysis.

2. The original marking of (simple or composite) texts (e.g.
**[txt.a01**]) has been deleted and replaced by first and
second level hierarchical codes. Thus, **[txt.a01**] has
become |EA01 (i.e. indicator of second level).
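
The conversion in item 2 can be sketched as follows. This is only
an illustration, not the procedure actually used: it assumes that
every text marker has the shape of the single example given above
(one category letter followed by two digits), and the names
TEXT_MARKER and convert_text_marker are hypothetical.

```python
import re

# Assumed shape of an original text marker, e.g. **[txt.a01**]:
# one category letter followed by a two-digit text number.
TEXT_MARKER = re.compile(r"\*\*\[txt\.([a-z])(\d{2})\*\*\]")


def convert_text_marker(line: str) -> str:
    """Replace **[txt.a01**] with the second-level code |EA01."""
    return TEXT_MARKER.sub(
        lambda m: "|E" + m.group(1).upper() + m.group(2), line
    )


print(convert_text_marker("**[txt.a01**]"))
```

Lines that contain no text marker are returned unchanged.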

3. Paragraph markings (and new lines in multi-line headings),
marked by $ in the original version, have been changed into
ASCII 183 (ALT 183). This allows researchers to look into
paragraph structures and textual cohesion.
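
As a rough illustration of how the paragraph marker might be used
outside WordCruncher, the sketch below splits raw corpus bytes on
byte value 183. It assumes the file is read in binary mode; the
function name split_paragraphs is hypothetical. Since the same
marker also stands for new lines in multi-line headings, some of
the pieces returned will be heading lines rather than paragraphs.

```python
# Byte value 183 is the paragraph marker described in item 3.
PARAGRAPH_MARKER = bytes([183])


def split_paragraphs(data: bytes) -> list:
    """Split raw corpus bytes on the marker and drop empty pieces."""
    pieces = (piece.strip() for piece in data.split(PARAGRAPH_MARKER))
    return [piece for piece in pieces if piece]


sample = b"First paragraph. \xb7 Second paragraph.\xb7"
print(split_paragraphs(sample))
```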

4. Headlines, marked by **< and **> (for major headings) and by
*< and *> (for minor headings) in the original version, have
been converted into sentences, but the original markings have
been retained. They can be seen in the WordCruncher version but
cannot be retrieved through the dictionary (see (7) below).

5. The following commentaries on texts, which were in the
original version, have been deleted so as not to distort the
word and sentence counts:

* # end of corpus text
** # end of corpus
*_ marking for original $ (replaced by the $ sign)
** [ . **] begin/end comment tag, including the comment

Other changes:

*'  -->  `
*_  -->  ^
*+  -->  
*@  -->   

6. Changes have been made so as to allow direct access to
Indian and foreign language materials, as well as to
abbreviations. The original version had compound codes
preceding the material in question. We deleted them and added
characters directly at the end of the material. Thus,

ALT 181  stands for Indian word

ALT 182  stands for foreign word

ALT 184  stands for abbreviations
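
Because the codes in item 6 follow the material they mark, tagged
items can be collected with a simple pattern search. The sketch
below is an assumption-laden illustration: it treats a tagged item
as a run of letters, full stops, hyphens and apostrophes directly
before the marker byte, and the names MARKERS and tagged_items are
hypothetical.

```python
import re

# Marker byte values from item 6.
MARKERS = {"indian": 181, "foreign": 182, "abbreviation": 184}


def tagged_items(data: bytes, kind: str) -> list:
    """Return the items that are directly followed by the marker."""
    marker = re.escape(bytes([MARKERS[kind]]))
    pattern = re.compile(rb"([A-Za-z.'\-]+)" + marker)
    return pattern.findall(data)


sample = b"a dhoti\xb5 and a croissant\xb6 from the U.S.A.\xb8"
print(tagged_items(sample, "indian"))
print(tagged_items(sample, "abbreviation"))
```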


7. Indian expressions, introduced with *5 and ended with *6,
are replaced by ALT 185 and ALT 186 respectively.

8. Foreign expressions, introduced with *8 and ended with *9,
are replaced by ALT 187 and ALT 188 respectively.
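
The replacements in items 7 and 8 amount to a small substitution
table. The following sketch is a naive illustration working on raw
bytes; it assumes that the sequences *5, *6, *8 and *9 never occur
as part of some other code, and the names EXPRESSION_CODES and
convert_expression_codes are hypothetical.

```python
# Original two-character codes and their replacement byte values,
# as listed in items 7 and 8.
EXPRESSION_CODES = {
    b"*5": bytes([185]),  # begin Indian expression
    b"*6": bytes([186]),  # end Indian expression
    b"*8": bytes([187]),  # begin foreign expression
    b"*9": bytes([188]),  # end foreign expression
}


def convert_expression_codes(data: bytes) -> bytes:
    """Replace each original code with its single-byte equivalent."""
    for old, new in EXPRESSION_CODES.items():
        data = data.replace(old, new)
    return data


print(convert_expression_codes(b"*5namaste ji*6 and *8c'est la vie*9"))
```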

9. Phonetic codings inside (Indian) words, e.g. *?1 for
macrons, *?2 for acute accents, have been deleted so as to
avoid the splitting up of words into several parts as a result
of indexing.
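
A minimal sketch of the deletion in item 9, assuming that every
phonetic coding consists of *? followed by a single digit (only
*?1 and *?2 are named above); the names PHONETIC_CODE and
strip_phonetic_codes are hypothetical.

```python
import re

# Assumed shape of a phonetic coding: "*?" plus one digit.
PHONETIC_CODE = re.compile(r"\*\?\d")


def strip_phonetic_codes(text: str) -> str:
    """Delete phonetic codings so that each word stays in one piece."""
    return PHONETIC_CODE.sub("", text)


print(strip_phonetic_codes("ra*?1ja"))
```

Without this step the * and the digit, which are ignored in
indexing, would sit in the middle of the word.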

10. As far as indexing is concerned, we suggest the following:

a) Special codes

ALT 181  Indian word: Delim lower

ALT 182  foreign word: Delim lower

ALT 183  paragraph marker: Delim lower

ALT 184  abbreviation: Lower case

ALT 185  begin Indian expr.: Delim lower

ALT 186  end Indian expr.: Delim lower

ALT 187  begin foreign expr.: Delim lower

ALT 188  end foreign expr.: Delim lower

b) All letters are treated as default definitions of Index
Etc.

c) Numbers must be ignored.

d) * must be ignored.

e) Punctuation marks (except full stop) are to be defined as
Delim lower.

f) Full stop is defined as apostrophe.

g) Hyphen and apostrophe are to be defined as in default
definition.

Note that decision f) guarantees that complex abbreviations
(such as e.g. or U.S.A.) appear as one word, with the full
stops, in the dictionary. But full stops, unlike all other
punctuation marks (see e)), cannot be retrieved.

Note that the special treatment of ALT 184 in a) is necessary
in order to retain the final stop in abbreviations ending in a
full stop (stops are treated as apostrophes). As a
consequence, however, ALT 184 remains part of a word if preceded by
upper/lower case characters (all letters), by full stop (see
f)), hyphen or apostrophe. It appears as a separate entry only
if it is preceded by any of the special codes listed, by
characters which are ignored, or by such characters followed
by a full stop.
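
The suggestions in item 10 can be summarised as a lookup from
character to index definition. The sketch below only restates the
list a) to g) in code form; it does not reproduce WordCruncher's
actual indexing behaviour, the label "unspecified" is used for
anything the list does not mention, and the names SPECIAL_CODES
and index_class are hypothetical.

```python
import string

# a) Special codes and their suggested definitions.
SPECIAL_CODES = {
    181: "Delim lower",  # Indian word
    182: "Delim lower",  # foreign word
    183: "Delim lower",  # paragraph marker
    184: "Lower case",   # abbreviation
    185: "Delim lower",  # begin Indian expression
    186: "Delim lower",  # end Indian expression
    187: "Delim lower",  # begin foreign expression
    188: "Delim lower",  # end foreign expression
}


def index_class(byte: int) -> str:
    """Return the definition suggested in item 10 for one byte value."""
    if byte in SPECIAL_CODES:
        return SPECIAL_CODES[byte]
    ch = chr(byte)
    if ch in string.ascii_letters:
        return "default"      # b) letters keep the default definition
    if ch in string.digits or ch == "*":
        return "ignore"       # c) numbers and d) the asterisk
    if ch == ".":
        return "apostrophe"   # f) full stop is defined as apostrophe
    if ch in "-'":
        return "default"      # g) hyphen and apostrophe
    if ch in string.punctuation:
        return "Delim lower"  # e) all other punctuation marks
    return "unspecified"      # not covered by the list above


for value in (ord("."), ord(","), ord("7"), 184):
    print(value, index_class(value))
```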