PC-KIMMO RULE SPECIFICATION FOR TURKISH MORPHOLOGY

by

Kemal Oflazer
Bilkent University
Computer Engineering Department
Bilkent, ANKARA, 06533 TURKIYE

e-mail: ko@trbilun.bitnet
fax:    (90) 4 - 266-4126
tel:    (90) 4 - 266-4133

Last Updated: 1/12/93

You need to have a copy of the PC-KIMMO software in order to make sense
of this specification. The Unix version of PC-KIMMO is available from the
New Mexico State University / Consortium for Lexical Research Archives by
anonymous ftp.

    machine:   scheria.nmsu.edu
    directory: /pub/tools/ling-analysis/morphology/pc-kimmo
    file:      pckimmo108.tar.Z

This set of files contains the PC-KIMMO specification for Turkish.

We have adopted the following orthographic conventions. Normal text is
assumed to be written in lower case ASCII. Some upper case letters are
used to indicate the special characters of Turkish. These are:

    C - the voiceless palato-alveolar stop consonant (che)
    S - the voiceless palato-alveolar fricative consonant (she)
    I - the high back unrounded vowel
    O - the front round low vowel
    U - the front round high vowel
    G - the "soft-g", an underlying phoneme which disappears
        intervocalically (between two vowels). In syllable-final
        position this phoneme lengthens the preceding vowel. It is
        always realized in the surface orthography.

We use the following upper case letters in KIMMO rules:

    H - a generic high vowel which may be I, i, u or U on the surface
    A - a vowel which may be a or e on the surface
    Z - a lexical consonant which never drops in affixation and is
        realized as s on the surface (e.g. in +ZHz (+WITHOUT))
    D - a lexical consonant whose surface form defaults to d but may
        occasionally be realized as t
    K - a lexical consonant whose surface form is k and which is never
        modified in affixation (e.g. in erK (power))
    Y - a lexical consonant which deals with the non-standard
        affixation properties of words ending in -su (water) and of
        the word ne (what)
    E - an e which disappears on the surface, used to force vowel
        harmony for acronyms and foreign words
    Q - an a which disappears on the surface, used to force vowel
        harmony for acronyms and foreign words

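As an illustration of the surface orthographic conventions above (C, S, I,
O, U, G), here is a minimal Python sketch that converts between the ASCII
convention and ordinary Turkish spelling. It is not part of the
distribution. The function names to_turkish and to_ascii are hypothetical,
and the pairing of each upper case letter with a Unicode character is an
assumption read off the phonetic descriptions in the list. The sketch is
meant for surface forms only; lexical symbols such as H, A, D and Z are
left untouched.

```python
# Assumed pairing of the ASCII convention with Turkish letters,
# inferred from the phonetic descriptions in this file.
ASCII_TO_TURKISH = {
    "C": "ç",  # che
    "S": "ş",  # she
    "I": "ı",  # high back unrounded vowel
    "O": "ö",  # front round low vowel
    "U": "ü",  # front round high vowel
    "G": "ğ",  # soft-g
}
TURKISH_TO_ASCII = {v: k for k, v in ASCII_TO_TURKISH.items()}


def to_turkish(surface: str) -> str:
    """Rewrite a surface form in the ASCII convention with Turkish letters."""
    return "".join(ASCII_TO_TURKISH.get(ch, ch) for ch in surface)


def to_ascii(text: str) -> str:
    """Rewrite lower case Turkish text in the ASCII convention."""
    return "".join(TURKISH_TO_ASCII.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(to_turkish("alInmIS"))
    print(to_ascii("alınmış"))
```

Since normal text is assumed to be lower case, the inverse mapping only
handles lower case Turkish letters; capitalized input would need to be
lower-cased first.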
The following symbols are also used:

    % - denotes the long vowel which behaves like e in vowel harmony
        but is realized as a on the surface (hal (condition), hale
        (condition+DAT))
    & - denotes the long vowel which behaves like U in vowel harmony
        but is realized as u on the surface
    ^ - denotes the long vowel which behaves like O in vowel harmony
        but is realized as o on the surface. This is ALWAYS followed
        by an l and is only found in words of foreign origin. The
        actual phonetic process is the palatalization of the l. We
        chose not to deal with this as a palatalization process.
    $ - prefixes those vowels in certain roots which undergo ellipsis
        in affixation. It just provides context for rule application
        and disappears on the surface.
    ' - separates proper noun stems from any suffixes, for proper
        orthography. It is transparent to certain rules and opaque
        to others.

Since the .txt file is processed with KGEN, we have exploited the fact
that KGEN treats a*b* as actually (a*b*)*.

The final analysis of the input is presented as a LISP list of
attribute-value pairs. For example, the analyses of "alInmIS" are as
follows:

    al+Hn+ymHS
    ((*CAT* N)(*R* "al")(*POSS* 2SG)(*CONV* V "")(*TENSE1* NARR)(*AGR* 3SG))

    al+nHn+ymHS
    ((*CAT* N)(*R* "al")(*CASE* GEN)(*CONV* V "")(*TENSE1* NARR)(*AGR* 3SG))

    al$In+ymHS
    ((*CAT* N)(*R* "alIn")(*CONV* V "")(*TENSE1* NARR)(*AGR* 3SG))

    al+Hn+mHS
    ((*CAT* V)(*R* "al")(*VOICE* PASS)(*CONV* ADJ "mis"))

    al+Hn+mHS
    ((*CAT* V)(*R* "al")(*VOICE* PASS)(*TENSE* NARR)(*AGR* 3SG))

    alIn+mHS
    ((*CAT* V)(*R* "alIn")(*CONV* ADJ "mis"))

    alIn+mHS
    ((*CAT* V)(*R* "alIn")(*TENSE1* NARR)(*AGR* 3SG))

where

    *CAT*     is the syntactic category
    *R*       is the root
    *POSS*    is the possessive suffix
    *TENSE1* and *TENSE2* indicate tense suffixes
    *AGR*     is the verbal agreement
    *CONV*    is a syntactic category conversion
    *VOICE*   is the verbal voice
    *COMP*    is a compound verb
    *SENSE*   is the negation suffix
    *CASE*    is the case
    *AGR*     is plural/singular etc.
    *SUB*     indicates a subcategory
    *SF*      indicates a semantic feature

Most other keywords are self-explanatory. Not every morphological
breakdown is semantically plausible.

The analysis of the rather synthetic adverb

    osmanlIlaStIramayabileceklerimizdenmiSsinizcesine

("(behaving) as if you are one of those whom we will not be converting
into an Ottoman") is as follows:

    osmanlIlaStIramayabileceklerimizdenmiSsinizcesine
    osmanlI+lAS+DHr+yAmA+yAbil+yAcAk+lAr+HmHz+DAn+ymHS+ZHnHz+cAsHnA
    ((*CAT* N)(*R* "osmanlI")(*SUB* PROP)(*CONV* V "las")(*VOICE* CAUS)(*SENSE* NEGC)
     (*COMP* "yabil")(*CONV* ADJ "yacak")(*AGR* 3PL)(*POSS* 1PL)(*CASE* ABL)(*CONV* V "")
     (*TENSE1* NARR)(*AGR* 2PL)(*CONV* ADV "casina"))

The lexicons are found in the following files:

    morphotactics        -> turk.lex
    nominal roots        -> isim.lex
    some adjectives      -> sifat.lex
    compound nouns       -> birlesikisim.lex
    verbs                -> fiil.lex
    pronouns             -> zamir.lex
    proper nouns         -> ozelisim.lex
    connectives          -> baglac.lex
    questions            -> soru.lex
    exclamations         -> unlem.lex
    adverbs              -> zarf.lex
    special cases        -> ozeldurum.lex
    some acronyms        -> kisaltma.lex
    some technical jargon -> teknik.lex
    postpositions        -> ilgec.lex

The file words.uniq contains a large number of words extracted from
papers etc. and has been used for testing purposes.

There are about 24K roots in the root lexicons.

The .rul file is generated by the KGEN compiler.

The lexicon files do not contain much documentation. A paper describing
the whole specification is in progress and will be available soon.

Any improvements and bug reports are welcome.

This specification is in the public domain but may not be used for
commercial purposes. This file must accompany any redistribution.

Kemal Oflazer
Bilkent University
Computer Engineering Department
Bilkent, ANKARA, 06533 TURKIYE

e-mail: ko@trbilun.bitnet
fax:    (90) 4 - 266-4126
tel:    (90) 4 - 266-4133
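
The attribute-value lists shown in the example analyses can be read by
other programs with very little code. The following Python sketch is an
illustration only and is not part of the distribution. The name
parse_analysis is hypothetical, and the sketch assumes that every pair has
the flat shape (*KEY* value ...) with no nested lists, as in the examples
in this file. It returns a list rather than a dictionary because keys such
as *CONV* and *AGR* can occur more than once in one analysis.

```python
import re

# One pair looks like (*KEY* value ...); values are bare symbols or
# double-quoted strings. Nested lists are assumed not to occur.
PAIR = re.compile(r'\(\*([A-Z0-9]+)\*\s*([^()]*)\)')
TOKEN = re.compile(r'"([^"]*)"|(\S+)')


def parse_analysis(text):
    """Return the (key, values) pairs of one analysis, in order."""
    pairs = []
    for key, body in PAIR.findall(text):
        # Each token matches either the quoted group or the symbol group;
        # quotes are dropped, so "" becomes an empty string.
        values = [sym if sym else quoted for quoted, sym in TOKEN.findall(body)]
        pairs.append((key, values))
    return pairs


if __name__ == "__main__":
    line = '((*CAT* N)(*R* "al")(*POSS* 2SG)(*CONV* V "")(*TENSE1* NARR)(*AGR* 3SG))'
    for key, values in parse_analysis(line):
        print(key, values)
```

Dropping the quotes loses the distinction between LISP strings and
symbols; a caller that needs it would have to keep the raw tokens instead.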