<!-- ECI Corpus File:   (tur01)                                             -->
<!--       Component:   (a)                                                 -->
<!-- Copyright 1994 ACL.                                                    -->
<!-- Material as supplied copyright Kemal Oflazer 1994, used with permission. -->
<!--                                                                        -->
<!-- Documentation for  PC-KIMMO RULE SPECIFICATION FOR TURKISH MORPHOLOGY  -->

<div1 type=file n="turkish.doc">
<head>
 PC-KIMMO RULE SPECIFICATION FOR TURKISH MORPHOLOGY
	by
	Kemal Oflazer

	Bilkent University
	Computer Engineering Department
	Bilkent, ANKARA, 06533 TURKIYE

	e-mail: ko@trbilun.bitnet
	fax: (90) 4 - 266-4126
	tel: (90) 4 - 266-4133

Last Updated: 1/12/93
</head>
<p>
You need to have a copy of the PC-KIMMO software in order to make
sense of this specification.  The Unix version of PC-KIMMO
is available from the New Mexico State University / Consortium for
Lexical Research Archives by anonymous ftp.

machine scheria.nmsu.edu
directory /pub/tools/ling-analysis/morphology/pc-kimmo
file pckimmo108.tar.Z


This set of files contains the PC-KIMMO specification
for Turkish. We have adopted the following orthographic
conventions.  Normal text is assumed to be written in
lower case ASCII. Some upper case letters are used to indicate
the special characters in Turkish. These are
	C - which represents the voiceless palato-alveolar stop consonant (che)
	S - which represents the voiceless palato-alveolar fricative consonant (she)
	I - which represents the high back non-round vowel
	O - which represents the front round low vowel
	U - which represents the front round high vowel
	G - which represents the "soft-g", an underlying phoneme
	    which disappears between two vowels.
	    In syllable-final position this phoneme lengthens the
	    preceding vowel. It is always realized in the surface
	    orthography.
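
The convention above can be undone mechanically for surface forms. The
following is a minimal sketch, assuming the six upper case letters map onto
the usual Turkish letters; the function name is hypothetical and is not part
of the PC-KIMMO distribution. It is meant for surface words only, since the
lexical symbols used in the rules (H, A, D, ...) are not handled.

```python
# Mapping from the ASCII convention to Turkish letters.
# The table follows the list above; the names below are illustrative only.
ASCII_TO_TURKISH = str.maketrans({
    "C": "ç",  # che
    "S": "ş",  # she
    "I": "ı",  # high back non-round vowel
    "O": "ö",  # front round low vowel
    "U": "ü",  # front round high vowel
    "G": "ğ",  # soft-g
})


def to_turkish_orthography(word: str) -> str:
    """Convert a surface word in the ASCII convention to Turkish letters."""
    return word.translate(ASCII_TO_TURKISH)


if __name__ == "__main__":
    print(to_turkish_orthography("alInmIS"))
```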

We use the following upper case letters in KIMMO rules.
	H - a generic high vowel which may be I, i, u or U on the surface
	A - a vowel which may be a or e on the surface
	Z - a lexical consonant which never drops in affixation
	    and is realized as s on the surface (e.g. in +ZHz (+WITHOUT))
	D - a lexical consonant whose surface form defaults to d but may
	    occasionally be realized as t.
	K - a lexical consonant whose surface form is k and is never modified
	    in affixation (e.g. in erK (power))
	Y - a lexical consonant which deals with the non-standard
	    affixation properties of words ending in -su (water)
	    and the word ne (what).
	E - an e which disappears on the surface, used to force
	    vowel harmony for acronyms and foreign words.
	Q - an a which disappears on the surface, used to force
	    vowel harmony for acronyms and foreign words.
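
To illustrate how H and A get their surface values, here is a minimal
sketch that resolves only these two symbols from the closest preceding
vowel, assuming standard Turkish vowel harmony (A follows backness, H follows
backness and rounding). It is not the rule set in the .rul file: consonant
symbols such as D, Z, K and Y, dropping consonants such as the y in +ymHS,
and the special symbols are all ignored, and the function name is
hypothetical.

```python
# Vowel classes in the ASCII convention used by this specification.
BACK = set("aIou")
FRONT = set("eiOU")
ROUND = set("ouOU")


def resolve_harmony(lexical: str) -> str:
    """Resolve A and H by vowel harmony; drop '+' morpheme boundaries.

    Simplified illustration only: every other lexical symbol is copied as is.
    """
    surface = []
    last_vowel = None  # most recent surface vowel seen so far
    for ch in lexical:
        if ch == "+":
            continue
        if ch == "A" and last_vowel is not None:
            ch = "a" if last_vowel in BACK else "e"
        elif ch == "H" and last_vowel is not None:
            if last_vowel in BACK:
                ch = "u" if last_vowel in ROUND else "I"
            else:
                ch = "U" if last_vowel in ROUND else "i"
        if ch in BACK or ch in FRONT:
            last_vowel = ch
        surface.append(ch)
    return "".join(surface)


if __name__ == "__main__":
    for form in ("al+Hn+mHS", "ev+lAr", "gOz+lHk"):
        print(form, "->", resolve_harmony(form))
```

With this sketch the lexical form al+Hn+mHS comes out as alInmIS, the
surface word used in the analysis example in this document.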

The following symbols are also used:
	 %  denotes the long vowel which behaves like e in vowel harmony
	    but is realized as a on the surface, e.g. hal (condition), hale (condition+DAT)
	 &amp;  denotes the long vowel which behaves like U in vowel harmony
	    but is realized as u on the surface
	 ^  denotes the long vowel which behaves like O in vowel harmony
	    but is realized as o on the surface. This is ALWAYS followed by
	    an l and is only found in words of foreign origin. The actual phonetic
	    process is the palatalization of the l. We chose not to deal
	    with this as a palatalization process.

	 $  prefixes those vowels in certain roots which undergo
	    ellipsis in affixation. It just provides context
	    for rule application and disappears on the surface.

	 '  separates proper noun stems from any suffixes
	    for proper orthography. It is transparent to certain
	    rules and opaque to others.
	
Since the .txt file is processed with KGEN, we have exploited the fact that
KGEN treats a*b* as (a*b*)*.
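
The difference between the two readings can be seen with ordinary regular
expressions. The sketch below uses Python's re module purely as an
illustration; it is not KGEN syntax and says nothing about how KGEN is
implemented.

```python
import re

# Usual reading: some a's followed by some b's, once.
plain = re.compile(r"a*b*")
# Reading described above: the whole a*b* group may repeat.
repeated = re.compile(r"(?:a*b*)*")

for s in ("aabb", "abab"):
    print(s, bool(plain.fullmatch(s)), bool(repeated.fullmatch(s)))
```

Under the usual reading "abab" is rejected, while under the repeated reading
it is accepted, so a rule context written as a*b* also covers interleaved
sequences.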

The final analysis of the input is presented as a LISP list
of attribute-value pairs. For example,
the analyses of "alInmIS" are as follows:

al+Hn+ymHS     ((*CAT* N)(*R* "al")(*POSS* 2SG)(*CONV* V "")(*TENSE1* NARR)(*AGR* 3SG))
al+nHn+ymHS    ((*CAT* N)(*R* "al")(*CASE* GEN)(*CONV* V "")(*TENSE1* NARR)(*AGR* 3SG))
al$In+ymHS     ((*CAT* N)(*R* "alIn")(*CONV* V "")(*TENSE1* NARR)(*AGR* 3SG))
al+Hn+mHS      ((*CAT* V)(*R* "al")(*VOICE* PASS)(*CONV* ADJ "mis"))
al+Hn+mHS      ((*CAT* V)(*R* "al")(*VOICE* PASS)(*TENSE* NARR)(*AGR* 3SG))
alIn+mHS       ((*CAT* V)(*R* "alIn")(*CONV* ADJ "mis"))
alIn+mHS       ((*CAT* V)(*R* "alIn")(*TENSE1* NARR)(*AGR* 3SG))


where *CAT* is the syntactic category
*R* is the root
*POSS* is the possessive suffix
*TENSE1* and *TENSE2* indicate tense suffixes
*AGR* is agreement (verbal agreement, plural/singular etc.)
*CONV* is a syntactic category conversion
*VOICE* is the verbal voice
*COMP* is a compound verb
*SENSE* is the negation suffix
*CASE* is the case
*SUB* indicates a subcategory
*SF* indicates a semantic feature

Most other keywords are self-explanatory.
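
For post-processing outside LISP, an analysis list can be read into
attribute-value pairs with a few lines of code. The following is a minimal
sketch, assuming each analysis has exactly the shape shown above (one outer
list containing flat inner lists); the function name is hypothetical. The
result is kept as an ordered list of pairs rather than a dictionary, because
a keyword such as *CONV* or *AGR* may occur more than once in one analysis.

```python
import re

# Tokens: parentheses, double-quoted strings, or bare symbols such as *CAT*.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')


def parse_analysis(text):
    """Parse ((*CAT* N)(*R* "al")...) into [(attribute, [values...]), ...]."""
    pairs = []
    depth = 0
    current = None  # items of the inner list being read
    for tok in TOKEN.findall(text):
        if tok == "(":
            depth += 1
            if depth == 2:
                current = []
        elif tok == ")":
            if depth == 2:
                pairs.append((current[0], current[1:]))
                current = None
            depth -= 1
        elif depth == 2:
            current.append(tok.strip('"'))  # drop the quotes around strings
    return pairs


if __name__ == "__main__":
    line = '((*CAT* N)(*R* "al")(*POSS* 2SG)(*CONV* V "")(*TENSE1* NARR)(*AGR* 3SG))'
    for attribute, values in parse_analysis(line):
        print(attribute, values)
```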

Not every morphological breakdown is necessarily semantically plausible.

The analysis for the rather synthetic adverb
 osmanlIlaStIramayabileceklerimizdenmiSsinizcesine
("(behaving) as if you are one of those whom we will not be converting into an Ottoman")
is as follows:

osmanlIlaStIramayabileceklerimizdenmiSsinizcesine
  osmanlI+lAS+DHr+yAmA+yAbil+yAcAk+lAr+HmHz+DAn+ymHS+ZHnHz+cAsHnA
((*CAT* N)(*R* "osmanlI")(*SUB* PROP)(*CONV* V "las")(*VOICE* CAUS)(*SENSE* NEGC)
 (*COMP* "yabil")(*CONV* ADJ "yacak")(*AGR* 3PL)(*POSS* 1PL)(*CASE* ABL)(*CONV* V "")
 (*TENSE1* NARR)(*AGR* 2PL)(*CONV* ADV "casina"))


The lexicons are found in the following files:

morphotactics -> turk.lex
nominal roots -> isim.lex
some adjectives -> sifat.lex
compound nouns -> birlesikisim.lex
verbs -> fiil.lex
pronouns -> zamir.lex
proper nouns -> ozelisim.lex
connectives -> baglac.lex
questions  -> soru.lex
exclamations -> unlem.lex
adverbs -> zarf.lex
special cases -> ozeldurum.lex
some acronyms -> kisaltma.lex
some technical jargon -> teknik.lex
postpositions -> ilgec.lex

The file words.uniq contains a large number of words extracted from
papers and other texts, and has been used for testing purposes.


There are about 24K roots in the root lexicons.

The .rul file is generated by the KGEN compiler.

The lexicon files do not contain much documentation. A paper describing the whole
specification is in progress and will be available soon.


Any improvements and bug reports are welcome.

This specification is in the public domain but may not be used for commercial
purposes. This file must accompany any redistribution.

Kemal Oflazer

</p>
</div1>