Estonian corpus
===============

(1) Character coding info

We have used extended ASCII to code some special Estonian characters. They are:

a Umlaut - code 132, A Umlaut - code 142,
o Umlaut - code 148, O Umlaut - code 153,
u Umlaut - code 129, U Umlaut - code 154 (all as in German);
an "o" with a "^" over it - code 147.

Then
"Õ" means a capital "O" with a "~" over it,
"&z;" means a "z" with a "v" over it,
"&Z;" means a capital "Z" with a "v" over it,
"&s;" means an "s" with a "v" over it,
"&S;" means a capital "S" with a "v" over it,
"&c;" means a "c" with a "v" over it,
"&C;" means a capital "C" with a "v" over it.
"&t;" occurs once (originally t^) - it may be an error.

(2) Documentation supplied by Turku

MANUAL OF ENCODING TO ACCOMPANY THE TARTU COMPUTER CORPUS OF THE CONTEMPORARY ESTONIAN LANGUAGE

Translated from Estonian by Madis Saluveer
Tartu
April 1992

The following pages present the coding manual used during the editing and computer mark-up of the Estonian texts. This manual is based on the "Manual of Information" for the LOB corpus by S. Johansson, G. Leech and H. Goodluck and was updated according to the requirements of the Text Encoding Initiative TEI P1 Draft Version 1.0.

Coding manual: Table of contents

1 Organisation of the material
2 Coding key
3 Textual material included/excluded
4 Spacing and ordering
5 Typographic shifts
6 Headings
7 Symbols denoting paragraphs and lines
8 Sentence mark-up
9 Direct speech
10 Quotations
11 Capitalisation and proper names
12 Lists
13 Foreign language material
14 Other non-contemporary material
15 Abbreviations
16 Hyphen and dash
17 Mathematical expressions
18 Typographical errors

1 Organisation of the material

1.1. The corpus starts with the marker for the first text and ends with the end-of-corpus symbol.
1.2. The length of a text sample is 2,000 + 200 words. The sample text must be a comparatively integral whole.
A word is defined orthographically as a character or sequence of characters surrounded by blank spaces (and including no blank spaces).
1.3. Each corpus text starts with a tag giving its absolute number and category number and ends with the end-of-text symbol and a figure giving the number of words in the text.
1.4. Each corpus text is preceded by a standardised data description. The data description is marked by the tag which is organised as follows: <title>
1.5. Only those blank spaces in the file description which can be supplied for a given text are filled in. The others remain unfilled.
1.6. Some data are permanent, ie the same for all texts. They include:

2 Coding key

2.1. Alphanumeric characters represent themselves (including o, a, o, u)
A = A B = B C = C 1 = 1
a = a b = b c = c 2 = 2, etc.
except for = oe = ae = ss z = z^ s = s^ O = O^
Some other foreign letters untypical for Estonian are marked the same way, eg r = r^, e , etc.
2.2. Other symbols
In general, punctuation marks and other symbols used for text delimitation represent themselves:
, = , . = . ? = ? / = / ( = ( etc.
It should be noted:
- = hyphen or minus is encoded without blank spaces
- = dash is preceded and followed by a blank space
" " = the begin and end quotation marks are excluded and substituted by the markers and
2.3. As regards numerical expressions, see 17.4.
2.4. Words are not broken at the end of a line.
2.5. All changes in texts are marked by English words or letters included within the symbol < > where < > denotes the beginning of the change and denotes its end:

= begin new paragraph or line

= end of paragraph
= begin sentence
= end of sentence, etc.
2.6. The coding markers may be simple (consisting of only one main symbol) or compound (the attribute following the main symbol represents an aspect or value of the main symbol). Attributes and their values may vary and will be presented in the corresponding subparts of the Manual. Attributes are separated from their main symbol by a blank space. The value of an attribute is presented after the = symbol without a blank space: .

3 Textual material included/excluded

3.1. Sentences used as "tantalizers" at the beginning of prose texts are included. They are preceded by a paragraph marker

and followed by the end-of-paragraph marker

. The same applies to "story so far" summaries.
3.2. Commentaries and notes extraneous to the source are omitted and represented by a comment tag .
3.3. The marker has the following attributes:
1) n denoting the number of the comment, eg
2) type denoting the type of the comment, eg
3) source identifying the author of the comment, eg or =comp(iler)> or =transcriber>, etc.
4) place indicating the locality of the comment in the text, eg or =end> or =inline> or =app(endix)1>
All the attributes will be encoded in the above order.
3.4. Long foreign quotations are excluded and represented by the tag in their original place. Foreign quotations up to 30 words are included. Even longer foreign quotations may be included if their exclusion would seriously hamper understanding of the text. The marker has the attribute "foreign" which identifies the language of the quotation, eg
3.5. Long poetry quotations are excluded and replaced by the marker in the original place in the text.
3.6. As regards mathematical expressions see 17.3.

4 Spacing and ordering

4.a. Spacing

4.1. A single space indicates a typographic word-boundary in the source text. A space may not occur within a word.
4.2. / occurring between words is followed by a space, except in the case of and/or (ja/voi) which is coded as a single word. No space is inserted after / in numerical expressions.
4.3. A space follows the punctuation marks . , ; : ? ! as in printed texts. These marks are not preceded by a space.
4.4. A dash is preceded and followed by a single space as in printed texts. A hyphen is neither preceded nor followed by a space.
4.5. An opening bracket ( is preceded but not followed by a space. A closing bracket ) is followed but not preceded by a space.
4.6. A space is not used between markers and units denoted by them. A space separates markers and units not related to each other. No space occurs between two markers denoting the same unit.

4.b. Ordering

4.7.
If several markers are applied to one and the same word or phrase at the same point in the text then the order of the opening markers is as follows:
1) the typographical shift marker occurs before the proper name marker ,
2) the typographical shift marker precedes the sentence marker ,
3) the proper name marker precedes the abbreviation marker ,
All elements have been replaced by to be conformant with TEI 2 DTD
4) the sentence marker precedes the quotation marks marker ,
5) the quotation marks marker precedes the quotation marker .
For example, USA
4.8. If several markers are applied to one and the same word or phrase at the same point in the text then the ordering of opening and closing markers must be symmetrical, ie the leftmost opening marker must be the same as the rightmost closing marker. In other words, the order of opening and closing markers is reversed, eg Ivanov
4.9. If several markers occur at the same point in the text and their closing order is not the same then they should be placed so that the markers denoting longer units will remain outside the markers denoting shorter units, eg
See kohta palusime raakida Harri Aguraiujal Tanavusel ERKI moeshow'l oli olulisel kohal

ABSURD.
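
The symmetry requirement in 4.8 lends itself to a mechanical check. The following is a minimal Python sketch under stated assumptions: markers are taken to be written as <name ...> and </name>, and the marker names in the usage lines (hi, name) are placeholders chosen for illustration, not names confirmed by this manual.

```python
import re

# Matches an opening or closing marker such as <name type=person> or </name>.
# Assumption: markers use angle brackets with an optional leading slash.
MARKER = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>")


def check_nesting(text):
    """Return a list of problems found in the nesting of markers in `text`."""
    problems = []
    open_markers = []  # stack of marker names that are currently open
    for match in MARKER.finditer(text):
        is_closing = match.group(1) == "/"
        name = match.group(2)
        if not is_closing:
            open_markers.append(name)
        elif not open_markers:
            problems.append(f"closing </{name}> has no opening marker")
        elif open_markers[-1] != name:
            # 4.8: the most recently opened marker must be closed first.
            problems.append(
                f"</{name}> closes while <{open_markers[-1]}> is still open"
            )
        else:
            open_markers.pop()
    problems.extend(f"<{name}> is never closed" for name in open_markers)
    return problems


if __name__ == "__main__":
    # Hypothetical marker names, used only to exercise the check.
    print(check_nesting("<hi><name>Ivanov</name></hi>"))  # symmetrical
    print(check_nesting("<hi><name>Ivanov</hi></name>"))  # crossed markers
```

A stack is used because the rule is exactly last-opened, first-closed; the sketch reports a crossed pair without trying to repair it.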

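Some of the spacing rules in 4.3 and 4.5 can be checked line by line. The sketch below is illustrative only and assumes plain text lines as input; it deliberately covers just the "no space before punctuation" and bracket rules, and it skips the ellipsis "..." because that may legitimately follow a space. The dash and hyphen rules of 4.4 are left out, since the two cannot be told apart from the character alone.

```python
import re

# Each rule pairs a pattern with the item of section 4 it is meant to reflect.
SPACING_RULES = [
    # A space directly before . , ; : ? ! (but not before an ellipsis).
    (re.compile(r" (?!\.\.\.)[.,;:?!]"), "4.3: punctuation mark preceded by a space"),
    (re.compile(r"\( "), "4.5: opening bracket followed by a space"),
    (re.compile(r" \)"), "4.5: closing bracket preceded by a space"),
]


def spacing_problems(line):
    """Return the messages of all spacing rules that `line` violates."""
    return [message for pattern, message in SPACING_RULES if pattern.search(line)]


if __name__ == "__main__":
    for sample in ["Tere , maailm", "( nii )", "kirjanikud ... on ikka (vist) nii."]:
        print(repr(sample), spacing_problems(sample))
```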
5 Typographic shifts

5.1. The beginning of a typographic shift in the text is denoted by the marker and its end by .
5.2. The shift type is indicated with the help of attribute "rendition". "Rendition" may have the value: italic, spaced, sans, bold, underscored, etc., eg
5.3. The symbol of typographic shift may occur within a word, eg Sona anti...
5.4. The symbols . , : ? ! and other punctuation marks are regarded as neutral between different shift types. The end-of-sentence punctuation mark occurs before the typographic shift end marker when the whole sentence is in the corresponding shift.

6 Headings

6.1. The headline is preceded by the marker and followed by the marker . The markers are placed on a separate line.
6.2. In the source text headings are sometimes placed not at the beginning of a text but somewhere inside the text. In the corpus they are appropriately repositioned at the beginning of the text.
6.4. In the original, headings may be characterised by special typographic features (only capital letters, figurative shifts, etc.). Such headlines are normalised to an initial capital letter followed by lower-case letters.
6.5. Other editorial headings (eg "continued on p. 3", "Next week:...", etc.) are excluded.
6.6. The occurrence of "Editor" or the name of the author at the end of the article is coded with the heading markers.

7 Symbols denoting paragraphs and lines

7.1. The beginning of a new paragraph is indicated by the symbol

placed on a separate line before the paragraph. The end of a paragraph is indicated by the symbol

placed on a separate line.
7.2. A paragraph is always indented.
7.3. The paragraph marker is also used:
a) to denote the lines of a poem
b) to denote breaks in the text, such as a two-line space, asterisks, etc.
c) to mark sub-headlines
d) to indicate "tantalisers" at the beginning or within a text (eg words indicated by a typographical shift in a newspaper, etc.).

8 Sentence mark-up

8.1. The beginning of a sentence is indicated by the marker , the end - by . Both markers are placed on the same line with the sentence.
8.2. The marker indicating the beginning of a sentence is not used at the beginning of a headline. If a heading consists of more than one sentence then the second sentence is marked by at the beginning and at the end.
8.3. Sentences are separated by the punctuation marks . ? ! ... . A semicolon is never considered a punctuation mark separating sentences.

9 Direct speech

9.1. The markers and are used to indicate direct speech. The quotation marks used in the original are excluded.
9.2. When the reporting clause precedes or follows direct quotations/direct speech both of them are marked by separate sentence markers. In addition, the reporting clause and the direct speech together are marked as one sentence:
Isa kusis akki:Kas tahate homme kalale tulla?
Sundis poiss,lausus Andres.
9.3. When the reporting clause is placed in the middle of direct speech consisting of one sentence, the whole is marked with sentence markers and direct speech markers. The reporting clause is separated from the rest of the sentence with markers and
Aga, huudis Maie, kes siis niiviisi tohib teha!
9.4. When the introductory verb is placed between two sentences in direct speech then all three form separate sentences marked by the sentence markers. In addition the whole is marked with the sentence marker:
Oh! huudis Maie: Kull on tore!
9.5.
An included quotation is separated from the preceding and following reporting clause with sentence markers and the reporting clause is considered as one sentence whose end-of-sentence marker comes at the end of the second half of the reporting clause:
Kui tudruk utles:Seda poleks ehk tarvis olnud;vastas Jaan kindlalt: Oli tarvis
9.6. When the name of the speaker is not mentioned in the reporting clause then the speaker is marked as follows:
Kui tudrukutles:Seda poleks ehk tarvis olnud;vastas Jaan kindlalt:Oli tarvis.
9.7. In two special cases the presence of quotation marks is denoted but their essence has to be specified by the corpus user in the future:
a) words or phrases of the meta-level
Lause Tana on ilus ilm! on oeldistaiteline olemislause.
Sona staadion kaandub nagu seminar.
b) irony, transfer of meaning, etc.:
Tuli aeg, mil mulle suruti pahe kuninga kubar, mil pidin teenima Vene tsaari.

10 Quotations

10.1. Text excerpts of another author directly indicated as such are considered as quotations.
10.2. A quotation is marked with at its beginning and at its end.
Filoloogiadoktor Aarne Vinkel kirjutab: Liivi luulekeele rahvalikkus avaldub nii sellle sonavaras, lauseehituses kui ka varsi uldises intonatsioonis. Kohati arendab Liiv stroofide kaupa rahvalaulu-laadilist varsket niidistikku.
10.3. When a quotation begins in the middle of a sentence or forms an inseparable part of the reporting clause no sentence marker is used within the quotation:
Fr. Tuglas on selle kohta oelnud, et koik meie paremad kirjanikud ... on ikka midagi pooldanud, idealiseerinud voi polanud.
10.4. When quotations extend over two or more paragraphs there are opening quotation marks at the beginning of each paragraph but closing quotation marks only at the very end of the quoted section. In such cases the marker is used at the absolute beginning of the quotation and at its absolute end.
10.5.
Parts of a quotation or dialogue containing markedly deviant usage from standard language are denoted by special markers (see 13-14).

11 Capitalisation and Proper Names

11.1. Continuing capitalisation is not indicated separately, eg USA HOIATAB NSV LIITU...
11.2. The marker is used to denote proper names. It may have a specifying attribute "type".
11.3. The attribute "type" may have values person, place, institution, product, event:
Jaak Mand tootab Tallinnas TTUs.
11.4. When the user cannot specify the value of the attribute "type" it is advisable to use the marker only.
11.5. The markers coding proper names should cover a proper name as widely as possible, and pay no attention to its definition:
koondise "Marat" 2. tsehhi ametiuhingukomitee esimees
11.6. In case of contracted constructions where one main word has several attributes the whole construction should be inserted between common markers, eg:
Tartu 1., 2., 3. ja 5. Keskkool
11.7. When a proper name has been turned into a lower case attribute it will not be marked as a proper name:
Toi talle brusseli kapsast.
11.8. When a lower case foreign word such as de or von occurs as part of a name it will be enclosed within the proper name markers:
President de Gaulle

12 Lists

12.1. The beginning of a list is marked with the marker , the end - which are placed on a separate line.
12.2. When a list is ordered (numerically, alphabetically) the marker will have an attribute "type=ordered".
12.3. Every item of a list will be preceded by the marker and followed by the marker and placed on a separate line. The dash and numbers at the beginning of a list line will be included between markers. For example, the list
1) 2 muna
2)1 liiter piima
will be marked as follows:
1)2 muna
2)1 liiter piima

13 Foreign language material

13.1. Words, expressions, sentences and paragraphs in a foreign language will be marked only when they are indicated by means of italics, boldface, capitalisation or any other typographic means.
13.2.
The foreign words and expressions will be preceded by the marker and followed by the marker .
13.3. When possible the language of the foreign word is indicated with the help of an attribute "lang=", eg The attribute has 3 digits to indicate the language by its English name.
13.4. Foreign titles of books, operas, etc are marked unless they consist simply of a name: La Traviata, but: Fidelio
13.5. As regards foreign abbreviations, see 15.5.

14 Other non-contemporary material

14.1. Material deviating from modern literary Estonian is marked with the markers and
14.2. This marker may have an attribute "type" with the values old, spont(aneous), dial(ectal), slang, etc.
14.3. Parts of a dialogue containing occasional non-standard characteristics (double negation, wrong government, etc.) will not be marked with this marker.
14.4. Quotations from sources published before 1918 are marked with the attribute "type=old".
14.5. Dialectal usage not mentioned in "Oigekeelsussonaraamat" (Dictionary of ) but mentioned in "Murdesonastik" (Dictionary of Dialects) is marked with the symbol : tetta
14.6. Short forms indicated by means of an apostrophe are not marked, eg ma'i (=ma ei).
14.7. A sentence or paragraph containing the features of spontaneous oral presentation will be marked as a whole by means of the attribute "spont(aneous)", eg:
Tema ema pahandas:Mina ei taa!
14.8. Slang will be marked with the symbol .
14.9. Neologisms, nonce-forms, broken Estonian, etc. are denoted with the symbol , eg slaagivad=(laulavad)

15 Abbreviations

15.1. Abbreviations are coded by means of the markers and
15.2. The marker may have an attribute "full" (this attribute has been changed to "expan" for TEI2 conformance) representing the full form of a rare or difficult-to-understand abbreviation/acronym. The full form is presented in the nominative case:
Jaan on TTU tudeng.
15.3. Abbreviations are coded whether or not they end in an abbreviation point.
An abbreviation point will be followed by a closing marker. A case ending at the end of an abbreviation will be followed by the closing marker: U.S.A.
15.4. When an abbreviation and a name coincide then the abbreviation markers are placed closer to the marked word: USA
15.5. Foreign abbreviations are coded similarly to the Estonian ones.
15.6. Shortened forms are not marked as abbreviations: Mul' (pro mulle)
15.7. 13-nes, 12th, etc. are not marked as abbreviations.
15.8. Chemical formulae are marked similarly to other abbreviations by means of the symbol .
15.9. An abbreviation marker occurring in the middle of a word is marked as other abbreviations: 8s. = 8 shillings
15.10. If a sentence ends in an abbreviation and in one full stop which may be both a sentence point or an abbreviation point then the full stop will be included between the abbreviation markers.

16 Hyphen and dash

16.1. The hyphen (-) is used within a word and is neither preceded nor followed by a space.
16.2. The dash is preceded and followed by a space.
16.3. A line-end hyphen is not coded, except where the hyphen is part of the normal spelling of the word (a hyphen in the middle of a compound word, etc.). Where spelling practice varies with regard to hyphenation, a coding decision has to be made as to whether the line-end hyphen is preserved or not. The hyphen is preserved:
a) if dictionaries show that hyphenation is normal
b) if the word in question is hyphenated elsewhere in the same text.
16.4. - meaning "to" (as in 1632-1690) is coded as hyphen, ie not separated by spaces.

17 Mathematical expressions

17.1. Mathematical expressions are where possible coded as themselves, eg + = %.
17.2. - between numerical expressions represents "minus"
x in numerical expressions represents "multiplication sign"
/ in numerical expressions represents a divisor in fractions.
17.3. Numbers are presented without spaces between digits.
17.4. Numbers are coded by means of the symbol .
17.5.
The marker may have an attribute "type" having the values cardinal, ordinal, fraction, percentage, certainty, and also an attribute "value" giving the numerical value of a mathematical expression. In case of certainty the attribute "value" is not used:
sada viiskummend kuus
kolmekumne seitsmendal
1/2
kumned tuhanded
17.6. Dates are always presented without spaces between the digits: 8.-12.mail
17.7. Dates are marked by the symbols and
17.8. The tag always has an attribute "value=yyyy-mm-dd".
21. veebruar 1990
Partial dates can be expressed in the "value" attribute by omitting a part of the value supplied:
See toimus 3. augustil
Jaan suri 1940.a. detsembris
17.9. In case of a date range the attribute "value" may have fillers "from to", only "from" or only "to":
Soda kestis 1942. martsist oktoobrini.
17.10. When dates are not fixed the attribute "value" takes on the filler "certainty":
kolmekumnendate aastate lopus
17.11. Time expressions are denoted by the markers .
17.12. The tag