YU-CORPUS  (Serbo-Croatian text corpus)

  This is a text corpus consisting of approximately 700 000 words of Serbo-
Croatian. The texts are taken from modern (i.e. primarily post-World War II)
Yugoslav fiction and all Serbo-Croatian-speaking areas are represented:
Serbia, Croatia, Montenegro, and Bosnia-Hercegovina.

  The corpus was compiled by scanning books of fairly high printing quality
(one of the parameters of text selection, I must admit). My equipment was a
Macintosh computer with 4 MB of RAM and a (French) OCR program called
AutoREAD.

  Each file consists of prose work(s) by an author who can be identified
by the file name. All text files are zipped and must thus be transferred in
binary mode and unzipped before use. The files are of approximately equal
size, namely about 300 000 bytes/50 000 words.

  The texts are (when unzipped) plain 8-bit ASCII text. They are all
in the Latin alphabet - even when the book was printed in Cyrillic.
I use the texts with Nota Bene's word processor and text base facilities,
so the ASCII values of the special Serbo-Croatian characters follow the
Nota Bene standard which is:

  D w/stroke = ASCII 127 = \177 = &Dstroke;
  S w/hachek = ASCII 156 = \234 = &Scaron;
  s w/hachek = ASCII 157 = \235 = &scaron;
  C w/hachek = ASCII 220 = \334 = &Ccaron;
  c w/hachek = ASCII 221 = \335 = &ccaron;
  c w/acute  = ASCII 222 = \336 = &cacute;
  C w/acute  = ASCII 223 = \337 = &Cacute;
  d w/stroke = ASCII 235 = \353 = &dstroke;
  Z w/hachek = ASCII 241 = \361 = &Zcaron;
  z w/hachek = ASCII 242 = \362 = &zcaron;

	[ DMCK These non-ISO characters have been converted into the
	  SGML entities given above in the last column.
	  NB hachek has been translated as caron. Correct? ]
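
  For readers who still hold files in the original Nota Bene encoding, the
table above can be applied roughly as in the Python sketch below. The byte
values and entity names are copied from the table; the function name and the
decision to pass all other bytes through unchanged are assumptions made for
illustration, not a description of the conversion actually used.

```python
# Byte value -> SGML entity, copied from the table above.
NOTA_BENE_TO_ENTITY = {
    127: "&Dstroke;",
    156: "&Scaron;",
    157: "&scaron;",
    220: "&Ccaron;",
    221: "&ccaron;",
    222: "&cacute;",
    223: "&Cacute;",
    235: "&dstroke;",
    241: "&Zcaron;",
    242: "&zcaron;",
}


def nota_bene_to_entities(raw: bytes) -> str:
    """Replace the ten special bytes with their SGML entities.

    Assumption: every other byte is kept as the character with the
    same code point (hypothetical choice for this sketch).
    """
    return "".join(NOTA_BENE_TO_ENTITY.get(b, chr(b)) for b in raw)


if __name__ == "__main__":
    sample = bytes([156]) + b"ta ka" + bytes([242]) + b"e"
    print(nota_bene_to_entities(sample))
```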

  The only manipulation I'm guilty of is the splitting of long paragraphs.
I use paragraph markers as separators between entries, so when a paragraph
becomes longer than half a screen page, I divide it into smaller parts by
means of the sequence ASCII-179 CR ASCII-179. In Nota Bene's text base
system ASCII-179 is ignored (it is treated as a separation marker), but it
is displayed on-screen.
  Those who want to use these texts on different systems must be aware of
this fact and replace the sequence ASCII-179 CR ASCII-179 by CR (i.e. carriage
return) or SPACE (which gives the original text).

	[ DMCK ASCII-179 CR ASCII-179 sequences have been replaced by
	  spaces throughout, and the text has been reformatted. ]
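
  For files that still contain the split markers, the replacement described
above might look like the following Python sketch. It assumes that CR is
stored as the single byte 13 (not CR+LF) and that the file is read as raw
bytes; the function name is hypothetical.

```python
# ASCII-179, carriage return, ASCII-179: the paragraph-split marker.
SPLIT_MARKER = bytes([179, 13, 179])


def rejoin_split_paragraphs(raw: bytes, replacement: bytes = b" ") -> bytes:
    """Replace every split marker with `replacement`.

    The default (a space) restores the original running text;
    pass b"\r" to keep the shorter paragraphs instead.
    """
    return raw.replace(SPLIT_MARKER, replacement)


if __name__ == "__main__":
    sample = b"prvi dio" + SPLIT_MARKER + b"drugi dio"
    print(rejoin_split_paragraphs(sample))
```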

  The rest of this file describes the contents of the whole set of files
which are:

bozovic.eci          isakovic.eci          kapor.eci
krleza.eci           lalic.eci             marinkov.eci
mihailov.eci         nazor.eci             pavlicic.eci
savic.eci            selimov.eci           tisma.eci
novele1.eci          novele2.eci           antolog1.eci


  I am still working on adding new files to the corpus, and I hope to be
able to accomplish my first goal, 1 million words, within a year or so.

  All texts have been proofread only once, so I cannot guarantee that there
are no misspellings left. In my opinion, however, the texts are fully usable
as they are.
  If you find any misprints, please let me know!

  Have a good time!

  Henning Moerk
  Slavisk Institut
  Aarhus Universitet
  Ny Munkegade 116
  8000 Aarhus C

  tel:     +45 86 13 65 55
  fax:     +45 86 19 21 55
  e-mail:  slavhenn@aau.dk


NB: NO COMMERCIAL USE OF THIS CORPUS IS ALLOWED. USAGE IS
LIMITED TO RESEARCH PURPOSES.

==========================================================

-------------------------------------------
YU-CORPUS index  (yu-corp.txt):
-------------------------------------------
June, 1992:  4242310 bytes
              728952 words
-----------------------------------------
BOZOVIC.ECI

Sasa Bozovic:
Ratne ljubavi, Beograd 1985, s. 5-110
Tebi, moja Dolores, Beograd 1984, s. 9-95

bytes:       293917
entries:        919
keywords:     11437
words:        51189
-----------------------------------------
ISAKOVIC.ECI

Antonije Isakovic:
Tren 2 - Kazivanja Ceperku
Beograd (prosveta) 1983
s. 7-214; 224-241

Bytes:      305703
Entries:      2654
Keywords:    13553
Words:       52496
-----------------------------------------
KAPOR.ECI

Momo Kapor: Una
Zagreb (Znanje) 1983
s. 5-100

Momo Kapor: Zoe
Zagreb (Znanje) 1984
s. 5-141

Politika, 16. jul 1989, s.9: 011
Momo Kapor:
"Bre"

Politika,  28., 29. and 30. November: 011
Momo Kapor:
"Na dan tvog rodjenja"

Bytes:     292722
Entries:     1740
Keywords:   13966
Words:      47587
-----------------------------------------
KRLEZA.ECI

Sabrana djela Miroslava Krleze.
Svezak osmi: Novele
Zagreb (Zora) 1955

1. In extremis
2. Vjetrovi nad provincijalnim gradom
3. Smrt bludnice Marije
4. Smrt Rikarda Harlekinija

Sabrana djela Miroslava Krleze.
Svezak treci: Povratak Filipa Latinovicza, s.9-121
Zagreb (Zora) 1954

Bytes:       313461
Entries:       1050
Keywords:     14668
Words:        50734
-----------------------------------------
LALIC.ECI

Mihailo Lalic: Lelejska Gora
Beograd "Nolit" 1962
s. 7-136; 237-289

bytes:     303658
entries:     1625
keywords:   11376
words:      55386
-----------------------------------------
MARINKOV.ECI

Ranko Marinkovic: Ruke
Sarajevo "Svjetlost" 1964, s.5-187

Bytes:       288916
Entries:       1300
Keywords:     13174
Words:        49389
-----------------------------------------
MIHAILOV.ECI

Dragoslav Mihailovic: Uhvati zvezdu padalicu.
Pripovetke, Beograd 1984:
1. Pas
2. Barabe, konji i gegule
3. Ujka Dragi sedi pod jabukom
4. Sukar mesto
5. Trece prolece Svete Petronijevica

Dragoslav Mihailovic: Kad su cvetale tikve.
Roman, Beograd 1984
1. Gl. 1-24 (sve)

Bytes:        300856
Entries:        2045
Keywords:      11641
Words:         54252
-------------------------------------------
NAZOR.ECI

Pet stoljeca hrvatske knjizevnosti.
Vladimir Nazor II, Zagreb 1965
1. Veli Joze    (s.  7- 67)
2. Albus Kralj  (s. 68- 84)
3. Boskarina    (s. 85-100)
4. Otac         (s.101-122)
5. Voda         (s.154-183)

Bytes:       288903
Entries:       1877
Keywords:     12158
Words:        51252
-------------------------------------------
PAVLICIC.ECI

Pavao Pavlicic: Eter
Zagreb (Znanje) 1983, s.5-135

Pavao Pavlicic: Rakova djeca
Zagreb (Znanje) 1988, s.5-98

Bytes:      308394
Entries:      2181
Keywords:    10156
Words:       53094
-------------------------------------------
SAVIC.ECI

Milisav Savic: Mladici iz Raske. Pripovetke.
Beograd "Slovo Ljubve" 1977

Milisav Savic: Ujak nase varosi.
Beograd 1977
 1. Konj
 2. Vuk
 3. Mecka
 4. Kapetan Vuk
 5. Filip
 6. Odbrana varosi od vestica
 7. Uspon i pad naseg fudbalskog tima

Bytes:      309726
Entries:      1905
Keywords:    12957
Words:       52889
-------------------------------------------
SELIMOV.ECI

Mesa Selimovic: Dervis i smrt
Sarajevo (Svjetlost) 1968
Chapters 1-7 (pp. 9-157)

Bytes:      281283
Entries:      1331
Keywords:    10164
Words:       49565
-------------------------------------------
TISMA.ECI

Aleksandar Tisma: Upotreba coveka
Beograd "Nolit" 1977, s.7-188; 213-223

bytes:       325239
entries:        909
keywords:     15574
words:        53791
-------------------------------------------
NOVELE1.ECI

SJN = Suvremena jugoslavenska novela
Zagreb "Znanje" 1965

1. ASANOV (11083 bytes) =
 Sreten Asanovic: Gluho
 SJN, pp. 9-13

2. BULATOV (45934 bytes) =
 Miodrag Bulatovic: Insekti
 SJN, pp. 14-35

3. COLANOV (22237 bytes) =
 Voja Colanovic: Nezgoda
 SJN, pp. 42-51

4. DAVID (10545 bytes) =
 Filip David: Cuvaj se, Benjamine!
 SJN, pp. 52-56

5. DISOPRA (39955 bytes) =
 Nikola Disopra: Iskra zivota
 SJN, pp. 57-75

6. DJONOVIC (43762 bytes) =
 Marko Djonovic: Pokojnica i znak
 SJN, pp. 76-94

7. FETAHAG (8728 bytes) =
 Sead Fetahagic: Sretan svrsetak ljubavi
 SJN, pp. 95-99

8. IBRISIM (9183 bytes) =
 Nedzad Ibrisimovic:
 Zurno ispricao Samilov mladji sin
 SJN, pp. 120-124

9. KATUSIC (42098 bytes) =
 Ivan Katusic: Proresetano jedro
 SJN, pp. 131-151

10. KUSAN (21256 bytes) =
 Ivan Kusan: Veliki dan
 SJN, pp. 182-191

11. LAZIC (33166 bytes) =
 Zika Lazic: Nepoznata svitanja
 SJN, pp. 192-207


bytes:       288246
entries:       1379
keywords:     13834
words:        49145
-------------------------------------------
NOVELE2.ECI

SJN = Suvremena jugoslavenska novela
Zagreb "Znanje" 1965

1. MAJDAK (23747 bytes) =
 Zvonimir Majdak: Jedna noc
 SJN, pp. 208-218

2. MILANKOV (28528 bytes) =
 Momcilo Milankov: Pasians
 SJN, pp. 219-230

3. SLAMNIG (20300 bytes) =
 Ivan Slamnig: Prica o Zvjezdani
 SJN, pp. 354-363

4. TRIFKOV (22717 bytes) =
 Risto Trifkovic: Samo jedna bubuljica
 SJN, pp. 444-453

... (more to come)

bytes:       95293
entries:       295
keywords:     5516
words:       16329
-------------------------------------------
ANTOLOG1.ECI

1. BARBIERI  (58332 bytes) =
 Veljko Barbieri: Trojanski konj
 FORUM, Godiste XIX. Knjiga XXXIX. Broj 3.
 Zagreb, mart 1980, s.400-422
 

2. BULATOV1  (61828 bytes) =
 Miodrag Bulatovic: Crveni petao leti prema nebu
 Beograd "Prosveta" 1963
 Gl. 1-7  (s.7-49)

3. DRASKOV  (64508 bytes) =
 Vuk Draskovic: Molitva druga
 Beograd 1988
 Gl. 1-3, 9

4. KULENOV  (61325 bytes) =
 Kulenovic, Skender: Pripovijetke
 Sarajevo "Veselin Maslesa" 1971
 s. 37-69: Rakije
 s. 70-77: Sveto pravilo trojno

bytes:    245993
entries:    1243
keywords:  12584
words:     41854
-------------------------------------------

Notes on text markup
====================

BULATOV

udovima. lzgledalo	=> udovima . </s> <s> Izgledalo
sanduk. lza 		=> sanduk . </s> <s> Iza 

SAVIC

(1) 
<s> Od Filipa ni traga. &s1;ta je s njim , da mu se ne&s1;to
nije desilo ? </S>
===>
<s> Od Filipa ni traga , &s1;ta je s njim , da mu se ne&s1;to
nije desilo ? </S>

(2)
<s> - Ko te &s1;i&s1;a. &c1;i&c1;a ! - ka&z1;e mu jedna ,
frcokla i namigu&s1;a . </S>
===>
<s> - Ko te &s1;i&s1;a , &c1;i&c1;a ! - ka&z1;e mu jedna ,
frcokla i namigu&s1;a . </S>

(3)
<S> Ujak je &c1;ak tvrdio da je Filip. kome su ve&c2; mnogi
u&s1;li u tragove , postao sasvim nevidljiv : niko ne zna da
li on u tom trenutku &c1;u&c1;i u kro&s1;nji neke lipe , ili
sedi na klupi u parku ili se muva na pijaci , budno motre&c2;i
na sve varalice i lopove . </S>
===>
<S> Ujak je &c1;ak tvrdio da je Filip , kome su ve&c2; mnogi
u&s1;li u tragove , postao sasvim nevidljiv : niko ne zna da
li on u tom trenutku &c1;u&c1;i u kro&s1;nji neke lipe , ili
sedi na klupi u parku ili se muva na pijaci , budno motre&c2;i
na sve varalice i lopove . </S>

(4)
<s> " Drugarice i drugovi , kada sam polazio u partizane , tog
maglovitog julskog jutra 1941. godine , duboko su mi se u srce
urezale re&c1;i mog druga sa sportskog terena Prleta , koje je
izgovorio na brdu iznad na&s1;e varo&s1;i , na domaku &s1;uma . </s>
===>
<s> " Drugarice i drugovi , kada sam polazio u partizane , tog
maglovitog julskog jutra 1941 , godine , duboko su mi se u srce
urezale re&c1;i mog druga sa sportskog terena Prleta , koje je
izgovorio na brdu iznad na&s1;e varo&s1;i , na domaku &s1;uma . </s>

SELIMOV

The file selimov contains some occurrences of the character
FF(hex) = \377(oct) = 255(dec) = ÿ(iso) = M-DEL .
They always appear in ÿAF ... ÿAE pairs.

E.g.
siromaha, ÿAFjer su oni slomljena srcaÿAE. Molitvama 

ÿAFTajnu bo&z1;jeg jedinstva ne zna Ahriman. Pitaj Asafa,
on zna. Mo&z1;e li vrabac pojesti zalogaj kao Anka-ptica?
Mo&z1;e li jedna testija zahvatiti vodu velikog mora?ÿAE

u Smirni: ÿAFKad vidi&s1; da mlad &c1;ovjek stremi u nebo,
uhvati ga za nogu i svuci na zemljuÿAE.

me, ali, ÿAFvama va&s1;i poslovi, meni mojiÿAE. Nije 

I suspect that they are some kind of highlighting or emphasis
done by font change. Any ideas?
I've marked them with <em> ... </em>
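
A marking of this kind could be done along the lines of the Python sketch
below. It assumes the text has been decoded so that byte 255 appears as the
character ÿ (as with Latin-1) and that the pairs are never nested; the
function name is hypothetical and this is not necessarily how the marking
was actually carried out.

```python
import re

# "\u00ff" is the character ÿ (byte 255). A span opens with ÿAF and
# closes with ÿAE; DOTALL lets a span run across line breaks.
EMPHASIS = re.compile("\u00ffAF(.*?)\u00ffAE", re.DOTALL)


def mark_emphasis(text: str) -> str:
    """Wrap each ÿAF ... ÿAE span in <em> ... </em> and drop the markers."""
    return EMPHASIS.sub(r"<em>\1</em>", text)


if __name__ == "__main__":
    sample = "me, ali, \u00ffAFvama va&s1;i poslovi, meni moji\u00ffAE. Nije"
    print(mark_emphasis(sample))
```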


Spelling changes

ga nije bilo. l ne podigav&s1;i => ga nije bilo , l ne podigav&s1;i
nije &c1;uo. ni&s1;ta	        => nije &c1;uo , ni&s1;ta
oti&s1;ao. tako bi sam          => oti&s1;ao , tako bi sam
Jusufa poslije. kad su          => Jusufa poslije , kad su
sobom. l zaka&s1;njelo          => sobom , l zaka&s1;njelo
ispran , krhak. voda            => ispran , krhak , voda
kako je.trebalo da bude         => kako je , trebalo da bude 

r.elka LEFT as it is

