		Description of contents of the ECI/MCI
		======================================

				31 Mar 94
				=========


 Id      Type         	   Language	Size    
        		                (K words)   
=================================================

 alb01	 Word list and Texts	Albanian		  205

	 (a) Albanian word list, 32K words with syntactic classes:
	 the Albanian dictionary of 1984, published in Tirana by the
	 Academy of Sciences.
	 (b) The novel "Koncert në fund të dimrit" by Ismail Kadare,
	 published in Tirana.

 bul01	 Technical		Bulgarian		    5

	 A number of scientific papers from the journal "Science".

 chi01	 Newspaper		Chinese			 2895

	 The PH text corpus contains 3.75 million Chinese characters. It is
	 a collection of news from China's official Xinhua (New China) news
	 agency, covering the period from January 1990 to March 1991.
	 The text is GB-coded, with word and phrase boundaries marked.

 cze01   Newspaper	        Czech			  726   

	 Newspaper Texts (Lidove noviny, Literarni noviny)

 cze02   Newspaper	        Czech			 4000   

	 Newspaper Texts (Lidove noviny, Literarni noviny)

 dut01   Newspaper	        Dutch			  600  

	 Articles from the student newspaper Universiteitskrant of the
	 University of Groningen from the academic years 1990/1991 and
         1991/1992.

 dut02	 Mixed			Dutch			 5203

	 A large Dutch corpus from INL including transcripts of radio
	 programs, newspaper and magazine issues and some technical texts.

 dut03	 Mixed			Dutch			  128

	 A continuation of dut02.

 eng01   Novels			English			  241   

	 Three English novels from the OTA collection:
	 Thomas Hardy 		'Far from the Madding Crowd'
	 George Eliot 		'Silas Marner'
	 Charles Dickens 	'A Christmas Carol'

 eng02   Fiction		English			  900   

	 The Complete Sherlock Holmes, Sir Arthur Conan Doyle.

 est01   Mixed			Estonian		  100   

	 Extracts from general fiction and prose.

 fre01   Newspaper		French			 4121 
  
	 Text from Le Monde newspaper, consisting of articles from September
	 and October 1989, and January 1990.

 gae01   Dictionary		Gaelic			  141   

	 MacBain, Alexander, "An etymological dictionary of the
	 Gaelic language", Gairm Publications, 1982
	 (1st edition 1896, revised 1911).

 ger01   Sentence list		German			   20   

	 Lists of German sentences, tagged with some syntactic information:
	 the sentence test suite of DiTo, a linguistic database
	 for diagnostics in the syntax components of NLP systems.

 ger02   Newspaper		German			  191   

	 German newspaper articles from VDI-Nachrichten, 1990-1991

 ger03	 Newspaper		German			34291

	 Frankfurter Rundschau Newspaper text

 ger04	 Newspaper		German			 7376

	 Donaukurier newspaper texts

 gre01   Mixed			Greek			 2515   

	 Newspapers, periodicals and popular fiction, 1976-1990.

 ita01   Fiction		Italian			   13  
 
	 Six short stories by G. Verga

 ita02 	 Mixed			Italian			 3491

	 Selection of books and newspaper texts

 ita03   Newspaper		Italian			  303   

	 Corpus of Italian newspapers (La Stampa, Il Mattino)

 jap01   Dictionary		Japanese		  203   

	 EDICT Japanese/English dictionary, Japanese Constitution

 jap02	 Technical		Japanese		  148

	 Japanese version of the ITU CCITT data.

 lat01	 Poetry			Latin			   75

	 Vergil, Aeneid, books I - XII
	 Vergil, Georgics, books I - III

 lit01	 Fiction		Lithuanian		   20

	 "KOLEKCIONIERIUS" Story

 mal01   Technical/Novels	Malay			  563

	 A collection of original Malay texts and translations from
	 English, mainly technical books with some novels.
	 From Universiti Sains Malaysia and Dewan Bahasa & Pustaka (publishers).

 mul01   Financial		En/Fr/Ge		  566   

	 Financial reports from Union Bank of Switzerland (mostly French-German)

 mul02   Technical		Fr/Ge/It		  177   

	 Avalanche bulletins 1986-1991 (ca. 40 per year, ca. 250 words each)
	 from the Swiss Federal Institute for Snow and Avalanche Research.
	 (Very little Italian)

 mul03   Legal			Fr/Ge/It		  227   

	 Text of Swiss Civil Code 

 mul04   Technical		En/Fr/Sp		13497

	 International Telecommunications Union CCITT handbook

 mul05	 Legal			En/Fr/Sp		 5000

	 International Labour Organisation "Official Bulletin, B Series":
	 "Reports of the Committee on Freedom of Association of the Governing 
	 Body of the ILO and related material 1984-1989".

 mul06   Technical		9 EC langs		  219   

	 The announcement text of the EC Esprit program.

 mul07   Sentence list		En/Fr			   12   

	 BABEL project data - French business sentences and English
	 translations.

 mul08   Novel			En/SC			  386  

	 George Orwell's "1984" in English, Serbian, Croatian and
	 Slovenian versions. 

 mul09   Technical		5 EC langs		  248 

	 ScanWorX User's Guide (Optical Character Reader) 

 mul10	 Mixed			English/French		   19

	 HCRC MT Evaluation Corpus: French/English parallel texts

 mul11	 Financial		German/French		  615

	 Financial Reports from CREDIT SUISSE

 mul12	 Legal			Danish/Spanish/English	 1199

	 The machine-readable 'Civil Law Corpus' from the 
	 Copenhagen Business School (not parallel)

 mul13	 Novel			Uzbek/English		   72

	 Uzbek novel 'rk Freedom' with English interlinear translation

 nor01   Fiction/Drama		Norwegian		 2226   

	 Collection of texts in Bokmaal and Nynorsk, including some novels
	 and some Ibsen plays.

 por01   Mixed			Portuguese		  675   

	 An extract from the Borba/Ramsey corpus of Brazilian Portuguese.

 rus01	 Technical		Russian			  364

	 Technical reports (computer related)

 ser01   Fiction		Serbian			  700   

	 Short stories and novel extracts

 spa01   Speech			Spanish			 1041

	 Transcribed Spanish speech from 
         CORPUS ORAL DE REFERENCIA DEL ESPANOL CONTEMPORANEO 1991-1992

 spa02   Newspaper		Spanish			  447   

	 One week of the local Spanish newspaper "Sur" from April and Sept 1991.

 spa03   Newspaper		Spanish			  830   

	 "El Diario Vasco" newspaper articles 1991

 swe01   Mixed			Swedish			 1718   

	 A fragment of SUC, the Stockholm-Umea Corpus of modern
	 written Swedish: text extracts (~2000 words each) from
	 books and newspapers published after 1990.

 tur01   Dictionary		Turkish			  173   

	 PC-KIMMO rule specification and word lists for Turkish morphology

 tur02	 Newspaper		Turkish			  110

	 News text excerpted from the Anatolia News Agency feed, covering
	 roughly Sept/Oct 1992. Approximately 10% of the total.

Total					            approx 99 M words
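
The total can be cross-checked against the individual entries: added by
hand, the sizes listed above come to 98,995 K words, which agrees with the
approximate figure of 99 M. The following is a minimal sketch of the same
check in Python, not part of the ECI/MCI distribution. The names
parse_sizes, total_m_words and SAMPLE are hypothetical, and the sketch
assumes that every entry header starts with an id of three lowercase
letters plus two digits and ends with the size in K words.

```python
import re

# Assumption: an entry header line looks like
#   " alb01<TAB> Word list and Texts<TAB>Albanian<TAB><TAB>  205"
# i.e. an id of three lowercase letters and two digits first, and the
# size (in K words) as the last whitespace-separated field.
ENTRY_RE = re.compile(r"^\s*([a-z]{3}\d{2})\s.*?\s(\d+)\s*$")


def parse_sizes(catalogue_text):
    """Return {id: size in K words} for every entry header line."""
    sizes = {}
    for line in catalogue_text.splitlines():
        match = ENTRY_RE.match(line)
        if match:
            sizes[match.group(1)] = int(match.group(2))
    return sizes


def total_m_words(sizes):
    """Sum sizes given in K words and convert the result to M words."""
    return sum(sizes.values()) / 1000.0


# A few lines copied from this catalogue; description lines are skipped
# because they do not start with an entry id.
SAMPLE = """\
 alb01\t Word list and Texts\tAlbanian\t\t  205

\t (a) Albanian word list 32K words with syntactic classes

 bul01\t Technical\t\tBulgarian\t\t    5
 mul12\t Legal\t\t\tDanish/Spanish/English\t 1199
"""

sample_sizes = parse_sizes(SAMPLE)
print(sample_sizes, total_m_words(sample_sizes))
```

To check the whole catalogue, the text of this file would be passed to
parse_sizes in place of SAMPLE.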
