This is the European Corpus Initiative Multilingual Corpus I (ECI/MCI).

All use of this corpus is subject to a licence agreement:  see the
file LICENCE.

The ECI is a volunteer effort, sponsored by the Association for
Computational Linguistics (European Chapter), carried out at the Human
Communication Research Centre, University of Edinburgh (HCRC) and
Institut Dalle Molle pour les etudes semantiques et cognitives,
University of Geneva (ISSCO), with modest additional financial support
from the European Network in Language and Speech (ELSNET) and the
Network for European Reference Corpora (NERC).

We are very grateful to all those who made material available for this
effort, none of whom received any compensation for their
contributions.  Without their generosity and prior effort, there would
be no ECI/MCI.  Please respect any restrictions they have specified on
the use of their contributions -- these are recorded both in the
top-level LICENCE file and in the CPYRIGHT files and headers of the
individual corpora themselves.

---------------------
I.  Directory summary
---------------------

bin:	msdos	Contains MSDOS .exe files of some basic tools (gzip,
		perl and sgmls)
	unix	Contains UN*X scripts for corpus manipulation


data:	Contains the actual corpus data

doc:	Documentation files

lib:	fonts	.bdf versions of fonts for ISO 8859-5 and ISO 8859-7
		(Cyrillic and Greek)
	tei	ECI and TEI files needed for SGML applications

src:	mac	BinHexed version of gzip
	msdos	PKZipped sources of gzip, perl and sgmls
	perl	Perl scripts for corpus manipulation
	unix	Sources for gzip and sgmls
        
--------------------
II.  The Corpus Data
--------------------

The actual data is in files with a .eci suffix, two levels below the
data directory.  For information about the structure of that
directory, see the file doc/dirstrct.txt; for a listing of the titles
of the corpora themselves, see mci.bib; and for a brief summary of
their contents, see doc/corpdesc.txt.

Most of the data is marked up in TEI-compliant SGML -- see mci.edt for
discussion, and the bin and src directories for tools to assist in
processing and accessing the data.  The top-level file mci.sgm
provides an SGML entry point to the corpus as a whole, or to selected
parts of it -- again see mci.edt for further instructions.

Do not despair if you have no interest in SGML markup, or no
facilities for exploiting it, and just want 'plain text'.  The bulk of
the data provided here, including all that under directories data/eci1
and data/eci2, observes what we call the Text/Markup Invariant:

  Every line in a data file (.eci file) is either all text or all
  markup, and a line is a markup line if and only if it begins with a
  left angle bracket (<).  This makes restricting your processing to
  'plain text' very easy -- just look only at lines which begin with
  some character other than <.

The UN*X shell script bin/unix/textonly, used in the examples in
section IV below, both implements this for UN*X users and documents
it for others.
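
As an illustration of the invariant, here is a minimal sketch of a
plain-text filter in portable shell.  It is not the textonly script
from the CD: the function name eci_plain_text is invented for this
example, and the sketch assumes only what is stated above, namely that
every markup line begins with <.

```sh
#!/bin/sh
# Sketch only: eci_plain_text is a hypothetical helper, not bin/unix/textonly.
# It relies on the Text/Markup Invariant: a line is markup if and only if
# it begins with '<', so deleting those lines leaves the plain text.
eci_plain_text() {
    # LC_ALL=C makes sed work on raw bytes, so 8-bit ISO 8859 data
    # passes through unchanged.  With no file arguments, sed reads
    # standard input.
    LC_ALL=C sed -e '/^</d' "$@"
}
```

Once the function is defined, it can be given one or more .eci files
as arguments, or used at the end of a pipeline.  It should only be
applied to data that observes the invariant.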

Note that the file lib/tei/tei.dcl is an SGML declaration which is
required for any SGML application processing ECI/MCI files.

--------------------
III.  Character Sets
--------------------

The majority of the data in ECI/MCI is encoded using the ISO 8859-1
(ISO Latin 1) character set.  Some use is also made of ISO 8859-2
(ISO Latin 2, for Czech), ISO 8859-5 (Cyrillic, for Bulgarian and
Russian) and ISO 8859-7 (Greek).  All of these are 8-bit character
sets, i.e. each has 256 code positions and uses 8 bits per character.
They are also all identical to ASCII for the first 128 character
codes.  Some support is provided, mostly for UN*X environments, for
displaying and printing the full character inventory of these
character sets -- see src/unix/isoscrpt and lib/fonts for more
information.
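
On a system whose tools expect a different encoding, the data can be
converted before display.  The following sketch assumes that the POSIX
iconv utility is installed, that it accepts charset names of the form
ISO-8859-5 (the accepted names are implementation-defined), and that
you already know which character set a given file uses.  The name
eci_to_utf8 is invented for this example, and UTF-8 is only one
possible target encoding.

```sh
#!/bin/sh
# Sketch only: eci_to_utf8 is a hypothetical helper, not part of the CD.
# Usage: eci_to_utf8 CHARSET [FILE...]
# Converts the named files (or standard input) from CHARSET to UTF-8.
eci_to_utf8() {
    charset=$1      # e.g. ISO-8859-1, ISO-8859-2, ISO-8859-5 or ISO-8859-7
    shift
    iconv -f "$charset" -t UTF-8 "$@"
}
```

The character set must be chosen per corpus: converting a Greek file
as if it were ISO 8859-1 will succeed without any error but produce
the wrong characters.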


-------------------------------------
IV.  Examples of use on a UN*X system
-------------------------------------

To see just the header information on the corpus as a whole, type

	eci mci.sgm

To process the ENTIRE corpus through the sgmls program, type

	eci -iall mci.sgm

at the top level.

To process all the Spanish corpus components similarly, type

	eci -ispa mci.sgm

at the top level.

To retrieve just the text of all the German components, type

	textonly data/*/ger*/*.eci

at the top level.

Note that all of the above require that the environment variable
ECI_ROOT be set to the full pathname of the CD (e.g. "/cdrom"), that
your working directory be $ECI_ROOT, and that $ECI_ROOT/bin/unix be in
your path.
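
These three requirements can be met in one step.  The following is a
sketch for a Bourne-compatible shell; the function name eci_setup is
invented for this example, and the mount point is whatever applies on
your system.

```sh
#!/bin/sh
# Sketch only: eci_setup is a hypothetical helper, not part of the CD.
# Usage: eci_setup MOUNTPOINT     (e.g. eci_setup /cdrom)
eci_setup() {
    # Make the CD's top level the working directory; give up if that fails.
    cd "$1" || return 1
    # Record the full pathname of the CD, even if $1 was a relative path.
    ECI_ROOT=$(pwd)
    # Put the UN*X scripts (eci, textonly) on the search path.
    PATH=$ECI_ROOT/bin/unix:$PATH
    export ECI_ROOT PATH
}
```

After a successful call such as "eci_setup /cdrom", the commands
shown above can be typed as they stand.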

-----------------------------
V.  Finally, "Caveat Lector"
-----------------------------

These corpora came to us in every conceivable format, character set,
state of existing preparation and markup.  Previous annotation
sometimes introduced errors that we have missed, and we know that we
have introduced some errors of our own, despite our best efforts.
This is almost inevitable since the size of these corpora required us
to use semi-automatic markup and correction schemes.  Nonetheless, we
believe that our efforts have added more value than they have taken away,
and hope the results will be of use.

We would be glad to hear of errors that you discover when using these
corpora, or to receive tools for aiding in their exploitation -- send
e-mail to eucorp@cogsci.ed.ac.uk.  We will endeavour to keep all
recipients of the ECI/MCI informed of any such submissions.
