Multilingual Corpora in ECI/MCI

I.  Introduction

The thirteen multilingual corpora contained within ECI/MCI fall into a
small number of structural types, which in turn determine how they can
be accessed.  This document outlines the different types and how they
are structured.

II.  Parallel Sub-corpora

The corpora mul04 and mul05 consist of a number of parallel
sub-corpora, each contained in a separate subdirectory (e.g. msp04),
with its own .sgm, .ent and .eci files.  The mulnn.sgm files
themselves include these sub-corpora in the same way that mci.sgm
includes the component corpora of ECI/MCI, that is, as SUBDOCs.

Mul12 is not really parallel at all, being comparable material
assembled by one site about the Danish, English and Spanish legal
systems, but it is structured in a similar way to mul04 and mul05.

III.  More or less parallel files

The corpora mul03, mul06, mul08 and mul09 consist of a number of
parallel files, one per language, all located in the same directory.
This directory and its files are structured similarly to those for a
monolingual corpus, so e.g. mul03a.eci, mul03b.eci and mul03c.eci are
the German, French and Italian versions respectively of the mul03
material.  To provide for aliasing (see section VI), subdirectories
(e.g. mfr03) are provided, with links to the parent directory for both
.eci and .ent files.  A separate .sgm file in each subdirectory gives
access from there to a single language.

The corpora mul01 and mul02 are similar but less parallel.  They each
consist of a collection of varied material in various languages.
Mul01 is organised into subcomponents by material, with individual files
in different languages, while mul02 is organised into subcomponents by
language.  Again, subdirectories with appropriate links are provided
for aliasing.

IV.  Interlinear files

The corpora mul07 and mul13 consist of files with alternating elements in
two languages.  Their structure is identical to that of ordinary
monolingual corpora.  They have no aliases, as they have no
monolingual components.

V.  Other cases

The corpora mul10 and mul11 don't fit this picture well:

  mul10:  Consists of multiple English translations of two French
          originals, so the numbers of mfr10 and men10 .eci files
          are not equal.  No aliases.

  mul11:  Acquired too late for SGML markup to be done.  No aliases.

VI.  Aliases

When a multilingual corpus has separable monolingual components, these
are independently linked into the appropriate eci[1-4] data directory,
with corpus names composed of the appropriate language code and a code
number which is formed by adding 10 to the corpus code number of the
parent multilingual corpus.  For example, fre13 is the name of the
French component of mul03, and is linked to the appropriate
subdirectory, i.e. mul03/mfr03.
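
The naming rule above can be sketched in a few lines of Python.  This
is only an illustration: the function name alias_name and the constant
ALIAS_OFFSET are hypothetical, and the two-digit zero padding is an
assumption based on the fre13 example.  The only language code
confirmed by this document is "fre".

```python
import re

# The alias number is the parent corpus number plus 10 (per the text).
ALIAS_OFFSET = 10


def alias_name(language_code: str, parent: str) -> str:
    """Return the alias for one language component of a mulNN corpus.

    `language_code` is the three-letter code used in alias names
    (e.g. "fre"); `parent` is the multilingual corpus name (e.g. "mul03").
    """
    match = re.fullmatch(r"mul(\d{2})", parent)
    if match is None:
        raise ValueError(f"not a multilingual corpus name: {parent!r}")
    number = int(match.group(1)) + ALIAS_OFFSET
    # Two-digit padding is an assumption; the source only shows "fre13".
    return f"{language_code}{number:02d}"


print(alias_name("fre", "mul03"))  # prints "fre13"
```

Note that mul07, mul10, mul11 and mul13 have no aliases, so a real
tool would need to exclude them; the sketch does not.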

The point of all this is that e.g. eci1/fre*/*.eci is a pattern which
will pick out all and only the type 1 French .eci files, whether
monolingual in origin or components of multilingual corpora.
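
As a rough illustration of how such a pattern selects files, the
following Python sketch applies it to a list of paths.  All of the
path names are hypothetical (this document does not say which eci[1-4]
directory any given corpus lives in, nor what the monolingual French
corpora are called), and select_french is an invented helper name.

```python
from pathlib import PurePosixPath

# The pattern quoted in the text.
PATTERN = "eci1/fre*/*.eci"


def select_french(paths, pattern=PATTERN):
    """Keep only the paths matching the shell-style pattern.

    PurePosixPath.match is used because its "*" does not cross
    directory separators, as in a shell glob.
    """
    return [p for p in paths if PurePosixPath(p).match(pattern)]


# Hypothetical directory contents, for illustration only.
candidates = [
    "eci1/fre01/fre01a.eci",  # a monolingual French corpus
    "eci1/fre13/mul03b.eci",  # French component of mul03, via its alias
    "eci1/ger13/mul03a.eci",  # a German alias: wrong language
    "eci1/fre13/mul03.ent",   # right directory, wrong file type
    "eci2/fre01/x.eci",       # wrong data directory
]

selected = select_french(candidates)
print(selected)
```

Only the first two paths survive, one monolingual in origin and one
reached through an alias, which is the behaviour the aliasing scheme
is meant to provide.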
