This directory contains shell scripts for UN*X for processing ECI/MCI
files.  They may also be helpful as models for processing on other
platforms.

---
eci
---

This script applies sgmls to a .sgm file and processes all the data
files which that file includes.  It is a wrapper for sgmls which sets
up SGML_PATH and incorporates lib/tei/tei.dcl, the required SGML
declaration.  It uses src/perl/ecipath.pl to determine the right
setting of SGML_PATH.  It does NOT need to be run from the directory
where the .sgm file is located: if it can't find the named file, it
will try to find it in the appropriate place under $ECI_ROOT/data.
The following is taken from the script itself:

  Usage: eci [-trace] [-dir dir] [switches for sgmls] [filename]

  Invoke tracing with -trace flag, to debug entity location failures
  pipe into "grep ENOENT"

  When piping in, use -dir to specify the origin directory

  Echoes SGML_PATH to STDERR if -showpath is used

  Uses the environment variable ECI_ROOT, which should be the full
  pathname of the root of the ECI directory tree -- usually
  this will be the mount directory of an ECI CD.

  Uses the environment variable ECI_PATCH, which can be used in future to
  point to patch directories if we find it necessary to distribute patches.
  The value of this if defined should probably be a valid prefix
  to SGML_PATH with %T on each component, e.g. "/foo/%T:/baz/bar/%T:"
  NOTE THE TRAILING COLON!!!!

Note that although eci can also accept material piped in from STDIN,
this is primarily for the use of oneeci (see below).
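
The format of ECI_PATCH is easy to get wrong, so here is a minimal
sketch of building it from a list of patch directories.  The helper
name make_eci_patch, the directories /foo and /baz/bar, and the
fallback mount point /cdrom are all hypothetical; only the "%T on each
component, trailing colon" format comes from the script's own notes.

```sh
#!/bin/sh
# make_eci_patch is a hypothetical helper, not part of the distribution.
# It appends "/%T:" to each directory given, so the result always ends
# with the trailing colon that the eci script asks for.
make_eci_patch() {
    result=""
    for dir in "$@"; do
        # ${dir%/} drops one trailing slash, avoiding "//%T".
        result="${result}${dir%/}/%T:"
    done
    printf '%s\n' "$result"
}

# Assumed locations: replace with your own CD mount point and patch dirs.
ECI_ROOT=${ECI_ROOT:-/cdrom}
ECI_PATCH=$(make_eci_patch /foo /baz/bar)
export ECI_ROOT ECI_PATCH

printf 'ECI_ROOT=%s\nECI_PATCH=%s\n' "$ECI_ROOT" "$ECI_PATCH"
```

With these set, running eci with -showpath on a .sgm file should let
you check on STDERR that the patch directories come first in SGML_PATH.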

------
oneeci
------

This is a script for applying sgmls to a single .eci file, invoked
either with a full pathname to that file or by specifying a corpus id,
component letter and, if necessary, file index, e.g. ger03a02.eci or
ger03 a 02.  As described above for eci, if the given pathname does
not resolve to a file, oneeci will try to find one of the right name
in the appropriate place under $ECI_ROOT/data.  oneeci uses eci, so
switches which make sense for eci can also be used for oneeci.  The
following is taken from the script itself:

  fake a .sgm file for a SINGLE corpus file
  usage:  oneeci corp comp [sub] [switches] -- e.g. one-eci dut02 a 03
        or
          oneeci file.eci [switches]        -- e.g. one-eci ../eci1/dut02b.eci
        or
          oneeci - {file.eci,corp comp [sub]} [switches] -- data from STDIN
                                                       but corpus id from ARGs
  remaining args are passed to eci and/or sgmls
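
The two invocation forms name the same file in different ways.  The
following sketch shows how a file name might be split into the
"corp comp [sub]" form.  It is an illustration only, not code from
oneeci: the function name split_eci_name is hypothetical, and it
assumes that a corpus id is always five characters (xxxnn) followed by
a single component letter.

```sh
#!/bin/sh
# split_eci_name is a hypothetical helper, not part of oneeci.
# Assumption: names look like xxxnnC[ii].eci, i.e. a five-character
# corpus id, one component letter, and an optional file index.
split_eci_name() {
    base=${1##*/}        # strip any leading directories
    base=${base%.eci}    # strip the .eci suffix
    corp=$(printf '%s' "$base" | cut -c1-5)
    comp=$(printf '%s' "$base" | cut -c6)
    sub=$(printf '%s' "$base" | cut -c7-)
    # Print the index only when the file name actually has one.
    printf '%s\n' "$corp $comp${sub:+ $sub}"
}

split_eci_name ger03a02.eci
split_eci_name ../eci1/dut02b.eci
```

The output of the first call has three fields and that of the second
has two, matching the optional [sub] argument in the usage above.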

--------
textonly
--------

This is a script which relies on the Text/Markup Invariant to strip
off all markup and provide only the text from data (.eci) files.  It
can therefore be used on any .eci file under data/eci[12].  It will
also replace the three standard entities (&amp;, &lab; and &sq;) with
their definitions (&, < and ') if invoked with the -e switch, and if
invoked with -exxxnn, where xxxnn is a corpus id (e.g. dut02), will
attempt to find any other data entity definitions in
$ECI_ROOT/...xxxnn.ent and replace them as well.  It uses egrep by
default, but if you have GNU egrep, you should provide that as the
value of ECI_GREP, to speed things up.  On the other hand, if you lack
egrep but have grep, you will need to set ECI_GREP to that.  The
following is taken from the script itself:

  give a stream of just text from ECI files
  Usage:  textonly [-b] [-e[corp-code]] files...
  With -e, will substitute &, < and ' for &amp;, &lab; and &sq; respectively and
    will replace &#nnn; with the appropriate character (e.g. &#27; -> <ESC>).
  With -ecorp-code, will attempt to find further entity definitions
   from corp-code.ent and substitute them as well.
  With -b, will include blank lines, at a very slight time penalty,
   otherwise not.
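
To make the -e substitution concrete, here is a minimal sketch of
replacing the three standard entities with sed.  It is not taken from
textonly, which may well do this differently; the function name
subst_std_entities is hypothetical, and the sketch handles neither
markup stripping nor numeric references.

```sh
#!/bin/sh
# subst_std_entities is a hypothetical filter, not code from textonly.
# It reads text on STDIN and replaces the three standard entities.
# &amp; is handled last, so that text such as "&amp;lab;" comes out as
# the literal string "&lab;" rather than being expanded a second time.
subst_std_entities() {
    sed -e 's/&lab;/</g' \
        -e "s/&sq;/'/g" \
        -e 's/&amp;/\&/g'
}

printf '%s\n' 'x &lab; y &amp; z &sq;' | subst_std_entities
```

Note the backslash before & in the last replacement: an unescaped & on
the right-hand side of a sed substitution stands for the matched text.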

--------
isoscrpt
--------

This script supports printing of ECI/MCI files using ISO-LATIN-{1,5,7}
encoding.  It makes use of the Adobe Transcript utility enscript, which
any UN*X system capable of printing PostScript should have.
The following is taken from the script itself:

  A wrapper for enscript which prints using an ISO-8859 character map
  Defaults to Latin-1 (Roman), but use -fn 5 for Latin-5 (Cyrillic)
                                       -fn 7 for Latin-7 (Greek)
  Point size defaults to 12
  Usage: isoscrpt [-fn {1,5,7}] [-ps pointsize] [enscriptargs]

For example,

  isoscrpt $ECI_ROOT/data/eci1/fre01/fre01a01.eci

will print some of the French data, while

  isoscrpt -fn 5 -ps 14 $ECI_ROOT/data/eci2/rus01/rus01b01.eci

will print some of the Russian data.

Note that because the fonts used for Latin 5 and Latin 7 are Type 3,
printing these will be quite slow.
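
The defaults described above can be pictured with the following
sketch of the option handling.  It is an assumption about how such a
wrapper might parse its arguments, not an extract from isoscrpt; the
function name parse_isoscrpt_args and the variables fn, ps and rest
are hypothetical.

```sh
#!/bin/sh
# parse_isoscrpt_args is a hypothetical illustration, not the real script.
# It sets fn (character set) and ps (point size), leaving the remaining
# arguments, which would be handed on to enscript, in rest.
parse_isoscrpt_args() {
    fn=1      # default character set: Latin-1
    ps=12     # default point size
    while [ $# -gt 0 ]; do
        case $1 in
            -fn|-ps)
                # Both switches take a value; refuse a dangling switch.
                [ $# -ge 2 ] || { echo "missing value for $1" >&2; return 1; }
                if [ "$1" = -fn ]; then fn=$2; else ps=$2; fi
                shift 2 ;;
            *)
                break ;;
        esac
    done
    rest=$*
    case $fn in
        1|5|7) ;;
        *) echo "unsupported -fn value: $fn" >&2; return 1 ;;
    esac
}

parse_isoscrpt_args -fn 5 -ps 14 rus01b01.eci
printf 'fn=%s ps=%s rest=%s\n' "$fn" "$ps" "$rest"
```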

--------
ebdc2iso
--------

This is a simple filter to convert EBCDIC to ISO-8859-1, with a switch
to choose between two common variants of EBCDIC.  From the script:

  EBCDIC to ISO Latin 1 conversion
  Usage: ebcdic2iso [-37|-500] [files]
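
On systems with iconv, a comparable conversion can be sketched as
below.  This is not how the script itself works, as far as this README
says.  It assumes that the -37 and -500 switches correspond to IBM
code pages 037 and 500, and that your iconv knows the codeset names
IBM037 and IBM500 (the glibc iconv does; others may use different
names or lack them).  The function name ebcdic_to_latin1 is
hypothetical.

```sh
#!/bin/sh
# ebcdic_to_latin1 is a hypothetical stand-in, not the ebdc2iso script.
# Assumption: variant 37 means code page IBM037, variant 500 means IBM500.
# Reads EBCDIC on STDIN and writes ISO-8859-1 on STDOUT.
ebcdic_to_latin1() {
    case ${1:-37} in
        37)  from=IBM037 ;;
        500) from=IBM500 ;;
        *)   echo "unknown EBCDIC variant: $1" >&2; return 1 ;;
    esac
    iconv -f "$from" -t ISO-8859-1
}

# The five octal bytes below are the EBCDIC codes for the word HELLO.
printf '\310\305\323\323\326' | ebcdic_to_latin1 37
echo
```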

------
dopost
------

This is a wrapper for the Perl script src/perl/post.pl (q.v.), which is a
specialisation of sgmls.pl, a script provided with the sgmls
distribution.  It invokes post.pl with parameters designed to
reconstruct the output of sgmls when applied to ECI/MCI material back
into SGML.  In fact, for the body of the data files under data/eci[12],
this should be the identity transformation.

For example, try

  eci ita03.sgm | dopost | more
