Editorial Practices in the Markup of the Leiden Corpus of Dutch

The primary goal in marking up this corpus has been to preserve
as much as possible of the original while rendering it TEI-conformant,
at the same time maintaining the integrity of the text.
To this end almost all modifications are reversible.

As the corpus is composed of a number of different sub-corpora, in
some cases of quite distinct composition, the markup varies somewhat
from component to component.  We begin by describing general practice,
followed by sections specific to each sub-component.  It is apparent
that the process by which the material came to the form in which we
received it varied from sub-component to sub-component, with a
concomitant variation in quality.

I.  General Issues across the Whole Corpus

The only non-reversible changes which have been made are:
  Lines containing only blank characters have been rendered empty;
  Line-initial and line-final blank characters have been elided;
  Line break information has been lost in a small number of cases (see below
   under ??).
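
The first two of these changes can be sketched as follows.  This is a
minimal illustration, not the procedure actually used; the function
name is hypothetical, and we assume that "blank characters" means
spaces and tabs.

```python
def normalise_blank_characters(text: str) -> str:
    """Strip line-initial and line-final blanks from every line.

    A line consisting only of blanks becomes an empty line as a side effect.
    """
    blanks = " \t"  # assumption: "blank characters" are spaces and tabs
    return "\n".join(line.strip(blanks) for line in text.split("\n"))
```

Neither change can be undone, since the stripped characters are not
recorded anywhere in the markup.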


I.1  Structure

A modest level of structuring has been added in an effort to make the
material more useable.  Although we had very little information about
the files, nor any actual paper examples to compare with, some
structure could sometimes be induced by examination.  As elsewhere
<div0> is used to record the original file structure, and <div1> the
top-level constituents, articles, broadcasts, etc.  <p> is used
throughout to mark base-level text paragraphs.  Where variation in
inter-paragraph leading was judged to be potentially significant, it
has been recorded in a 'rend' attribute in a "key:value" form with 'pl'
used for the key and the number of blank lines, if different from 1,
as the value.

I.2  Character Set Normalisation

Different sub-components have used different encodings of accented
characters, and indeed some sub-components have not encoded them at
all.  All the encodings have been converted to ISO-8859-1.

The following note accompanied the original of the corpus:

    DOS characters which could not be translated to a similar VAX/VMS character
    are simulated as follows:
	  [xx;yyy]

    xx = WordPerfect map number
    yyy = character number

Two of the sub-components (f, Onze Taal and i, Sociaal Ontwerpen)
contained a small number of these, i.e. [04;002] and [04;014].  These
have been encoded with entity references &obc1; and &obc2;
respectively.  It is not at all clear what the intended characters
are.  &obc2; occurs only 11 times, on a line by itself.  &obc1; occurs
in a number of contexts which suggest no single interpretation is
possible, e.g. "DERRI[04;002]RE", which suggests E-grave, and
"d'[04;002]uvres", which suggests the oe digraph.

While every attempt has been made to validate the procedure actually
carried out to translate the encodings, with a corpus of this size no
guarantees can be given that there were no mistakes, either of
omission or commission.  We will gladly receive any queries or
suggestions of error and will follow them up if at all possible.
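
As an illustration of the treatment of the [xx;yyy] placeholders
described above, the following sketch replaces the two placeholders
found in the corpus with their entity references.  The names are
hypothetical, and the assumption that xx is always two digits and yyy
always three is ours, based only on the two attested cases.

```python
import re

# The two placeholders attested in the corpus and their entity references.
PLACEHOLDER_ENTITIES = {("04", "002"): "&obc1;", ("04", "014"): "&obc2;"}

# Assumed shape: two-digit map number, three-digit character number.
PLACEHOLDER = re.compile(r"\[(\d{2});(\d{3})\]")


def encode_placeholders(text: str) -> str:
    """Replace known [xx;yyy] placeholders; leave unknown ones untouched."""
    def substitute(match):
        return PLACEHOLDER_ENTITIES.get(match.groups(), match.group(0))
    return PLACEHOLDER.sub(substitute, text)
```

Because the mapping is one-to-one, the change is reversible by
substituting the entity references back.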

I.3  Punctuation

No changes have been made here.  Subject to the exceptions noted below
in section II, soft hyphens appear to have been
removed, but a lengthy examination to check this has not been done.
We note uses of "-" for punctuation, hard hyphenation and morpheme-level
conjunction (e.g. "binnen- en buitenland").

II.  Issues Specific to One or More Sub-components

II.1  dut02a, 8 O'Clock News

Each <div1> is a broadcast, with component parts given as <div2>.  The
'n' attribute on <div2> provides some taxonomy of parts.  <p> is
redundant, there being one per <div2>, often subsuming more than one
apparent actual paragraph, but as there was no simple and reliable
structural indication of real paragraph boundaries, they are not marked.

II.2  dut02b, Queen's Speech

There is a one-to-one correspondence between original files, speeches
and <p>.  <p> therefore always subsumes more than one apparent actual
paragraph, but as there was no simple and reliable structural
indication of real paragraph boundaries, they are not marked.

II.3  dut02c, Children's News

Each <div1> is a broadcast, with a date and component parts given as
<div2>.  Each <div2> has a <head> and a <p>, which again is redundant
and potentially composite without being marked as such.

II.4  dut02d, De Limburger

This is a larger and much more complex sub-component than any of the
others.  There was considerable inconsistency in how various
structural aspects of the newspaper articles were rendered, there were
numerous gaps in the material, and some inconsistency even at the
character level.  Although a considerable amount of work was invested
in recovering structure and distinguishing between the text which
makes up the substance of the newspaper articles, on the one hand, and
headlines, summaries, glosses, captions, etc. on the other, there are
undoubtedly still some failures to apply the markup consistently.

II.4.1  Structure

Articles in the original contained a small number of fields marked
with tags e.g. <DATE:>, <SLUG:> giving date, author, and other such
information, followed by, variously indicated by blank lines, narrow
margins, capitalisation, etc., headlines, summary paragraphs, fuller
statement of authorship, datelines, photo credits, etc., followed by
the main text comprised of blank-line-separated paragraphs.  This is
all marked up as <div1>, with <head>, <byline>, <dateline>, <docdate>
preceding a sequence of <p>, sometimes followed by a further byline.

DEPT, CUST and DPTH tags in the original are translated into
"key:value" annotations in the 'rend' attribute of <div1>.

When articles are split over several pages, this
is sometimes detectable, in which case there will be a subsequent
<div1 type=post-gap>.  This annotation is also used where gaps in the
text appear to cause two articles to run together in the original.

A header may sometimes refer to subsequent material "in kader" -- in a
box.  When the 'boxed' material is detectable subsequently, there will
be a <div1 type=kader> marking it.

The net consequence of all this is that text within <p>...</p> should be
both reasonably clean and not spuriously duplicated, but text
elsewhere should be taken with a grain or more of salt.

II.4.2  Rendering information

'Type' and 'rend' attributes on <head> give some idea of how
the markup was derived:
 type=SPA or Spatje:  Header tag in the original
 type=pm:  Header boundary marked by ¶, º and - rather than blank line(s).
 type=post-gap:  Headers marked by hand following mid-article break of
                 some kind
 type=back-pointer:  Contains reference to preceding portion of
		     article (i.e. "Continued from page 1")
 rend=blb:  One or more blank lines preceding -- <head>s without this tag
            followed on directly from the previous element
 rend=narrow:  Header marked out because it was composed of lines all
               roughly width 20 or less
 rend=mid:     Header marked out because it was composed of lines all
	       with width roughly between 20 and 40

A "pl:(number)" 'rend' attribute on <p> marks cases where paragraphs
were separated from their predecessors by other than a single blank
line, in which case the number gives the number of blank lines.  Note
this is NOT the same as for <head> above, where the unmarked case is
NO blank lines preceding.

In a small number of cases (48) paragraphs were seen to begin with
one or more words in all caps, preceded by the annotation "Unciaal:".
This has been taken to be rendering information, and transformed into
<hi rend=Unciaal>...</hi> around the upper-cased words.

Other values which appear in the 'rend' attribute are taken directly from the
original where it was apparent that they were typographic instructions
rather than content, e.g. "3kol" preceding a head was taken to mean
"set this in three columns" and so moved to 'rend'.

Note that multiple values for the 'rend' attribute are separated by
commas (,).
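
A user of the corpus might unpack a 'rend' value along the following
lines.  This is only a sketch: the function names are hypothetical,
and we assume that a value containing a colon is a "key:value" pair
and that anything else (e.g. "blb" or "3kol") is a bare flag.

```python
def parse_rend(rend: str) -> dict:
    """Split a comma-separated 'rend' value into pairs and bare flags."""
    parsed = {}
    for part in rend.split(","):
        part = part.strip()
        if not part:
            continue
        key, sep, value = part.partition(":")
        # Bare flags such as "blb" are recorded with the value True.
        parsed[key] = value if sep else True
    return parsed


def paragraph_leading(rend: str) -> int:
    """Blank lines preceding a <p>: the 'pl' value, or 1 when unmarked."""
    return int(parse_rend(rend).get("pl", 1))
```

Note that the default of 1 applies to <p> only; for <head> the
unmarked case is no preceding blank lines.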

II.4.3  Gaps

The annotations <TEXT:> and <TEXT:>\n<* * *>\n<TEXT:> occurred
throughout the original text (9183 and 156 times, respectively), and by and
large appear to indicate gaps in the original, as they usually disrupt
the continuity of the text at various levels.  These original tags
have been translated to occurrences of the tag <omit>, with the value
of the attribute 'desc' set to "lacuna" or "large lacuna",
respectively, with the 'resp' attribute given as 'inl', although we do
not actually know at what point in the capture process these gaps were
introduced.

Narrow-set articles are an exception to all of this; see II.4.4 below.
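
The general translation of the gap annotations can be sketched as
follows.  The names are hypothetical, the exact serialisation of the
<omit> tag is assumed (the text above fixes only the attribute names
and values), and we assume that each annotation stands on a line of
its own.  The narrow-set exception is not handled.

```python
import re

# Assumed serialisation of the <omit> tag.
LACUNA = "<omit desc=lacuna resp=inl>"
LARGE_LACUNA = '<omit desc="large lacuna" resp=inl>'

LARGE_GAP = re.compile(r"^<TEXT:>\n<\* \* \*>\n<TEXT:>$", re.MULTILINE)
GAP = re.compile(r"^<TEXT:>$", re.MULTILINE)


def mark_gaps(text: str) -> str:
    """Translate the original gap annotations into <omit> tags."""
    # Replace the three-line form first, so that its two <TEXT:> lines
    # are not picked up individually by the single-line pattern.
    text = LARGE_GAP.sub(LARGE_LACUNA, text)
    return GAP.sub(LACUNA, text)
```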

Note that in the most recent releases of the TEI P2 DTDs, <omit> is
renamed <gap> -- for backward compatibility with our earlier work we
have used the P2 facilities for renaming to retain the name <omit>.
We have also used P2 facilities to expand the range of allowed
locations of this element.

II.4.4  Narrow-set articles and soft hyphens

In around 5% of the articles, the entire text has been set on very
narrow margins.  The original ends many of the short lines in these
articles with the tag <<<NEED-LINES=6>>>, apparently where a soft
hyphen was required.  In order to preserve the utility of the corpus
for word searches etc., these have been 'corrected' to give the
intended sequence of un-hyphenated words, e.g.

 Original
    van kamerlid H. Jans<<<NEED-LINES=6>>>
    sen dat binnenkort met de

 ECI version
    van kamerlid H.
    <corr sic="Jans&nl6;sen" resp=eci>
    Janssen
    </corr>
    dat binnenkort met de

All the usual dangers attendant on hyphen elimination in the absence
of a comprehensive dictionary are present -- it is entirely possible
that some of these tags stood for HARD hyphens which have been
mistakenly removed.
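
The correction illustrated above can be sketched as follows.  This is
an illustration under stated assumptions, not the procedure actually
used: the names are hypothetical, line-initial blanks are assumed to
have been elided already, and a word is assumed to be split across at
most two lines.

```python
import re

MARKER = "<<<NEED-LINES=6>>>"

# First half of a word, the marker at the end of a line, then the second
# half of the word at the start of the next line.
SPLIT_WORD = re.compile(
    r"(^|[ ])(\S+)" + re.escape(MARKER) + r"\n(\S+)([ ]|$)",
    re.MULTILINE,
)


def join_split_words(text: str) -> str:
    """Rejoin words split by the marker, keeping the original in 'sic'."""
    def correct(match: re.Match) -> str:
        before, head, tail, after = match.groups()
        corr = (f'<corr sic="{head}&nl6;{tail}" resp=eci>\n'
                f"{head}{tail}\n</corr>")
        # Put the correction on lines of its own.
        return ("\n" if before else "") + corr + ("\n" if after else "")

    return SPLIT_WORD.sub(correct, text)
```

The sketch joins the two halves unconditionally, which is exactly
where a hard hyphen would be lost.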

Curiously enough, given its distribution elsewhere, the <TEXT:>
tag regularly occurred in the original BETWEEN the <<<NEED-LINES=6>>>
and the following line, but in none of the cases examined could any
discontinuity be detected.  Accordingly, these cases were treated as
above and the tag eliminated, but a 'rend' attribute was used to flag
the occurrence, e.g.

 Original
    Koophandel gaat de econo<<<NEED-LINES=6>>>
    <TEXT:>
    mische structuur van deze

 ECI version
    Koophandel gaat de
    <corr sic="econo&nl6;mische" rend="...<text:>..." resp=eci>
    economische
    </corr>
    structuur van deze

II.4.5  Character set normalisation

A few instances of uninterpretable characters have been replaced with
self-substituting entities:

    ¨ &uml; umlaut overstrike, semantics unclear, some kind of tab?
    Ð &bullet; capital eth originally, probably some kind of bullet?
    · &myst; middle dot originally, semantics completely unclear

º occurs as an alternative dash and/or paragraph separator.  In the
former role, it is replaced with &dash;, defined as -.  Undoubtedly
some of these should actually be paragraph breaks.

Two characters in the normally unused range \200-\237 appear:  \233
once, where it probably stands for %, and \216 nine times, looking
totally superfluous, or possibly like a soft hyphen.  These have been
replaced with entities as follows:
    \233 &percent; %
    \216 &shyp; -
In the latter case, the entity only appears in the 'sic' attribute of a
<corr> tag, with the clean word as the correction, e.g.
  Original
     Schaes\216berg

  ECI version
    <corr sic='Schaes&shyp;berg' resp=eci>
    Schaesberg
    </corr>

II.4.6 Non-reversible changes

  ¶, º and - (\266,\272,\055) were in various circumstances used as
  paragraph breaks, all marked up with <p> or </p>.
  The amount of paragraph leading within the header region of each
  article is not noted.
  
  The corrections of <<<NEED-LINES=6>>> and \216 discussed above often
  involve the loss of line-break information.

II.5 dut02e, Newspaper Citation

A special-purpose element <ncit> is used here, as there was nothing
appropriate in the TEI inventory.  Its structure is somewhat ad-hoc,
mirroring the structure indicated in the original by the use of
various tags.  The definition of <ncit> is as follows:

<!ELEMENT ncit    - - ( word, meaning?, remark?, seeAlso? ,
	              (( date, source, p ) | see )? )           >

Where meanings are numbered and/or marked with specialised
sub-language codes, we have used <list> with <label>ed <item>s.

Angle brackets were occasionally used in the body of <meaning> or
<remark> -- we were unsure what the significance of this was, but have
marked it up as <hi rend='<>'>...</hi>.

The character \253 («, open double angle) occurs twice, probably
intended as \275 (½, 1/2), and is represented as &dbllang;, with « as
its definition.

II.6 dut02f, Onze Taal

Although articles were tagged with some structural and presentational
markup in the original, the structural markup (of articles and
paragraphs) was not completely consistently applied.  The basic
structure is a sequence of articles marked as <div1>, with most having
one or more one-line introductory elements, marked as <head>, and a
sequence of paragraphs.  The last header often appears to be a byline,
so these have been marked <head type=pbyl>.  Single lines sometimes appear
outside the original tagging, immediately preceding paragraphs.  These
appear to be sub-heads, and have been marked <p type=sub-head>.  It is
unclear what the scope of these is, i.e. whether they introduce
sub-sections of several paragraphs or simply apply to a single
paragraph.  It is this uncertainty which has kept us from introducing
a <div2> level in these cases, which would have allowed the more
correct <head> tag to be used.

Similarly, there appear to be cases where article boundaries have not
been marked in the original, based on the blank-line structure.  Again
because of uncertainties about structure, we have simply marked text
which we think is a header but was not originally tagged as such with
<p type=phead> or <p type=pbyl>.  Finally, some material set off with
blank lines but not tagged as paragraphs in the original nonetheless
usually ARE paragraphs, and have been tagged <p type=unmarked>.  Some
of these are certainly not semantically strictly speaking paragraphs,
but rather e.g. sub-heads, bibliographic references, addresses,
tables, etc.

Tables were a problem, as in some cases tab characters were actually
used.  In all such cases, and in a few where multiple spaces were used
for formatting tables, we have introduced a minimal table element and
tagged the material using it.  Its definition is effectively

<!ELEMENT table - - (thead?,trow*) >
<!ELEMENT thead - - ((trow|%paraContent)*) >
<!ELEMENT trow - - (item*) >

so that e.g. a table row in the original of the form

 "one	two	three"

would be marked up as

 <trow>
 <item>
 one
 </item>
 <item>
 two
 </item>
 <item>
 three
 </item>
 </trow>
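
The row conversion shown above can be sketched as follows, for the
tab-separated case only.  The function name is hypothetical, and rows
formatted with multiple spaces would need separate treatment.

```python
def mark_up_row(line: str) -> str:
    """Render one tab-separated line as a <trow> of <item>s.

    Each tag and each cell goes on a line of its own, as in the example.
    """
    out = ["<trow>"]
    for cell in line.split("\t"):
        out += ["<item>", cell.strip(), "</item>"]
    out.append("</trow>")
    return "\n".join(out)
```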

Note that this is one of the few texts in the whole corpus in which a
left angle bracket appears in the body of the text, and furthermore in
a few cases does so unavoidably at the beginning of a line.  Since
this would violate the basic ECI text/markup invariant (see the
Editorial Declaration for the corpus as a whole for further discussion
of this invariant), we replace the line-initial left angle brackets
with the corresponding entity, namely &lab;.
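
A minimal sketch of this replacement follows.  The function name is
hypothetical, and the function must be applied to the original text
before any markup is added, since afterwards every markup line also
begins with a left angle bracket.

```python
def protect_line_initial_lab(text: str) -> str:
    """Replace a left angle bracket at the start of a line with &lab;."""
    return "\n".join(
        "&lab;" + line[1:] if line.startswith("<") else line
        for line in text.split("\n")
    )
```

Left angle brackets elsewhere in a line are left as they are.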

II.7 dut02g, Town Councils

The structure of lessons, sections, paragraphs and sub-paragraphs was
largely induced semi-automatically on the basis of numbering in the
text: only lesson boundaries were actually marked with private tags in
the original.  The allocation of material between <head> and <p> was
done semi-automatically (by a non-speaker of Dutch) on the basis of
capitalisation -- no other evidence was available, although presumably
at some point e.g. word-processor codes must have been present.  For
example:

  Original
    . . . worden voortgezet. 3.1.2 Samenwerking tussen de
    politie In afwachting van de definitieve nieuwe . . .

  Result:
    . . . worden voortgezet.
    </p>
    </div4>
    <div4 type=sub-para n=3.1.2>
    <head>
    Samenwerking tussen de politie
    </head>
    <p>
    In afwachting van de definitieve nieuwe . . .

Some errors of structuring have undoubtedly been introduced at this point.
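
The kind of heuristic involved can be sketched as follows.  This is
not the procedure actually used, which was only semi-automatic; the
names are hypothetical, and we assume that a section number is a
sequence of dot-separated numerals and that the head ends just before
the next capitalised word.

```python
import re

# Assumed shape of a section number: at least two dot-separated numerals.
SECTION_NUMBER = re.compile(r"\b(\d+(?:\.\d+)+)\s+")


def split_numbered_heading(text: str):
    """Split running text at the first section number.

    Returns (text before, number, head, body), or None when no section
    number is found.
    """
    match = SECTION_NUMBER.search(text)
    if match is None:
        return None
    before = text[:match.start()].strip()
    words = text[match.end():].split()
    # The head runs from its first word up to, but not including, the
    # next word that starts with a capital letter.
    cut = next(
        (i for i, word in enumerate(words) if i > 0 and word[:1].isupper()),
        len(words),
    )
    return before, match.group(1), " ".join(words[:cut]), " ".join(words[cut:])
```

A head containing a proper noun would be cut short by this rule,
which is one way in which errors of structuring can arise.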

In a small number of cases, <div3> was introduced on the basis of
capitalisation evidence alone, and a paragraph number assigned, not
taken directly from the text.  This was done where there was an
obvious gap in the sequence of explicit paragraph numbers which were
present in the text, and in all cases the number of gaps matched the
number of anomalous capitalisations.  These cases are marked with type
ppara, for probable paragraph.

Page boundaries were marked in the original, and have been marked up
with the <pb> empty tag.

The middle dot character (·) occurred three times in a list-like
context -- this has been replaced with the &bullet; entity reference.

II.8  dut02h, i and j

These sub-components have been withdrawn and included as dut03 because
there was insufficient time available to clean up various
idiosyncratic formatting in the originals.