Editorial Practices in the Markup of the Leiden Corpus of Dutch

The primary goal in marking up this corpus has been to preserve as much as possible of the original while rendering it TEI-conformant, at the same time maintaining the integrity of the text. To this end almost all modifications are reversible.

As the corpus is composed of a number of different sub-corpora, in some cases of quite distinct composition, the markup varies somewhat from component to component. We begin by describing general practice, followed by sections specific to each sub-component. It is apparent that the process by which the material came to the form in which we received it varied from sub-component to sub-component, with a concomitant variation in quality.

I. General Issues across the Whole Corpus

The only non-reversible changes which have been made are:

  Lines containing only blank characters have been rendered empty;
  Line-initial and line-final blank characters have been elided;
  Line break information has been lost in a small number of cases (see below under ??).

I.1 Structure

A modest level of structuring has been added in an effort to make the material more usable. Although we had very little information about the files, and no paper originals to compare them with, some structure could sometimes be induced by examination. As elsewhere is used to record the original file structure, and the top-level constituents, articles, broadcasts, etc.
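
The two whitespace normalisations listed among the non-reversible changes can be sketched as follows. This is only an illustration, not the procedure actually used: the function name is hypothetical, and it assumes that "blank characters" means spaces and tabs.

```python
def normalise_lines(text: str) -> str:
    """Apply both whitespace normalisations to every line of text.

    Hypothetical helper; assumes blank characters are spaces and tabs.
    """
    cleaned = []
    for line in text.split("\n"):
        # Stripping both ends removes line-initial and line-final blanks;
        # a line made up only of blanks therefore becomes empty.
        cleaned.append(line.strip(" \t"))
    return "\n".join(cleaned)


sample = "  Eerste regel \n   \t \nTweede regel"
print(repr(normalise_lines(sample)))  # 'Eerste regel\n\nTweede regel'
```

Because the stripped blanks are discarded and not recorded anywhere, the original spacing cannot be recovered from the output, which is why these changes count as non-reversible.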

is used throughout to mark base-level text paragraphs. Where variation in inter-paragraph leading was judged to be potentially significant, it has been recorded in a 'rend' attribute in a "key:value" form with 'pl' used for the key and the number of blank lines, if different from 1, as the value.

I.2 Character Set Normalisation

Different sub-components have used different encodings of accented characters, and indeed some sub-components have not encoded them at all. All the encodings have been converted to ISO-8859-1.

The following note accompanied the original of the corpus:

  DOS characters which could not be translated to a similar VAX/VMS character are simulated as follows:
  [xx;yyy]
  xx = WordPerfect map number
  yyy = character number

Two of the sub-components (f, Onze Taal and i, Sociaal Ontwerpen) contained a small number of these, namely [04;002] and [04;014]. These have been encoded with entity references &obc1; and &obc2; respectively. It is not at all clear what the intended characters are. &obc2; occurs only 11 times, on a line by itself. &obc1; occurs in a number of contexts which suggest that no single interpretation is possible, e.g. "DERRI[04;002]RE", which suggests E-grave, and "d'[04;002]uvres", which suggests the oe digraph.

While every attempt has been made to validate the procedure actually carried out to translate the encodings, with a corpus of this size no guarantees can be given that there were no mistakes, either of omission or commission. We will gladly receive any queries or suggestions of error and will follow them up if at all possible.

I.3 Punctuation

No changes have been made here. Subject to the exceptions noted below in section II, soft hyphens appear to have been removed, but a lengthy examination to check this has not been done. We note uses of "-" for punctuation, hard hyphenation and morpheme-level conjunction (e.g. "binnen- en buitenland").

II.
Issues Specific to One or More Sub-components

II.1 dut02a, 8 O'Clock News

Each is a broadcast, with component parts given as . The 'n' attribute on provides some taxonomy of parts.
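
As an aside on section I.2, the encoding of the two [xx;yyy] simulations as entity references can be sketched as below. This is a minimal illustration and not the conversion actually run: the function and table names are hypothetical, and the pattern assumes a two-digit map number and a three-digit character number, as in the two attested cases.

```python
import re

# Entity names as given in section I.2; the intended characters are unknown.
WP_ENTITIES = {("04", "002"): "&obc1;", ("04", "014"): "&obc2;"}

# Assumed shape of a simulation: [xx;yyy] with xx two digits, yyy three digits.
PLACEHOLDER = re.compile(r"\[(\d{2});(\d{3})\]")


def encode_placeholders(text: str) -> str:
    """Replace known [xx;yyy] simulations with their entity references."""
    def substitute(match: re.Match) -> str:
        key = (match.group(1), match.group(2))
        # A simulation with no known entity is left exactly as found.
        return WP_ENTITIES.get(key, match.group(0))

    return PLACEHOLDER.sub(substitute, text)


print(encode_placeholders("DERRI[04;002]RE"))  # DERRI&obc1;RE
```

Since each simulation maps to exactly one entity, the substitution can be undone by the inverse table.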

is redundant, there being one per , often subsuming more than one apparent actual paragraph, but as there was no simple and reliable structural indication of real paragraph boundaries, they are not marked.

II.2 dut02b, Queen's Speech

There is a one-to-one correspondence between original files, speeches and

.

therefore always subsumes more than one apparent actual paragraph, but as there was no simple and reliable structural indication of real paragraph boundaries, they are not marked.

II.3 dut02c, Children's News

Each is a broadcast, with a date and component parts given as . Each has a and a

, which again is redundant and potentially composite without being marked as such.

II.4 dut02d, De Limburger

This is a larger and much more complex sub-component than any of the others. There was considerable inconsistency in how various structural aspects of the newspaper articles were rendered, there were numerous gaps in the material, and there was some inconsistency even at the character level. Although a considerable amount of work was invested in recovering structure and in distinguishing the text which makes up the substance of the newspaper articles from headlines, summaries, glosses, captions, etc., there are undoubtedly still some failures to apply the markup consistently.

II.4.1 Structure

Articles in the original began with a small number of fields marked with tags e.g. , giving date, author, and other such information. These were followed by headlines, summary paragraphs, fuller statements of authorship, datelines, photo credits, etc., variously indicated by blank lines, narrow margins, capitalisation, etc., and then by the main text, composed of blank-line-separated paragraphs. This is all marked up as , with , , , preceding a sequence of

, sometimes followed by a further byline. DEPT, CUST and DPTH tags in the original are translated into "key:value" annotations in the 'rend' attribute of . When articles are split over several pages, this is sometimes detectable, in which case there will be a subsequent . This annotation is also used where gaps in the text appear to cause two articles to run together in the original. A header may sometimes refer to subsequent material "in kader" -- in a box. When the 'boxed' material is detectable subsequently, there will be a marking it. The net consequence of all this is that text with

...

should be both reasonably clean and not spuriously duplicated, but text elsewhere should be taken with a grain or more of salt.

II.4.2 Rendering information

'Type' and 'rend' attributes on give some idea of how the markup was derived:

  type=SPA or Spatje: Header tag in the original
  type=pm: Header boundary marked by ¶, º and - rather than blank line(s)
  type=post-gap: Headers marked by hand following mid-article break of some kind
  type=back-pointer: Contains reference to preceding portion of article (i.e. "Continued from page 1")
  rend=blb: One or more blank lines preceding -- s without this tag followed on directly from the previous element
  rend=narrow: Header marked out because it was composed of lines all roughly width 20 or less
  rend=mid: Header marked out because it was composed of lines all with width roughly between 20 and 40

A "pl:(number)" 'rend' attribute on

marks cases where paragraphs were separated from their predecessors by other than a single blank line, in which case the number gives the number of blank lines. Note this is NOT the same as for above, where the unmarked case is NO blank lines preceding.

In a small number of cases (48) paragraphs were seen to begin with one or more words in all caps, preceded by the annotation "Unciaal:". This has been taken to be rendering information, and transformed into ... around the upper-cased words.

Other values which appear in the 'rend' attribute are taken directly from the original where it was apparent that they were typographic instructions rather than content, e.g. "3kol" preceding a head was taken to mean "set this in three columns" and so moved to 'rend'. Note that multiple values for the 'rend' attribute are separated by commas (,).

II.4.3 Gaps

The annotations and \n<* * *>\n occurred throughout the original text (9183 and 156 times, respectively), and by and large appear to indicate gaps in the original, as they usually disrupt the continuity of the text at various levels. These original tags have been translated to occurrences of the tag , with the value of the attribute 'desc' set to "lacuna" or "large lacuna", respectively, with the 'resp' attribute given as 'inl', although we do not actually know at what point in the capture process these gaps were introduced. Narrow-set articles are an exception to all of this, see below.

Note that in the most recent releases of the TEI P2 DTDs, is renamed -- for backward compatibility with our earlier work we have used the P2 facilities for renaming to retain the name . We have also used P2 facilities to expand the range of allowed locations of this element.

II.4.4 Narrow-set articles and soft hyphens

In around 5% of the articles, the entire text has been set on very narrow margins. The original ends many of the short lines in these articles with the tag <<>>, apparently where a soft hyphen was required.
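
The treatment of these line-final <<>> tags, described in the remainder of this section, amounts to joining each tagged line directly to its successor. The following is a minimal sketch under that assumption, using the example line pair quoted in this section. The function name is hypothetical, and no dictionary check is attempted, so a tag that stood for a hard hyphen would be removed just the same.

```python
SOFT_HYPHEN_TAG = "<<>>"


def join_soft_hyphens(lines):
    """Join every line ending in <<>> to the line that follows it.

    Hypothetical helper: the tag is dropped and the two word halves are
    concatenated, so the line break between them is lost.
    """
    joined = []
    carry = ""  # text of a tagged line, waiting for its continuation
    for line in lines:
        line = carry + line
        carry = ""
        if line.endswith(SOFT_HYPHEN_TAG):
            carry = line[: -len(SOFT_HYPHEN_TAG)]
        else:
            joined.append(line)
    if carry:
        # A tagged final line has nothing to join to; keep it without the tag.
        joined.append(carry)
    return joined


original = ["van kamerlid H. Jans<<>>", "sen dat binnenkort met de"]
print(join_soft_hyphens(original))
# ['van kamerlid H. Janssen dat binnenkort met de']
```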
In order to preserve the utility of the corpus for word searches etc., these have been 'corrected' to give the intended sequence of un-hyphenated words, e.g.

  Original:
    van kamerlid H. Jans<<>>
    sen dat binnenkort met de
  ECI version:
    van kamerlid H. Janssen dat binnenkort met de

All the usual dangers attendant on hyphen elimination in the absence of a comprehensive dictionary are present -- it is entirely possible that some of these tags stood for HARD hyphens which have been mistakenly removed.

Curiously enough, given its distribution elsewhere, the tag regularly occurred in the original BETWEEN the <<>> and the following line, but in none of the cases examined could any discontinuity be detected. Accordingly, these cases were treated as above and the tag eliminated, but a 'rend' attribute was used to flag the occurrence, e.g.

  Original:
    Koophandel gaat de econo<<>>
    mische structuur van deze
  ECI version:
    Koophandel gaat de economische structuur van deze

II.4.5 Character set normalisation

A few instances of uninterpretable characters have been replaced with self-substituting entities:

  ¨   ¨        umlaut overstrike, semantics unclear, some kind of tab?
  Ð   •        capital eth originally, probably some kind of bullet?
  ·   &myst;   middle dot originally, semantics completely unclear

º occurs as an alternative dash and/or paragraph separator. In the former role, it is replaced with ‐, defined as -. Undoubtedly some of these should actually be paragraph breaks.

Two characters in the normally unused range \200-\237 appear: \233 once, where it probably stands for %, and \216 nine times, looking totally superfluous, or possibly like a soft hyphen. These have been replaced with entities as follows:

  \233   &percent;   %
  \216   &shyp;      -

In the latter case, the entity only appears in the 'sic' attribute of a tag, with the clean word as the correction, e.g.
  Original:
    Schaes\216berg
  ECI version:
    Schaesberg

II.4.6 Non-reversible changes

¶, º and - (\266,\272,\055) were in various circumstances used as paragraph breaks, all marked up with

or

. The amount of paragraph leading within the header region of each article is not noted. The corrections of <<>> and \216 discussed above often involve the loss of line-break information.

II.5 dut02e, Newspaper Citation

A special-purpose element is used here, as there was nothing appropriate in the TEI inventory. Its structure is somewhat ad hoc, mirroring the structure indicated in the original by the use of various tags. The definition of is as follows:

Where meanings are numbered and/or marked with specialised sub-language codes, we have used with