Editorial Practices in the Markup of the Frankfurter Rundschau Corpus

The primary goal in marking up this large corpus has been to preserve as much as possible of the original while rendering it TEI-conformant and maintaining the integrity of the text. To this end almost all modifications are reversible. The only non-reversible changes which have been made are:

  Lines containing only blank characters have been rendered empty;
  Line-final blank characters have been elided;
  Line break information has been lost in a small number of cases (see below under II.10).

I. Structure

A modest level of structuring has been added in an effort to make the material more useable. Although we had very little information about the files, and no actual paper examples to compare with, some structure could be induced by examination. Text paragraphs were (usually) set off by one or more blank lines, with the first line indented (usually) three spaces. A sequence of such paragraphs was often preceded by a short headline, recognisable by being left-flush. Accordingly, a unit consisting of some paragraphs preceded by a headline is identified as , with the headline marked with and the paragraphs with <p>. Blank lines are removed, but 'rend' attributes are provided wherever this is done. The paragraph-initial spacing is treated similarly. The key 'pl' is used for blank lines (short for "paragraph leading"), and the key 'pi' is used for indentation (short for "paragraph indent"). Thus we have the following as an example of an original and its marked-up version:

----------
xxx
Shock Flash Horror: Umlauts Invade Text!
In a press conference yesterday the Committee for Public Freedom from Accents alleged that their efforts were being undermined by a hitherto unknown group calling itself "The Uncommittee for Unlimited Umlauts".
The conference had to be abandoned before any further details were forthcoming, however, äs thë prëss rëlëäsë wäs dïscövërëd tö bë cömplëtëlÿ dïsfïgürëd bÿ ä prölïfërätïön öf döüblë döts.
yyyy
------------------
Shock Flash Horror: Umlauts Invade Text!

In a press conference yesterday the Committee for Public Freedom from Accents alleged that their efforts were being undermined by a hitherto unknown group calling itself "The Üncommittee for Ünlimited Ümlauts".

The conference had to be abandoned before any further details were forthcoming, however, äs thë prëss rëlëäsë wäs dïscövërëd tö bë cömplëtëlÿ dïsfïgürëd bÿ ä prölïfërätïön öf döüblë döts.

---------------

Note that a leading of 1 and an indent of 3 are the default, and are left unmarked.

Material other than such articles is simply marked as a sequence of <p>, with blank lines removed and coded as for articles, but with line-initial whitespace untouched. Strictly speaking such a use of the 'p' tag is not TEI-compliant, as the contents of the elements so marked are not known to be "paragraphs in prose", indeed in many cases are obviously NOT "paragraphs in prose", nor should whitespace be allowed to appear to bear significance. Nonetheless we felt it was worthwhile to provide some level of structuring, and <p> within should thus be interpreted as "blank-line delimited text with (potentially) significant whitespace". See below under II.10 for an exception to the preservation of structure.

II. Character Set Normalisation

The material as provided to us was the output of a conversion programme from an EBCDIC original. This conversion to ISO 8859-1 was, as far as we can tell, entirely reliable for ordinary ASCII characters, and for a small number of accented characters used heavily in German (i.e. A/O/U/a/o/u umlaut, double-s, e-acute). But for a wide range of other accented characters and special symbols, the results were clearly wrong. Our handling of these inappropriate characters falls into three categories, depending on the kind of conversion carried out: Transparent, Semi-transparent and Opaque.

II.1 Transparent: Text characters mis-converted

By examining the examples and their context, we established the following correspondence:

  As supplied: ¡£¤¨ª«¬®¯°´µ·¸º»½¾ÀÁÂëïñ
  Corrected:   âàáñêëèîïìÀÁÅÇøÉãÈíóúòûù

An inspection of the EBCDIC code page will reveal that the majority of these "make sense", in that e.g. the ISO-8859 codes for ª«¬®¯° each differ from the EBCDIC codes for êëèîïì by 88, i.e. contiguous ranges of EBCDIC were translated into contiguous ranges in ISO-8859. These corrections were applied without any additional annotation.

II.2 Transparent: Symbols mis-converted

Only one of these could be identified reliably enough to be corrected as in II.1:

  As supplied: ò
  Corrected:   ·

II.3 Transparent: Escape sequence

It would appear that for characters missing from EBCDIC, underscore (_) was used as an escape, followed by a single letter code.
In one case, the intended (we think) character is available in ISO-8859 and this has been inserted instead:

  As supplied: _c
  Converted:   ©

II.4 Semi-transparent: Hapaxes and apparent errors

A number of characters occur only a very small number of times, either alone on a line (¥, NBS (\240), ¿), in the context of gibberish reminiscent of word-processing codes (í, È) or apparently explicitly marking some kind of dropout in the original (Ø). All these have been coded using the TEI tag , using the 'desc' attribute to retain the original material, e.g. . As a result, text-only processing will not see these, but they are recoverable. A related case is handled similarly: ÿ (\377) appeared as the last character at the end of each of the original files.

II.5 Semi-transparent: Bracketing markup

There are two cases where a pair of characters was used to delimit regions of text. The first, at a guess used to indicate italics or boldface, used the characters ³ and ü (note that since ü never occurs word-finally in German, this use could be distinguished from its ordinary textual use). The second, much less common and possibly indicating underlining, used the characters Ó and Ì.

Marking these up without loss of information was tricky, and no perfect solution offered itself. In the end we used a combination of two approaches, depending on whether or not the boundaries of the delimited region were clean (space/BOL/EOL) or not (letters/punctuation). In the clean cases, the delimited region (with delimiters removed) is marked with the tag, with 'rend' attribute either "3-um-bracket" or "Oacute-Igrave-bracket" as appropriate. In other cases, where to take this approach would introduce potential tokenisation errors, the approach has been to use the tag. The 'rend' attribute is used as above to indicate the type of bracket. The 'sic' attribute contains the original text, INCLUDING delimiters and all adjoining non-blank characters.
The same text, but with delimiters removed, is included as the body of the element. Two examples:

  Strecke ³Kosten zwischen 35 und 40 Millionen Markü führte die FR ein
  ==>
  Strecke Kosten zwischen 35 und 40 Millionen Mark führte die FR ein

  beschränkt werden (vgl. BVerfGE 44, 197 ³202ü). Die Bestimmungen des Soldatengesetzes,
  ==>
  beschränkt werden (vgl. BVerfGE 44, 197 202). Die Bestimmungen des Soldatengesetzes,

The crucial point is that, markup aside, a complete and unchanged sequence of tokens is presented.

There are some inconvenient further complexities here. On a few occasions, one bracket character appears without the other. On the basis of context, in the case of the I-O brackets, it is possible to infer the intended location of the other bracket, and the markup is done on that basis, except that the 'rend' attribute is altered to "-Igrave-bracket" or "Oacute--bracket" to show which bracketing character was missed. In the case of the 3-um brackets, it has not been possible to infer the location of the missing close bracket, and the tag has been used as described in II.4 above to remove the offending character. Note that because of the high frequency of legitimate occurrences of ü, there MAY be cases where it appears in error as an unbalanced right bracket which have gone undetected.

Note finally that in the few cases where the approach is used and the bracketed text extends over several lines, the line breaks are lost in the value of the 'sic' attribute, to preserve the "all lines are either all text or all markup" invariant (see the Editorial Declaration for the ECI corpus as a whole).

II.6 Opaque: Glyph missing from ISO 8859

In a number of cases, the intended glyph is missing from ISO 8859. In these cases SGML entities are used, and given their own name as definition, e.g. .
See the corpus header for the corpus as a whole for a description of the mechanism provided whereby these definitions can be overridden where local applications support some means of printing these entities more directly. It is not always clear from context what glyph is meant, so the names below should be taken with a grain of salt:

  As Supplied   Replace with   Comment
  Ù             &dm;           Deutschemark
  ã             ã              Some kind of cardinal number suffix?
  å             &metresq;      Metres squared
  ÷             ÷              Only two examples, not at all clear
  þ             •              Unclear, occurs in headlines
  _g            &degree;       For temperatures, mostly
  _k            †              Dagger, as indicating deceased, maybe?
  _m            &sqmet;        Metres squared

II.7 Opaque: Clashes

In some cases it appears that two different means have been used to introduce the same glyph. Rather than risk losing information by converting both to the apparently intended target, we've used an entity in the less common case, but given it an appropriate definition, either with a real character (those marked with * below) or an identity map e.g. , :

  As Supplied   Replace with   Comment
  ý             ø*             Appears in Norwegian, as opposed to Swedish/Danish (where we find º)
  ó             &blt;          Another bullet-like character?
  Ú             &rquote;*      See below.
  ¹             &degr;         Another degree symbol?
  _o            &bullt;        Yet another bullet?

The case of Ú is complicated. It appears in all the places where apostrophe (') appears: i.e. in contractions and paired with itself, or with pre-word comma, to indicate embedded or 'scare' quotes (e.g. gehtÚs; ,tragische Unruhe',; ,infektiösen KeimeÚ; ÚMonumentalbautenÚ), but since there is at least an in-principle difference between right single quote and apostrophe, we thought it best to preserve the difference in case right single quote was indeed intended.

II.8 Opaque: SGML-required

Appearances of & and < cannot be allowed to stand when followed by an alphabetic character, because of their reserved use by SGML, so they are replaced with &amp; and &lab; respectively, with appropriate definitions.
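The correspondence in II.1 is a one-to-one character substitution, so it can be sketched as a single-pass translation table. The Python below is an illustration only, not the procedure actually used on the corpus: the names SUPPLIED, CORRECTED, fix_transparent and offsets are hypothetical, and the use of code page 037 to check the offset of 88 is an assumption, since the exact EBCDIC code page of the original is not stated here.

```python
# Illustrative sketch only; the names are hypothetical, not corpus tools.

# The two rows of the II.1 correspondence, aligned position by position.
SUPPLIED = "¡£¤¨ª«¬®¯°´µ·¸º»½¾ÀÁÂëïñ"
CORRECTED = "âàáñêëèîïìÀÁÅÇøÉãÈíóúòûù"

# str.translate makes a single pass, so a character produced by one mapping
# (e.g. ë from «) is not mapped again by another entry (ë to ò).
TABLE = str.maketrans(SUPPLIED, CORRECTED)


def fix_transparent(text):
    """Apply the II.1 substitutions to text as supplied."""
    return text.translate(TABLE)


def offsets(supplied, intended, ebcdic="cp037"):
    """ISO 8859-1 code of each supplied character minus the EBCDIC code of
    the intended character. cp037 is an assumed code page."""
    return [
        s.encode("latin-1")[0] - i.encode(ebcdic)[0]
        for s, i in zip(supplied, intended)
    ]


print(fix_transparent("tr¬s"))
print(offsets("ª«¬®¯°", "êëèîïì"))
```

Because the table is applied in one pass, the order of the entries does not matter, even though ë, ï and ñ occur in both rows. The context-dependent cases of II.9 would need separate handling.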
II.9 Mixed cases

There are two characters whose interpretation is contextually determined. The paragraph symbol § is corrected transparently to ç when it occurs within a word, otherwise left alone. The character ô also stands for itself in some circumstances (when it occurs on its own in a word), although as such it often appears where we would have expected long o in the romanisation of various Japanese words. But when it occurs paired with itself, it indicates superscripting of the enclosed material. Since this is always adjacent to other text, we have used the approach described above in II.5 with 'rend' attribute of "super", for example:

  XVôthô ==> XVth

II.10 Transparent, Semi-transparent, Opaque

It should now be clear what is meant by this. The transparent modifications leave no trace. The semi-transparent modifications leave (nearly) no trace in the text, only in the markup. The opaque modifications actually affect the text itself.

Why the qualification as regards semi-transparent? The usage outlined above of , and , together with the universal text-markup invariant, means that linebreaks within articles and both whitespace and linebreaks within non-articles may not be as in the original in some cases. As there are only 179 instances of these tags in the whole corpus, the impact of this is in practice likely to be slight.

While every attempt has been made to validate the procedure actually carried out to perform the above transformations, including the inspection of virtually all the low-frequency (< 3,000 instances) cases and the careful comparison of 'before' and 'after' character histograms, with a corpus of this size no guarantees can be given that there were no mistakes, either of omission or commission. We will gladly receive any queries or suggestions of error and will follow them up if at all possible.
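The choice in II.5 between the two markup approaches turns on whether a delimited region has clean boundaries. As a rough sketch of that test for the ³...ü pair (not the procedure actually used; the function name and the regular expression are illustrative assumptions), the following Python finds regions on a single line and reports whether each is bounded by whitespace or a line edge. It treats ü as a closing bracket only when no letter follows it, relying on the observation that ü never occurs word-finally in German, and it ignores unbalanced brackets and regions spanning several lines.

```python
import re

# A region opens at "³" and closes at the first "ü" not followed by a letter.
# [^\W\d_] matches any letter, so the negative lookahead rejects a
# word-internal ü such as the one in "für".
REGION = re.compile(r"³(.+?)ü(?![^\W\d_])")


def bracket_regions(line):
    """Return (text, is_clean) for each ³...ü region found on the line."""
    found = []
    for m in REGION.finditer(line):
        # Start and end of line count as clean boundaries, like a space.
        before = line[m.start() - 1] if m.start() > 0 else " "
        after = line[m.end()] if m.end() < len(line) else " "
        found.append((m.group(1), before.isspace() and after.isspace()))
    return found


for sample in (
    "Strecke ³Kosten zwischen 35 und 40 Millionen Markü führte die FR ein",
    "beschränkt werden (vgl. BVerfGE 44, 197 ³202ü). Die Bestimmungen",
):
    print(bracket_regions(sample))
```

On the two examples from II.5, the first region is reported as clean and the second as not clean, because its closing ü is followed directly by a parenthesis.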
II.11 A little assistance for simple text measures

To assist in the simple processing of plain text where large-scale coherence is NOT required, those paragraphs which contain one or more entity references are marked with a type attribute with value "messy". There are around 4,000 of these out of over 670,000 paragraphs in total. By using a filter to avoid such paragraphs, perhaps restricting to those within , and also eliminating markup lines, a very large amount of plain ISO Latin-1 German text can be obtained very easily.

III. Punctuation

No changes have been made here. Soft hyphens appear to have been removed, but a lengthy examination to check this has not been done. We note uses of "-" for punctuation, hard hyphenation and morpheme-level conjunction (e.g. "Haupt- und Presseamt"). The character "¦" occurs rarely with unclear semantics, otherwise the punctuation is quite modest.

IV. Some Data is Compressed

Owing to a tight squeeze for space on the CD, this, the largest corpus in the ECI/MCI, has had to be partially compressed. Sources and/or binaries for the decompression utility gunzip can be found in the relevant architecture-specific subdirectories of src and/or bin. To restore the corpus to its complete state for SGML processing, uncompress the files with extensions .egz to have extensions .eci in a directory off the CD, and include that directory in the search path for the SGML application. If using sgmls via the provided eci script, simply set ECI_PATCH to "uncompressed_dir/%T:" and set the parameter entity %ger03.expanded to INCLUDE, e.g. by using '-iger03.expanded' when invoking eci.
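Where gunzip itself is not to hand, the restoration step described in IV can be approximated with Python's standard gzip module, on the assumption that the .egz files are ordinary gzip streams (as the reference to gunzip suggests). The function name and the directory arguments below are hypothetical; this is a sketch, not a script supplied with the corpus.

```python
import gzip
import shutil
from pathlib import Path


def expand_egz(cd_dir, out_dir):
    """Decompress every *.egz directly inside cd_dir into out_dir as *.eci.

    Assumes the .egz files are plain gzip data. Both directories are chosen
    by the caller; neither name is fixed by the corpus.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(cd_dir).glob("*.egz")):
        dest = out / (src.stem + ".eci")
        # Stream the data so that large corpus files need not fit in memory.
        with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        written.append(dest)
    return written
```

The output directory would then be the one added to the search path of the SGML application, as described above.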