A Fragment of SUC: the Stockholm-Umeå Corpus of modern written Swedish.

Dept. of linguistics, Stockholm University; Dept. of linguistics, Umeå University

Swedish Research Council of the Humanities; Swedish National Board for Industrial and Technical Development

Eva Ejerhed and Gunnel Källgren: principal authors
Gunnel Källgren: compilation of the corpus
Eva Ejerhed, Gunnel Källgren, Ola Wennstedt, Magnus Åström: construction of tag set
Gunnar Eriksson, Britt Hartmann, Ola Wennstedt: research assistants
Britt Hartmann: editing and proofreading
Gunnar Eriksson, Magnus Åström: programming
Joana Arnesson, Ulrika Kvist, Cecilia Lyckow, Jan Lindberg, Ola Wennstedt: annotation

A subset of the 1M-word Stockholm-Umeå Corpus (SUC), created especially for the European Corpus Initiative. 31 October 1992. This edition does not show the final version of the SUC annotation scheme, which is liable to change in the complete edition to appear in 1994.

Approximately 300,000 graphic words with annotated linguistic analyses.

Institute of linguistics, Stockholm University; Institute of linguistics, Umeå University

S-106 91 Stockholm, Sweden; S-901 87 Umeå, Sweden

Available only for non-commercial purposes, and only through the above authorities.

31 October 1992

The full bibliographic references of the texts can be found in the file "suc-eci.bibl".

&suc-eci-bibliography;

The aim of the SUC project is to develop and test models and methods for automatic analysis of unrestricted text. As part of the project, a large corpus is being created and analyzed. The present edition is a subset of that larger corpus.

150 text samples of approximately 2,000 graphic words each. The samples can be entire articles, excerpts from longer texts, or composites of shorter texts.

The language of the texts is modern written Swedish. The texts must have been first published in 1990 or later. To obtain a balanced corpus with high coverage, the texts are selected and classified according to criteria corresponding to those of the Brown and LOB corpora. Excerpts have been chosen at random but, as far as possible, as coherent and delimited stretches of text.

The corpus is based almost entirely on texts that were available in machine-readable form, and relies only to a minor extent on scanned or manually typed material.

Clear misprints have been corrected without recording the corrections.

All quotation marks are retained.

All 'soft' hyphenation is removed.

The texts of this corpus fragment are at present segmented into paragraphs. The character @ is used as delimiter.
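The paragraph segmentation described above can be recovered with a simple split. The following is a minimal Python sketch, not part of the corpus distribution: the function name `split_paragraphs` is hypothetical, and the sketch assumes that `@` never occurs inside running text and that empty pieces (for example after a trailing delimiter) can be discarded.

```python
def split_paragraphs(text: str, delimiter: str = "@") -> list[str]:
    """Split raw corpus text into paragraphs on the delimiter character.

    Assumption: the delimiter only ever marks paragraph boundaries.
    Surrounding whitespace is trimmed and empty pieces are dropped.
    """
    pieces = (piece.strip() for piece in text.split(delimiter))
    return [piece for piece in pieces if piece]


# Hypothetical two-paragraph sample, not taken from the corpus.
sample = "Första stycket .@Andra stycket .@"
paragraphs = split_paragraphs(sample)
```

Whether the delimiter precedes or follows each paragraph in the actual files is not stated here; dropping empty pieces makes the sketch indifferent to that choice.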

Each word is tagged with its part of speech and morphological information. The tagging principles are described in Ejerhed et al.: "The Linguistic Annotation System of the Stockholm-Umeå Corpus Project, version 4.31", Publications from the Department of General Linguistics, University of Umeå no. 32. The tags derive from an automatic lexicon lookup in the SWETWOL computerized dictionary at the Department of General Linguistics, University of Helsinki, followed by automatic transduction and manual disambiguation and correction.

The n attribute of each text in the corpus carries a unique identifying code for the whole text. This code corresponds to an entry in the bibliographical database. The code is built up from a two-letter code for text type and a three-digit running enumeration within each text type. Where a 2,000-word unit is composite, i.e., consists of several shorter texts, each of those has a lower-case letter added to the code.
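The code structure just described (two letters, three digits, optional lower-case letter) can be checked mechanically. The sketch below is illustrative only: `TextCode` and `parse_text_code` are hypothetical names, the example codes are invented, and accepting either case for the two text-type letters is an assumption, since the description does not specify it.

```python
import re
from typing import NamedTuple, Optional


class TextCode(NamedTuple):
    text_type: str        # two-letter text type code
    number: int           # three-digit running number within the text type
    part: Optional[str]   # lower-case letter for a part of a composite unit


# Assumption: text-type letters may be upper or lower case.
_CODE_RE = re.compile(r"([A-Za-z]{2})(\d{3})([a-z]?)")


def parse_text_code(code: str) -> TextCode:
    """Split an identifying code into text type, number and optional part."""
    match = _CODE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a valid text code: {code!r}")
    text_type, number, part = match.groups()
    return TextCode(text_type, int(number), part or None)


# Invented example codes, for illustration only.
whole_text = parse_text_code("aa001")
composite_part = parse_text_code("kk014b")
```

Using `fullmatch` rejects codes with trailing material, so a malformed value fails loudly instead of being silently truncated.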

All words and typographic signs are numbered, and a canonical reference may be made up by concatenating the n value of the text and the word number.
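As a rough illustration of how such a canonical reference could be built, the Python sketch below joins a text's n value and a word number. The function name `canonical_reference`, the optional `separator` parameter, and the example values are all assumptions; the description above only says the two values are concatenated, so the default is a direct join.

```python
def canonical_reference(text_code: str, word_number: int, separator: str = "") -> str:
    """Build a reference from a text's n value and a word number.

    Assumption: word numbers are non-negative integers; the starting
    index used in the corpus is not stated in this description.
    """
    if word_number < 0:
        raise ValueError("word number must not be negative")
    return f"{text_code}{separator}{word_number}"


# Invented values, for illustration only.
plain = canonical_reference("aa001", 17)
dotted = canonical_reference("aa001", 17, separator=".")
```

A visible separator is not part of the description, but it may be convenient in downstream tools because a direct join of digits can be harder to read back.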

Press: Reportage
    Political; Community; Financial; Cultural; Sports; Spot News
Press: Editorial
    Institutional; Debate articles
Press: Reviews
    Books; Films; Art; Theatre; Music; Artists, shows; Radio, TV
Skills and Hobbies
    Hobbies, amusements; Society press; Occupational and trade union press; Religion
Popular Lore
    Humanities; Behavioural sciences; Social sciences; Religion; Complementary life styles; History; Health and medicine; Natural science, technology; Politics; Culture
Biographies, essays
    Biographies, memoirs; Essays
Miscellaneous
    Federal publications; Municipal publications; Financial reports, business; Financial reports, non-profit organisations; Internal publications, companies; University publications
Learned and scientific writing
    Humanities; Behavioural sciences; Social sciences; Religion; Technologies; Mathematics; Medicine; Natural science
General fiction
Science fiction and mysteries
Light reading, male audience
Light reading, female audience
Humour

Swedish

&aa01.wrapper; &aa02.wrapper; &aa03.wrapper; &aa04.wrapper; &aa05.wrapper; &aa09.wrapper; &aa10.wrapper; &aa13.wrapper; &ab01.wrapper; &ab02.wrapper; &ab05.wrapper; &ab06.wrapper; &ab07.wrapper; &ac01.wrapper; &ac02.wrapper; &ad01.wrapper; &ad03.wrapper; &ae01.wrapper; &ae02.wrapper; &ae05.wrapper; &ae07.wrapper; &af01.wrapper; &af02.wrapper; &af06.wrapper; &af07.wrapper;
&ba01.wrapper; &ba02.wrapper; &ba03.wrapper; &ba05.wrapper; &ba06.wrapper; &ba07.wrapper; &bb01.wrapper; &bb02.wrapper; &bb03.wrapper; &bb04.wrapper; &bb05.wrapper; &bb07.wrapper; &bb09.wrapper;
&ca01.wrapper; &ca02.wrapper; &ca03.wrapper; &ca05.wrapper; &ca07.wrapper; &cb01.wrapper; &cb02.wrapper; &cc01.wrapper; &cc02.wrapper; &cc03.wrapper; &cd01.wrapper; &cd02.wrapper; &ce01.wrapper; &ce02.wrapper; &ce03.wrapper; &cf01.wrapper; &cg01.wrapper; &cg02.wrapper; &cg03.wrapper;
&ea01.wrapper; &ea02.wrapper; &ea03.wrapper; &ea04.wrapper; &ea05.wrapper; &ea06.wrapper; &ea07.wrapper; &ea08.wrapper; &eb01.wrapper; &eb02.wrapper; &eb03.wrapper; &eb05.wrapper; &ec02.wrapper; &ec03.wrapper; &ec04.wrapper; &ec05.wrapper; &ec06.wrapper; &ec08.wrapper; &ec09.wrapper; &ed01.wrapper; &ed02.wrapper; &ed03.wrapper;
&fa01.wrapper; &fa02.wrapper; &fb01.wrapper; &fb02.wrapper; &fb03.wrapper; &fb04.wrapper; &fh01.wrapper; &fh02.wrapper; &fh03.wrapper; &fh04.wrapper; &fh05.wrapper; &fj01.wrapper; &fk01.wrapper; &fk02.wrapper;
&ga01.wrapper; &ga02.wrapper; &gb03.wrapper;
&ha01.wrapper; &ha02.wrapper; &ha03.wrapper; &ha04.wrapper; &ha05.wrapper; &ha06.wrapper; &ha07.wrapper; &ha08.wrapper; &ha09.wrapper; &ha10.wrapper; &ha11.wrapper; &ha12.wrapper; &ha13.wrapper; &ha14.wrapper; &ha15.wrapper; &hb01.wrapper; &hb02.wrapper; &hb03.wrapper; &hb04.wrapper; &hb05.wrapper; &hb06.wrapper; &hb07.wrapper; &hb08.wrapper; &hb09.wrapper; &hb10.wrapper; &hb11.wrapper; &hb12.wrapper; &hb13.wrapper; &hb14.wrapper; &hb15.wrapper; &hb16.wrapper; &hb17.wrapper; &hb18.wrapper;
&ja01.wrapper; &ja02.wrapper; &ja03.wrapper; &ja04.wrapper; &ja05.wrapper; &ja06.wrapper; &ja07.wrapper; &ja08.wrapper; &ja09.wrapper;
&kk03.wrapper; &kk04.wrapper; &kk05.wrapper; &kk06.wrapper; &kk07.wrapper; &kk09.wrapper; &kk10.wrapper; &kk11.wrapper; &kk12.wrapper; &kk13.wrapper; &kk14.wrapper;
&rr02.wrapper;