Swedish treebank
https://doi.org/10.23695/51HR-EA34
A Swedish treebank built from recycled language resources
The Swedish treebank has come about through work by researchers at the universities of
Uppsala (Computational Linguistics, Department of Linguistics and Philology) and
Växjö (the Language Technology research group in the School of Mathematics and
Systems Engineering). The treebank is the result of harmonizing the linguistic
information in two existing Swedish language resources:
Talbanken, a corpus of Swedish written and transcribed spoken language from the 1970s, manually annotated with syntactic information according to a traditional Scandinavian approach.
SUC2 (Stockholm Umeå Corpus), a morphosyntactically annotated (all corpus words are tagged with part of speech and lemma), balanced corpus of published Swedish written language from the 1990s.
In brief, the harmonization process was as follows: Talbanken was annotated
with the morphosyntactic tags used in SUC in a semiautomatic process, and both
Talbanken and SUC were then automatically annotated with a phrase structure
version of Talbanken's original syntactic analysis. Because this syntactic
annotation is automatic, we can expect it to contain errors, particularly in SUC.
A preliminary evaluation of the annotation, presented at a post-conference
workshop at SLTC 2008, shows that the syntactic annotation is nonetheless very
useful in corpus-linguistic investigations.
Format, license and distribution
Format
The Swedish treebank is distributed in the TIGER-XML format, so that the
freely available TIGERSearch tool can be used with it. TIGERSearch can be downloaded
from the Institut für Maschinelle Sprachverarbeitung at the University of Stuttgart.
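Since TIGER-XML is ordinary XML, the files can also be read without TIGERSearch. The following is a minimal sketch in Python using only the standard library. It assumes the general TIGER-XML layout (sentences as `s` elements, tokens as `t` elements, phrase nodes as `nt` elements with `edge` children). The sample document, the attribute name `pos`, and the tag and label values shown are illustrative assumptions, not taken from the treebank itself, so check them against the header of the distributed files.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the general TIGER-XML shape.
# It is NOT an excerpt from the Swedish treebank; the attribute
# names and label values are assumptions for illustration only.
SAMPLE = """<corpus id="demo">
  <body>
    <s id="s1">
      <graph root="s1_500">
        <terminals>
          <t id="s1_1" word="Hon" pos="PN"/>
          <t id="s1_2" word="sover" pos="VB"/>
        </terminals>
        <nonterminals>
          <nt id="s1_500" cat="S">
            <edge label="SS" idref="s1_1"/>
            <edge label="FV" idref="s1_2"/>
          </nt>
        </nonterminals>
      </graph>
    </s>
  </body>
</corpus>"""


def read_sentences(xml_text):
    """Return one dict per sentence with its tokens and phrase nodes."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        # Terminals: one (word, part-of-speech) pair per token.
        tokens = [(t.get("word"), t.get("pos")) for t in s.iter("t")]
        # Nonterminals: phrase category plus the ids of its children.
        phrases = [
            (nt.get("cat"), [e.get("idref") for e in nt.findall("edge")])
            for nt in s.iter("nt")
        ]
        sentences.append({"id": s.get("id"), "tokens": tokens, "phrases": phrases})
    return sentences


if __name__ == "__main__":
    for sentence in read_sentences(SAMPLE):
        print(sentence["id"], sentence["tokens"], sentence["phrases"])
```

For a full corpus file, `ET.parse(path).getroot()` or `ET.iterparse` would replace `ET.fromstring`; the latter avoids loading the whole treebank into memory at once.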
License
The treebank part of the Swedish treebank, i.e., the added syntactic annotations,
is free under an open source license.
Talbanken is freely available for research and education purposes; it can
be downloaded here.
SUC requires each user to sign an individual license agreement with
the Department of Linguistics, Stockholm University. As of 1 December 2008, licensing of
SUC is entrusted to Språkbanken, University of Gothenburg. The license agreement can
be downloaded in PDF format here. Read more about SUC2 and SUC3 here.
To get access to SUC (and thereby to the Swedish treebank), you are required to
print out the license agreement form, sign it, and send it by ordinary mail to:
SUC-licens
Språkbanken
Institutionen för svenska, flerspråkighet och språkteknologi
Göteborgs universitet
Box 200
SE-405 30 Göteborg
Sweden
Upon receipt and approval of the agreement, we will contact you by email
with downloading instructions.
Distribution
The Swedish treebank is distributed by Språkbanken, University of Gothenburg.
See the preceding section for instructions, or contact us for more information
by emailing sb-info@svenska.gu.se.
If you already have a SUC license, you will get downloading instructions
and a password from us.
Others will first need to sign a SUC license agreement (see above).
References
If you wish to cite the Swedish treebank in a paper, please use the following reference:
Joakim Nivre, Beáta Megyesi, Sofia Gustafson-Capková, Filip Salomonsson and Bengt Dahlqvist (2008). Cultivating a Swedish Treebank.
In: Nivre, Dahllöf, and Megyesi (Eds), Resourceful Language Technology: Festschrift in Honor of Anna Sågvall Hein, pp. 111–120.
Uppsala: Acta Universitatis Upsaliensis.
http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-8933
You can give this Språkbanken page as its download location.

University of Gothenburg