<codeBook xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" xmlns="ddi:codebook:2_5">
  <docDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv">SweDN 1.0</titl>
        <parTitl xml:lang="en">SweDN 1.0</parTitl>
        <IDNo agency="SND">doi-10-23695-36v9-9017-0</IDNo>
        <IDNo agency="DOI">https://doi.org/10.23695/36V9-9017</IDNo>
      </titlStmt>
      <prodStmt>
        <producer xml:lang="en" abbr="SND">Swedish National Data Service</producer>
        <producer xml:lang="sv" abbr="SND">Svensk nationell datatjänst</producer>
      </prodStmt>
      <holdings URI="https://doi.org/10.23695/36V9-9017">Landing page</holdings>
    </citation>
  </docDscr>
  <stdyDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv">SweDN 1.0</titl>
        <parTitl xml:lang="en">SweDN 1.0</parTitl>
        <IDNo agency="SND">doi-10-23695-36v9-9017-0</IDNo>
        <IDNo agency="DOI">https://doi.org/10.23695/36V9-9017</IDNo>
      </titlStmt>
      <rspStmt />
      <prodStmt />
      <distStmt>
        <distrbtr xml:lang="en" abbr="SND" URI="https://snd.se">Swedish National Data Service</distrbtr>
        <distrbtr xml:lang="sv" abbr="SND" URI="https://snd.se">Svensk nationell datatjänst</distrbtr>
        <distDate xml:lang="en" date="2024-01-01" />
      </distStmt>
      <verStmt>
        <version elementVersion="0" elementVersionDate="2024-01-01" />
      </verStmt>
      <holdings URI="https://doi.org/10.23695/36V9-9017">Landing page</holdings>
    </citation>
    <stdyInfo>
      <subject />
      <abstract xml:lang="en" contentType="abstract">I. IDENTIFYING INFORMATION

Title*
SWE-DN

Subtitle
A Swedish text summarization corpus

Created by*
Julius Monsen (julius.monsen@liu.se), Arne Jönsson (arne.jonsson@liu.se)

Publisher(s)*
Linköping University

Link(s) / permanent identifier(s)*
https://spraakbanken.gu.se/resurser/superlim

Abstract*
The SWE-DN corpus is based on 1,963,576 news articles from the Swedish newspaper Dagens Nyheter (DN) during the years 2000–2020. The articles are filtered to resemble the CNN/DailyMail dataset both regarding textual structure

Funded by*
SweClarin

Cite as
[1] A method for building non-English corpora for abstractive text summarization, Julius Monsen, Arne Jönsson, Proceedings of the CLARIN Annual Conference, 2021 

Related datasets
Similar to CNN/DailyMail; part of SuperLim 2.0 collection

II. USAGE

Key applications
Training text summarizers, both extractive and abstractive.

Intended task(s)/usage(s)
Given a text (article), provide its summary.

Recommended evaluation measures
Harmonic mean of BLEU and ROUGE; ROUGE, BERTScore, Coh-Metrix
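
As a rough illustration of the first measure, the harmonic mean of two scores could be computed as sketched below. This is not an official scorer for the dataset: the function name is made up for the example, both scores are assumed to be on the same scale (for instance 0 to 1), and the ROUGE variant to use is not specified in this sheet.

```python
def harmonic_mean(bleu, rouge):
    """Harmonic mean of two non-negative scores on the same scale."""
    if bleu + rouge == 0:
        # Both scores are zero; define the mean as zero to avoid dividing by zero.
        return 0.0
    return 2 * bleu * rouge / (bleu + rouge)
```

The harmonic mean is pulled toward the lower of the two scores, so a system cannot reach a high value by doing well on only one of the metrics.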

Dataset function(s)
Model development

Recommended split(s)
"The articles in the dataset fall into five categories: domestic news, economy, sports, culture, other. The training set consists of the first three categories (78% of the dataset), the development set contains the fourth category (12%), and the test set contains the fifth category (10%). The purpose is to have a cross-domain split, which helps evaluate the model's ability to generalize to new data. The 'other' category was chosen for the test set as the most diverse one (and presumably the most difficult)."

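The category-based split described above could be reproduced roughly as follows. This is a minimal sketch under stated assumptions: the category labels and the "article_category" field name are hypothetical (check the actual files), and the fourth category (culture) is assumed to serve as a development set.

```python
# Hypothetical category labels; the real files may use different values.
SPLIT_BY_CATEGORY = {
    "domestic": "train",
    "economy": "train",
    "sports": "train",
    "culture": "dev",   # assumption: the fourth category is the development set
    "other": "test",
}

def split_entries(entries, category_key="article_category"):
    """Group entries into train/dev/test lists by their article category."""
    splits = {"train": [], "dev": [], "test": []}
    for entry in entries:
        split_name = SPLIT_BY_CATEGORY[entry[category_key]]
        splits[split_name].append(entry)
    return splits
```

Because the split follows the category rather than a random draw, no category is shared between training and evaluation data.
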
III. DATA

Primary data*
Text

Language*
Swedish

Dataset in numbers*
38,121 news articles with corresponding preambles

Nature of the content*
News texts

Format*
JSONL and TSV files with id, headline, summary, article and article category. An additional file with various statistics for each entry (including length measures, embedding similarity and article category) can be accessed at Språkbanken's website. The entries can be matched using the ids.
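
Matching the entries with the statistics file by id might look like the sketch below. The key name "id" and the assumption that the statistics are also available as JSONL are illustrative only; the actual key names and file formats should be checked against the downloaded files.

```python
import json

def read_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]

def join_by_id(entries, stats, id_key="id"):
    """Attach the statistics record with the same id to each entry."""
    stats_by_id = {row[id_key]: row for row in stats}
    return [
        # Entries without a matching statistics record get None.
        {**entry, "stats": stats_by_id.get(entry[id_key])}
        for entry in entries
    ]
```

Building a dictionary keyed by id first keeps the join linear in the number of entries instead of scanning the statistics list once per entry.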

Data source(s)*
Dagens Nyheter news texts from 2000–2020

Data collection method(s)*
Received 1,936,576 news articles from Dagens Nyheter

Data selection and filtering*
Filtered to resemble the CNN/DailyMail dataset, see [1]

Data preprocessing*
See [1]

Data labeling*
None

Annotator characteristics

IV. ETHICS AND CAVEATS

Ethical considerations

Things to watch out for

V. ABOUT DOCUMENTATION

Data last updated*
20221217, Julius Monsen

Which changes have been made, compared to the previous version*
First data release

Access to previous versions

This document created*
20221206, Arne Jönsson

This document last updated*
20230203, Aleksandrs Berdicevskis

Where to look for further details

Documentation template version*
v1.1

VI. OTHER

Related projects

References
[1] A method for building non-English corpora for abstractive text summarization, Julius Monsen, Arne Jönsson, Proceedings of the CLARIN Annual Conference, 2021</abstract>
      <abstract xml:lang="sv" contentType="abstract">I. IDENTIFYING INFORMATION

Title*
SWE-DN

Subtitle
A Swedish text summarization corpus

Created by*
Julius Monsen (julius.monsen@liu.se), Arne Jönsson (arne.jonsson@liu.se)

Publisher(s)*
Linköping University

Link(s) / permanent identifier(s)*
https://spraakbanken.gu.se/resurser/superlim

Abstract*
The SWE-DN corpus is based on 1,963,576 news articles from the Swedish newspaper Dagens Nyheter (DN) during the years 2000–2020. The articles are filtered to resemble the CNN/DailyMail dataset both regarding textual structure

Funded by*
SweClarin

Cite as
[1] A method for building non-English corpora for abstractive text summarization, Julius Monsen, Arne Jönsson, Proceedings of the CLARIN Annual Conference, 2021 

Related datasets
Similar to CNN/DailyMail; part of SuperLim 2.0 collection

II. USAGE

Key applications
Training text summarizers, both extractive and abstractive.

Intended task(s)/usage(s)
Given a text (article), provide its summary.

Recommended evaluation measures
Harmonic mean of BLEU and ROUGE; ROUGE, BERTScore, Coh-Metrix

Dataset function(s)
Model development

Recommended split(s)
"The articles in the dataset fall into five categories: domestic news, economy, sports, culture, other. The training set consists of the first three categories (78% of the dataset), the development set contains the fourth category (12%), and the test set contains the fifth category (10%). The purpose is to have a cross-domain split, which helps evaluate the model's ability to generalize to new data. The 'other' category was chosen for the test set as the most diverse one (and presumably the most difficult)."

III. DATA

Primary data*
Text

Language*
Swedish

Dataset in numbers*
38,121 news articles with corresponding preambles

Nature of the content*
News texts

Format*
JSONL and TSV files with id, headline, summary, article and article category. An additional file with various statistics for each entry (including length measures, embedding similarity and article category) can be accessed at Språkbanken's website. The entries can be matched using the ids.

Data source(s)*
Dagens Nyheter news texts from 2000–2020

Data collection method(s)*
Received 1,936,576 news articles from Dagens Nyheter

Data selection and filtering*
Filtered to resemble the CNN/DailyMail dataset, see [1]

Data preprocessing*
See [1]

Data labeling*
None

Annotator characteristics

IV. ETHICS AND CAVEATS

Ethical considerations

Things to watch out for

V. ABOUT DOCUMENTATION

Data last updated*
20221217, Julius Monsen

Which changes have been made, compared to the previous version*
First data release

Access to previous versions

This document created*
20221206, Arne Jönsson

This document last updated*
20230203, Aleksandrs Berdicevskis

Where to look for further details

Documentation template version*
v1.1

VI. OTHER

Related projects

References
[1] A method for building non-English corpora for abstractive text summarization, Julius Monsen, Arne Jönsson, Proceedings of the CLARIN Annual Conference, 2021</abstract>
      <sumDscr />
    </stdyInfo>
    <method>
      <dataColl />
    </method>
    <dataAccs>
      <useStmt>
        <restrctn xml:lang="en">Access to data through an external actor. </restrctn>
        <restrctn xml:lang="sv">Åtkomst till data via extern aktör. </restrctn>
      </useStmt>
    </dataAccs>
    <othrStdyMat />
  </stdyDscr>
</codeBook>