<codeBook xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" xmlns="ddi:codebook:2_5">
  <docDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv">Svensk analogi 2.0</titl>
        <parTitl xml:lang="en">Swedish analogy 2.0</parTitl>
        <IDNo agency="SND">doi-10-23695-b2m4-5y87-0</IDNo>
        <IDNo agency="DOI">https://doi.org/10.23695/B2M4-5Y87</IDNo>
      </titlStmt>
      <prodStmt>
        <producer xml:lang="en" abbr="SND">Swedish National Data Service</producer>
        <producer xml:lang="sv" abbr="SND">Svensk nationell datatjänst</producer>
      </prodStmt>
      <holdings URI="https://doi.org/10.23695/B2M4-5Y87">Landing page</holdings>
    </citation>
  </docDscr>
  <stdyDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv">Svensk analogi 2.0</titl>
        <parTitl xml:lang="en">Swedish analogy 2.0</parTitl>
        <IDNo agency="SND">doi-10-23695-b2m4-5y87-0</IDNo>
        <IDNo agency="DOI">https://doi.org/10.23695/B2M4-5Y87</IDNo>
      </titlStmt>
      <rspStmt />
      <prodStmt />
      <distStmt>
        <distrbtr xml:lang="en" abbr="SND" URI="https://snd.se">Swedish National Data Service</distrbtr>
        <distrbtr xml:lang="sv" abbr="SND" URI="https://snd.se">Svensk nationell datatjänst</distrbtr>
        <distDate xml:lang="en" date="2024-01-01" />
      </distStmt>
      <verStmt>
        <version elementVersion="0" elementVersionDate="2024-01-01" />
      </verStmt>
      <holdings URI="https://doi.org/10.23695/B2M4-5Y87">Landing page</holdings>
    </citation>
    <stdyInfo>
      <subject />
      <abstract xml:lang="en" contentType="abstract">I. IDENTIFYING INFORMATION

Title*
Swedish analogy test set v1.1

Subtitle
Swedish semantic and syntactic similarity test set

Created by*
Tosin Adewumi (tosin.adewumi@ltu.se), ML Group, LTU

Publisher(s)*
Språkbanken Text (sb-info@svenska.gu.se)

Link(s) / permanent identifier(s)*
https://spraakbanken.gu.se/en/resources/superlim

License(s)*
CC BY 4.0

Abstract*
The Swedish analogy test set follows the format of the original Google analogy test set. It is larger, however, and balanced across the two major categories: of 20,638 samples in total, 10,381 are semantic and 10,257 are syntactic. It is also roughly balanced across the syntactic subsections. There are 5 semantic and 6 syntactic subsections. The dataset was constructed partly from the samples in the English version, with the help of Swedish translation tools, and was proofread by two native speakers (percentage agreement of 98.93%).

Funded by*
Vinnova (grant no. 2019-02996)

Cite as
[1]

Related datasets
Part of the SuperLim collection (https://spraakbanken.gu.se/en/resources/superlim).

II. USAGE

Key applications
Intrinsic evaluation of Swedish word embeddings

Intended task(s)/usage(s)
Given a word pair A and B and a word C, find a word D such that A is to B as C is to D (A:B::C:D)

Recommended evaluation measures
Accuracy
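
As an illustration of the task and the accuracy measure, here is a minimal sketch in Python. It assumes the common vector-offset approach (D is the word whose vector is closest to B - A + C), which this documentation does not prescribe. The toy vectors and the function names are invented for the example and are not part of the dataset.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def solve_analogy(vectors, a, b, c):
    # Vector-offset method (an assumption, not the dataset's own method):
    # D is the word closest to B - A + C, excluding the three given words.
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

def accuracy(vectors, samples):
    # Fraction of samples whose predicted D equals the gold 'label'.
    hits = sum(
        solve_analogy(vectors, s["pair1_element1"], s["pair1_element2"],
                      s["pair2_element1"]) == s["label"]
        for s in samples
    )
    return hits / len(samples)

# Toy 2-d vectors, invented for illustration only (not from the dataset).
toy_vectors = {
    "man": [1.0, 0.0],
    "kvinna": [1.0, 1.0],
    "kung": [3.0, 0.0],
    "drottning": [3.0, 1.0],
    "stad": [0.0, 5.0],
}
toy_samples = [
    {"pair1_element1": "man", "pair1_element2": "kvinna",
     "pair2_element1": "kung", "label": "drottning"},
]
print(accuracy(toy_vectors, toy_samples))
```

With real embeddings, the candidate set would be the model vocabulary rather than a five-word dictionary.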

Dataset function(s)
Few-shot training ('prompting'), testing

Recommended split(s)
A few-shot training set (aka 'prompt', 10%) and a test set (90%). The prompt was added with GPT-like models in mind. Models that do not need a prompt can ignore it.

III. DATA

Primary data*
Text

Language*
Swedish

Dataset in numbers*
Total of 20,638 samples: 10,381 semantic samples and 10,257 syntactic samples. These are split into 2,045 train samples and 18,593 test samples. The split was random; no effort was made to control the balance of syntactic and semantic samples in train and test.
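
The split sizes above can be checked with a short Python sketch. It is only an illustration of a random split with the documented counts. The seed and the function name are assumptions, and the sketch will not reproduce the published train and test files.

```python
import random

TOTAL, N_TRAIN, N_TEST = 20638, 2045, 18593  # counts from the documentation

def random_split(samples, n_train, seed=0):
    # Shuffle a copy and cut it; the seed is an arbitrary assumption.
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Integer ids stand in for the real samples.
train, test = random_split(range(TOTAL), N_TRAIN)
print(len(train), len(test), round(len(train) / TOTAL, 3))
```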

Nature of the content*
Each sample contains two pairs of words, so there are four related words per line.

Format*
TSV/JSONL with 5 columns/fields: four words and a category. The word to be predicted is called 'label'; the given words are called 'pair1_element1', 'pair1_element2', and 'pair2_element1'.
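
A minimal Python sketch of reading both variants into the same record shape follows. The column order, the absence of a header row in the TSV file, and the field name 'category' are assumptions made for illustration; the sample row is invented.

```python
import csv
import io
import json

# Assumed column order for the TSV variant; 'category' is a hypothetical
# name for the fifth field, which the documentation does not spell out.
FIELDS = ["pair1_element1", "pair1_element2", "pair2_element1", "label", "category"]

def read_tsv(text):
    # Parse tab-separated text (no header row assumed) into dicts.
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(FIELDS, row)) for row in reader if row]

def read_jsonl(text):
    # Parse one JSON object per non-empty line.
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Invented sample row, for illustration only.
row = ["man", "kvinna", "kung", "drottning", "family"]
tsv_text = "\t".join(row) + "\n"
jsonl_text = json.dumps(dict(zip(FIELDS, row))) + "\n"

print(read_tsv(tsv_text)[0]["label"])
print(read_tsv(tsv_text) == read_jsonl(jsonl_text))
```

If the distributed TSV file does have a header row, csv.DictReader would be the more natural choice.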

Data source(s)*
Partly based on the English version by: Mikolov, T., Chen, K., Corrado, G., &amp; Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. New additions were made using the following online tools: https://bab.la and https://en.wiktionary.org/wiki/

Data collection method(s)*
The dataset was compiled from part of the English version (Mikolov, T., Chen, K., Corrado, G., &amp; Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781), which was translated. An additional data source is en.wiktionary.org/wiki. Two native speakers of Swedish then proofread the finished version, and the inter-annotator agreement score was calculated.

Data selection and filtering*
The dataset was post-processed and corrected by Lars Borin and Aleksandrs Berdicevskis.

Data preprocessing*
Does not apply

Data labeling*
Does not apply

Annotator characteristics
Two Swedish native speakers

IV. ETHICS AND CAVEATS

Ethical considerations

Things to watch out for

V. ABOUT DOCUMENTATION

Data last updated*
2023-03-05, Gerlof Bouma

Which changes have been made, compared to the previous version*
Minor format changes

Access to previous versions
Work in progress

This document created*
2021-05-20, Tosin Adewumi

This document last updated*
2023-03-05, Gerlof Bouma

Where to look for further details
[1],[2]

Documentation template version*
v1.1

VI. OTHER

Related projects

References
[1] Adewumi, T. P., Liwicki, F., &amp; Liwicki, M. (2020). Corpora compared: The case of the Swedish Gigaword &amp; Wikipedia corpora. arXiv preprint arXiv:2011.03281.

[2] Adewumi, T. P., Liwicki, F., &amp; Liwicki, M. (2020). Exploring Swedish &amp; English fastText Embeddings with the Transformer. arXiv preprint arXiv:2007.16007.</abstract>
      <sumDscr />
    </stdyInfo>
    <method>
      <dataColl />
    </method>
    <dataAccs>
      <useStmt>
        <restrctn xml:lang="en">Access to data through an external actor. </restrctn>
        <restrctn xml:lang="sv">Åtkomst till data via extern aktör. </restrctn>
      </useStmt>
    </dataAccs>
    <othrStdyMat />
  </stdyDscr>
</codeBook>