<ddi:DDIInstance xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:instance:3_3 http://ddialliance.org/Specification/DDI-Lifecycle/3.3/XMLSchema/instance.xsd" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:ddi="ddi:instance:3_3" xmlns:r="ddi:reusable:3_3" xmlns:s="ddi:studyunit:3_3" xmlns:d="ddi:datacollection:3_3" xmlns:a="ddi:archive:3_3" xmlns:c="ddi:conceptualcomponent:3_3" xmlns:cm="ddi:comparative:3_3" xmlns:g="ddi:group:3_3" xmlns:l="ddi:logicalproduct:3_3" xmlns:p="ddi:physicaldataproduct:3_3" xmlns:pi="ddi:physicalinstance:3_3" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:xml="http://www.w3.org/XML/1998/namespace" isMaintainable="true" scopeOfUniqueness="Agency">
  <r:URN>urn:ddi:se.researchdata:doi-10-23695-vbqg-jr16:0</r:URN>
  <r:Agency>SND</r:Agency>
  <r:ID>doi-10-23695-vbqg-jr16</r:ID>
  <r:Version>0</r:Version>
  <g:ResourcePackage>
    <r:URN>urn:ddi:se.researchdata:doi-10-23695-vbqg-jr16.ResourcePackage:2.0</r:URN>
    <r:OtherMaterialScheme>
      <r:URN>urn:ddi:se.researchdata:doi-10-23695-vbqg-jr16.OtherMaterialScheme:2.0</r:URN>
    </r:OtherMaterialScheme>
    <a:OrganizationScheme>
      <r:URN>urn:ddi:se.researchdata:doi-10-23695-vbqg-jr16.OrganizationScheme-0:2.0</r:URN>
      <a:Individual>
        <r:URN>urn:ddi:se.researchdata:doi-10-23695-vbqg-jr16.Individual-0:2.0</r:URN>
        <a:IndividualIdentification>
          <a:IndividualName>
            <a:FullName>
              <r:String>Hengchen, Simon</r:String>
            </a:FullName>
          </a:IndividualName>
        </a:IndividualIdentification>
      </a:Individual>
      <a:Individual>
        <r:URN>urn:ddi:se.researchdata:doi-10-23695-vbqg-jr16.Individual-0:2.0</r:URN>
        <a:IndividualIdentification>
          <a:IndividualName>
            <a:FullName>
              <r:String>Tahmasebi, Nina</r:String>
            </a:FullName>
          </a:IndividualName>
        </a:IndividualIdentification>
      </a:Individual>
    </a:OrganizationScheme>
  </g:ResourcePackage>
  <s:StudyUnit>
    <r:URN>urn:ddi:se.researchdata:doi-10-23695-vbqg-jr16.StudyUnit:2.0</r:URN>
    <r:UserID typeOfUserID="datasetIdentifier">doi-10-23695-vbqg-jr16</r:UserID>
    <r:Citation>
      <r:Title>
        <r:String xml:lang="sv">SuperSim (paketerat för Superlim) 2.0</r:String>
        <r:String xml:lang="en">SuperSim (repackaged for Superlim) 2.0</r:String>
      </r:Title>
      <r:Creator>
        <r:CreatorReference>
          <r:URN>urn:ddi:se.researchdata:doi-10-23695-vbqg-jr16.Individual-0:2.0</r:URN>
          <r:TypeOfObject>Individual</r:TypeOfObject>
        </r:CreatorReference>
      </r:Creator>
      <r:Publisher>
        <r:PublisherName>
          <r:String xml:lang="sv">Göteborgs universitet</r:String>
          <r:String xml:lang="en">University of Gothenburg</r:String>
        </r:PublisherName>
      </r:Publisher>
      <r:PublicationDate>
        <r:SimpleDate>2024-01-01</r:SimpleDate>
      </r:PublicationDate>
      <r:InternationalIdentifier>
        <r:IdentifierContent>10.23695/VBQG-JR16</r:IdentifierContent>
        <r:ManagingAgency controlledVocabularyAgencyName="DOI">DOI</r:ManagingAgency>
      </r:InternationalIdentifier>
    </r:Citation>
    <r:Abstract>
      <r:Content xml:lang="sv">I. IDENTIFYING INFORMATION

Title*
SuperSim (repackaged for Superlim) v1.1

Subtitle
A test set for word similarity and relatedness in Swedish

Created by*
Simon Hengchen (simon.hengchen@gu.se), Nina Tahmasebi (nina.tahmasebi@gu.se)

Publisher(s)*
Språkbanken Text

Link(s) / permanent identifier(s)*
https://spraakbanken.gu.se/en/resources/superlim

License(s)*
CC BY 4.0

Abstract*
SuperSim is a large-scale similarity and relatedness test set for Swedish built with expert human judgments. The test set is composed of 1360 word pairs independently judged for both relatedness and similarity by five annotators.

Funded by*
Swedish Research Council (grant no. 2018-01184 to Nina Tahmasebi); Språkbanken Text

Cite as
[1]

Related datasets
See https://doi.org/10.5281/zenodo.4660084 for the complete data set accompanying [1], including baseline models and corpus material. The data described in this documentation sheet is the gold data from this larger archive. This repackaging of the gold data was done in the context of the SuperLim collection. See https://spraakbanken.gu.se/en/resources/superlim

II. USAGE

Key applications
Evaluation of language models

Intended task(s)/usage(s)
(1) Predict semantic similarity of word pairs from a language model

(2) Predict semantic relatedness of word pairs from a language model

Recommended evaluation measures
Krippendorff's alpha (the official SuperLim measure), Spearman's rho

Dataset function(s)
Few-shot training ("prompting"), testing

Recommended split(s)
A few-shot training set (aka "prompt", 10%) and a test set (90%). The prompt was added with GPT-like models in mind; models that do not need a prompt can ignore it. The train/test split of the word pairs is the same for the two tasks.

III. DATA

Primary data*
Text

Language*
Swedish

Dataset in numbers*
1360 word pairs with semantic similarity and semantic relatedness scores, of which 131 are train items and 1229 are test items.

Nature of the content*
Semantic similarity refers to the extent to which two concepts share semantic properties. Synonymy is the culmination of this concept. Relatedness is a looser lexical conceptual relation that refers to the general (psychological) association that may arise, for instance, because there are causal or instrumental relations between two concepts, or because the concepts co-occur frequently. Similarity and relatedness are given as scores between 0 and 10; these scores are in turn averages of judgements on an 11-point scale (0–10).

Format*
The data is split over two files, one for each score. The files are provided both as JSONL and as tab-separated values (TSV). The TSV files contain the following 8 columns:

(1) word 1

(2) word 2

(3)–(7) individual annotator scores (integer valued)

(8) average score (real valued)

Data source(s)*
The word pairs were translated from SimLex-999 [2] and WordSim353 [3]. The complete set was manually checked and, where needed, pairs were adjusted (split into multiple pairs or removed) depending on the lexical distinctions made in Swedish. The similarity and relatedness judgements were collected from five annotators, who were paid for the assignment. One of the annotators was also involved in translating the dataset. See discussion in [1].

Data collection method(s)*
Online collection of judgements from (paid) annotators. Annotators used written instructions from SimLex-999 [2]. See discussion in [1].

Data selection and filtering*
See discussion in [1]

Data preprocessing*
See discussion in [1]

Data labeling*
Both the similarity and relatedness scores are manual (gold standard).

Annotator characteristics
All annotators are native speakers of Swedish who hold degrees in linguistics. Two have prior lexicographic experience. See [1] for more details.

IV. ETHICS AND CAVEATS

Ethical considerations
None to report.

Things to watch out for
The word pairs are presented out of context. Superlim presently does not prescribe a methodology for the application of contextual (dynamic) language models to this data, which means we can expect considerable variation in how the test data is used. For reasons of comparability and reproducibility, users must make sure to report their chosen method clearly. See also the remarks in the FAQ on https://spraakbanken.gu.se/resurser/superlim

V. ABOUT DOCUMENTATION

Data last updated*
20220920 (v1.1), Aleksandrs Berdicevskis

Which changes have been made, compared to the previous version*
Minor format changes

Access to previous versions
Work in progress

This document created*
20210611, Gerlof Bouma (gerlof.bouma@gu.se)

This document last updated*
20230203, Aleksandrs Berdicevskis

Where to look for further details
The attached readme file

Documentation template version*
v1.1

VI. OTHER

Related projects
SimLex-999 [2]; WordSim353 [3]

References
[1] Hengchen and Tahmasebi (2021): SuperSim: a test set for word similarity and relatedness in Swedish. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa). https://ep.liu.se/ecp/178/027/ecp2021178027.pdf

[2] Hill, Reichart and Korhonen (2015): SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4): 665–695. https://doi.org/10.1162/COLI_a_00237

[3] Finkelstein, Gabrilovich, Matias, Rivlin, Solan, Wolfman and Ruppin (2002): Placing Search in Context: The Concept Revisited. ACM Transactions on Information Systems, 20(1): 116–131. https://doi.org/10.1145/503104.503110</r:Content>
      <r:Content xml:lang="en">I. IDENTIFYING INFORMATION

Title*
SuperSim (repackaged for Superlim) v1.1

Subtitle
A test set for word similarity and relatedness in Swedish

Created by*
Simon Hengchen (simon.hengchen@gu.se), Nina Tahmasebi (nina.tahmasebi@gu.se)

Publisher(s)*
Språkbanken Text

Link(s) / permanent identifier(s)*
https://spraakbanken.gu.se/en/resources/superlim

License(s)*
CC BY 4.0

Abstract*
SuperSim is a large-scale similarity and relatedness test set for Swedish built with expert human judgments. The test set is composed of 1360 word pairs independently judged for both relatedness and similarity by five annotators.

Funded by*
Swedish Research Council (grant no. 2018-01184 to Nina Tahmasebi); Språkbanken Text

Cite as
[1]

Related datasets
See https://doi.org/10.5281/zenodo.4660084 for the complete data set accompanying [1], including baseline models and corpus material. The data described in this documentation sheet is the gold data from this larger archive. This repackaging of the gold data was done in the context of the SuperLim collection. See https://spraakbanken.gu.se/en/resources/superlim

II. USAGE

Key applications
Evaluation of language models

Intended task(s)/usage(s)
(1) Predict semantic similarity of word pairs from a language model

(2) Predict semantic relatedness of word pairs from a language model

Recommended evaluation measures
Krippendorff's alpha (the official SuperLim measure), Spearman's rho

Dataset function(s)
Few-shot training ("prompting"), testing

Recommended split(s)
A few-shot training set (aka "prompt", 10%) and a test set (90%). The prompt was added with GPT-like models in mind; models that do not need a prompt can ignore it. The train/test split of the word pairs is the same for the two tasks.

III. DATA

Primary data*
Text

Language*
Swedish

Dataset in numbers*
1360 word pairs with semantic similarity and semantic relatedness scores, of which 131 are train items and 1229 are test items.

Nature of the content*
Semantic similarity refers to the extent to which two concepts share semantic properties. Synonymy is the culmination of this concept. Relatedness is a looser lexical conceptual relation that refers to the general (psychological) association that may arise, for instance, because there are causal or instrumental relations between two concepts, or because the concepts co-occur frequently. Similarity and relatedness are given as scores between 0 and 10; these scores are in turn averages of judgements on an 11-point scale (0–10).

Format*
The data is split over two files, one for each score. The files are provided both as JSONL and as tab-separated values (TSV). The TSV files contain the following 8 columns:

(1) word 1

(2) word 2

(3)–(7) individual annotator scores (integer valued)

(8) average score (real valued)
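
The column layout above could be read along the following lines. This is a minimal Python sketch, not part of the official Superlim tooling: the function name parse_row and the sample row are hypothetical, and the sketch assumes that rows carry no header line and that the average is the arithmetic mean of the five annotator scores.

```python
import math


def parse_row(line):
    """Split one tab-separated row into words, annotator scores and the average."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError("expected 8 columns, got %d" % len(fields))
    word1, word2 = fields[0], fields[1]
    scores = [int(value) for value in fields[2:7]]  # columns 3-7, integer valued
    average = float(fields[7])                      # column 8, real valued
    return word1, word2, scores, average


# Hypothetical row, invented for illustration (not taken from the data set).
sample = "katt\thund\t3\t4\t2\t3\t3\t3.0"
word1, word2, scores, average = parse_row(sample)

# Assumption: the reported average is the mean of the five annotator scores.
consistent = math.isclose(sum(scores) / len(scores), average, abs_tol=0.01)
print(word1, word2, scores, average, consistent)
```

If the files do start with a header line, it would need to be skipped before calling parse_row.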

Data source(s)*
The word pairs were translated from SimLex-999 [2] and WordSim353 [3]. The complete set was manually checked and, where needed, pairs were adjusted (split into multiple pairs or removed) depending on the lexical distinctions made in Swedish. The similarity and relatedness judgements were collected from five annotators, who were paid for the assignment. One of the annotators was also involved in translating the dataset. See discussion in [1].

Data collection method(s)*
Online collection of judgements from (paid) annotators. Annotators used written instructions from SimLex-999 [2]. See discussion in [1].

Data selection and filtering*
See discussion in [1]

Data preprocessing*
See discussion in [1]

Data labeling*
Both the similarity and relatedness scores are manual (gold standard).

Annotator characteristics
All annotators are native speakers of Swedish who hold degrees in linguistics. Two have prior lexicographic experience. See [1] for more details.

IV. ETHICS AND CAVEATS

Ethical considerations
None to report.

Things to watch out for
The word pairs are presented out of context. Superlim presently does not prescribe a methodology for the application of contextual (dynamic) language models to this data, which means we can expect considerable variation in how the test data is used. For reasons of comparability and reproducibility, users must make sure to report their chosen method clearly. See also the remarks in the FAQ on https://spraakbanken.gu.se/resurser/superlim

V. ABOUT DOCUMENTATION

Data last updated*
20220920 (v1.1), Aleksandrs Berdicevskis

Which changes have been made, compared to the previous version*
Minor format changes

Access to previous versions
Work in progress

This document created*
20210611, Gerlof Bouma (gerlof.bouma@gu.se)

This document last updated*
20230203, Aleksandrs Berdicevskis

Where to look for further details
The attached readme file

Documentation template version*
v1.1

VI. OTHER

Related projects
SimLex-999 [2]; WordSim353 [3]

References
[1] Hengchen and Tahmasebi (2021): SuperSim: a test set for word similarity and relatedness in Swedish. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa). https://ep.liu.se/ecp/178/027/ecp2021178027.pdf

[2] Hill, Reichart and Korhonen (2015): SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4): 665–695. https://doi.org/10.1162/COLI_a_00237

[3] Finkelstein, Gabrilovich, Matias, Rivlin, Solan, Wolfman and Ruppin (2002): Placing Search in Context: The Concept Revisited. ACM Transactions on Information Systems, 20(1): 116–131. https://doi.org/10.1145/503104.503110</r:Content>
    </r:Abstract>
    <r:Coverage>
      <r:TopicalCoverage>
        <r:URN>urn:ddi:se.researchdata:doi-10-23695-vbqg-jr16.TopicalCoverage:2.0</r:URN>
        <r:Subject xml:lang="en" controlledVocabularyID="10208" controlledVocabularyName="Standard för svensk indelning av forskningsämnen 2025">Natural Language Processing</r:Subject>
        <r:Subject xml:lang="sv" controlledVocabularyID="10208" controlledVocabularyName="Standard för svensk indelning av forskningsämnen 2025">Språkbehandling och datorlingvistik</r:Subject>
      </r:TopicalCoverage>
      <r:SpatialCoverage />
    </r:Coverage>
    <a:Archive>
      <r:URN>urn:ddi:se.researchdata:doi-10-23695-vbqg-jr16.Archive:2.0</r:URN>
      <a:ArchiveSpecific>
        <a:Item>
          <a:Access>
            <r:URN>urn:ddi:se.researchdata:doi-10-23695-vbqg-jr16.Archive-ArchiveSpecificType-AccessType:2.0</r:URN>
            <a:TypeOfAccess controlledVocabularyName="info:eu-repo-Access-Terms vocabulary"></a:TypeOfAccess>
          </a:Access>
          <a:DataFileQuantity>0</a:DataFileQuantity>
        </a:Item>
      </a:ArchiveSpecific>
    </a:Archive>
  </s:StudyUnit>
</ddi:DDIInstance>