<ddi:DDIInstance xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:instance:3_3 http://ddialliance.org/Specification/DDI-Lifecycle/3.3/XMLSchema/instance.xsd" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:ddi="ddi:instance:3_3" xmlns:r="ddi:reusable:3_3" xmlns:s="ddi:studyunit:3_3" xmlns:d="ddi:datacollection:3_3" xmlns:a="ddi:archive:3_3" xmlns:c="ddi:conceptualcomponent:3_3" xmlns:cm="ddi:comparative:3_3" xmlns:g="ddi:group:3_3" xmlns:l="ddi:logicalproduct:3_3" xmlns:p="ddi:physicaldataproduct:3_3" xmlns:pi="ddi:physicalinstance:3_3" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:xml="http://www.w3.org/XML/1998/namespace" isMaintainable="true" scopeOfUniqueness="Agency">
  <r:URN>urn:ddi:se.researchdata:doi-10-23695-gp75-6148:0</r:URN>
  <r:Agency>SND</r:Agency>
  <r:ID>doi-10-23695-gp75-6148</r:ID>
  <r:Version>0</r:Version>
  <g:ResourcePackage>
    <r:URN>urn:ddi:se.researchdata:doi-10-23695-gp75-6148.ResourcePackage:2.0</r:URN>
    <r:OtherMaterialScheme>
      <r:URN>urn:ddi:se.researchdata:doi-10-23695-gp75-6148.OtherMaterialScheme:2.0</r:URN>
    </r:OtherMaterialScheme>
    <a:OrganizationScheme>
      <r:URN>urn:ddi:se.researchdata:doi-10-23695-gp75-6148.OrganizationScheme-0:2.0</r:URN>
      <a:Organization>
        <r:URN>urn:ddi:se.researchdata:doi-10-23695-gp75-6148.Organization-0:2.0</r:URN>
        <a:OrganizationIdentification>
          <a:OrganizationName>
            <r:String xml:lang="en">Språkbanken Text</r:String>
          </a:OrganizationName>
        </a:OrganizationIdentification>
      </a:Organization>
    </a:OrganizationScheme>
  </g:ResourcePackage>
  <s:StudyUnit>
    <r:URN>urn:ddi:se.researchdata:doi-10-23695-gp75-6148.StudyUnit:2.0</r:URN>
    <r:UserID typeOfUserID="datasetIdentifier">doi-10-23695-gp75-6148</r:UserID>
    <r:Citation>
      <r:Title>
        <r:String xml:lang="sv">NyLLex v2</r:String>
        <r:String xml:lang="en">NyLLex v2</r:String>
      </r:Title>
      <r:Creator>
        <r:CreatorReference>
          <r:URN>urn:ddi:se.researchdata:doi-10-23695-gp75-6148.Individual-0:2.0</r:URN>
          <r:TypeOfObject>Individual</r:TypeOfObject>
        </r:CreatorReference>
      </r:Creator>
      <r:Publisher>
        <r:PublisherName>
          <r:String xml:lang="sv">Göteborgs universitet</r:String>
          <r:String xml:lang="en">University of Gothenburg</r:String>
        </r:PublisherName>
      </r:Publisher>
      <r:PublicationDate>
        <r:SimpleDate>2024-01-01</r:SimpleDate>
      </r:PublicationDate>
      <r:InternationalIdentifier>
        <r:IdentifierContent>10.23695/GP75-6148</r:IdentifierContent>
        <r:ManagingAgency controlledVocabularyAgencyName="DOI">DOI</r:ManagingAgency>
      </r:InternationalIdentifier>
    </r:Citation>
    <r:Abstract>
      <r:Content xml:lang="sv">I. IDENTIFYING INFORMATION

Title*
NyLLex v 2.0

Subtitle

      A Novel Resource of Swedish Words Annotated with Reading Proficiency Level

Created by*

      Daniel Holmer (daniel.holmer@liu.se), Evelina Rennes
      (evelina.rennes@liu.se)

License(s)*
CC BY 4.0

Abstract*

      NyLLex is a lexical resource derived from books published by Sweden's
      largest publisher of easy language texts. The entries are annotated with
      frequency counts distributed over six reading proficiency levels.

Funded by*
Vetenskapsrådet (2020-03580)

Cite as
[1]

Related datasets
[2], [3]

II. USAGE

Key applications
Text complexity analysis

Intended task(s)/usage(s)

      (1) Lexical analysis of easy language texts. (2) Lexical simplification

Recommended evaluation measures
-

Dataset function(s)
-

Recommended split(s)
-

III. DATA

Primary data*
Words (text)

Language*
Swedish

Dataset in numbers*
14983 entries

Nature of the content*

      Each entry in the resource contains a word, its part-of-speech tag
      (SUC-style), and a number of frequencies over different readability
      levels. Multi-word expressions are denoted by multiple words linked by
      underscores.

Format*
Comma-separated values (CSV) with the following columns:

word: a word in its lemma form

POS: a part-of-speech tag in the SUC-format

      level1_freq - level6_freq (six headers): the dispersed frequency of the
      word in the given reading proficiency level

      total_freq: the adjusted frequency for the word across all reading
      proficiency levels

      n_level1 - n_level6 (six headers): raw frequency of the word in the given
      reading proficiency level

      n_total: raw frequency for the word across all reading proficiency levels

Data source(s)*

      The words are collected from 247 easy language books published by
      NyponVilja förlag. The books were OCR-scanned from PDF format and
      preprocessed by the authors. Unfortunately, the book dataset is not
      publicly available for copyright reasons.

Data collection method(s)*
See [1]

Data selection and filtering*
See [1]

Data preprocessing*
See [1]

Data labeling*
See "Format"

Annotator characteristics
-

IV. ETHICS AND CAVEATS

Ethical considerations

      The books contain words that, when taken out of context, can be seen as
      offensive. The authors have manually removed such entries but cannot
      guarantee that the resource is completely devoid of offensive words.

Things to watch out for
-

V. ABOUT DOCUMENTATION

Data last updated*
20220909

Which changes have been made, compared to the previous version*

      This version contains more entries than described in the original paper,
      for two reasons: 1) An increased number of books available as source
      material (from 247 to 280). 2) An updated method for filtering out bad
      entries caused by erroneous OCR readings of the source PDFs. In
      practice, the number of entries (unique words) in the resource is
      significantly larger (more than double) in this version, since entries
      that appear only once in the source material are no longer discarded.
      However, the total frequency counts for all entries differ by only
      around 2% between this updated version and the paper version.

Access to previous versions
-

This document created*
20221219, Daniel Holmer (daniel.holmer@liu.se)

This document last updated*
20230608, Aleksandrs Berdicevskis (aleksandrs.berdicevskis@gu.se)

Where to look for further details
See [1] and https://gitlab.liu.se/danho69/nyllex/

VI. OTHER

Related projects

References

      [1] Daniel Holmer and Evelina Rennes. 2022. NyLLex: A Novel Resource of
      Swedish Words Annotated with Reading Proficiency Level. In Proceedings of
      the Thirteenth Language Resources and Evaluation Conference, pages
      1326–1331, Marseille, France. European Language Resources Association.
      https://aclanthology.org/2022.lrec-1.141.pdf
      [2] Thomas François, Elena Volodina, Ildikó Pilán, and Anaïs Tack. 2016.
      SVALex: a CEFR-graded lexical resource for Swedish foreign and second
      language learners. In Proceedings of LREC 2016, Slovenia.
      [3] Elena Volodina, Ildikó Pilán, Lorena Llozhi, Baptiste Degryse, and
      Thomas François. 2016. SweLLex: Second language learners’ productive
      vocabulary. In Proceedings of the joint workshop on NLP for Computer
      Assisted Language Learning and NLP for Language Acquisition, pages 76–84,
      Umeå, Sweden. LiU Electronic Press.</r:Content>
      <r:Content xml:lang="en">I. IDENTIFYING INFORMATION

Title*
NyLLex v 2.0

Subtitle

      A Novel Resource of Swedish Words Annotated with Reading Proficiency Level

Created by*

      Daniel Holmer (daniel.holmer@liu.se), Evelina Rennes
      (evelina.rennes@liu.se)

License(s)*
CC BY 4.0

Abstract*

      NyLLex is a lexical resource derived from books published by Sweden's
      largest publisher of easy language texts. The entries are annotated with
      frequency counts distributed over six reading proficiency levels.

Funded by*
Vetenskapsrådet (2020-03580)

Cite as
[1]

Related datasets
[2], [3]

II. USAGE

Key applications
Text complexity analysis

Intended task(s)/usage(s)

      (1) Lexical analysis of easy language texts. (2) Lexical simplification

Recommended evaluation measures
-

Dataset function(s)
-

Recommended split(s)
-

III. DATA

Primary data*
Words (text)

Language*
Swedish

Dataset in numbers*
14983 entries

Nature of the content*

      Each entry in the resource contains a word, its part-of-speech tag
      (SUC-style), and a number of frequencies over different readability
      levels. Multi-word expressions are denoted by multiple words linked by
      underscores.

Format*
Comma-separated values (CSV) with the following columns:

word: a word in its lemma form

POS: a part-of-speech tag in the SUC-format

      level1_freq - level6_freq (six headers): the dispersed frequency of the
      word in the given reading proficiency level

      total_freq: the adjusted frequency for the word across all reading
      proficiency levels

      n_level1 - n_level6 (six headers): raw frequency of the word in the given
      reading proficiency level

      n_total: raw frequency for the word across all reading proficiency levels
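
The column layout listed above can be read with Python's standard csv
module. The following is a minimal sketch, not the authors' own tooling: the
sample rows are hypothetical values invented for illustration (they are not
taken from NyLLex), and the helper names read_entries, first_level and
is_multiword are assumptions introduced here.

```python
import csv
import io

# Column names as listed in the Format section above.
LEVELS = range(1, 7)
FREQ_COLUMNS = [f"level{i}_freq" for i in LEVELS]
RAW_COLUMNS = [f"n_level{i}" for i in LEVELS]
HEADER = ["word", "POS", *FREQ_COLUMNS, "total_freq", *RAW_COLUMNS, "n_total"]

# Hypothetical rows for illustration only; these values are NOT from NyLLex.
SAMPLE_ROWS = [
    ["hund", "NN", 1.5, 2.0, 0.5, 0.0, 0.0, 0.0, 4.0, 3, 4, 1, 0, 0, 0, 8],
    ["i_dag", "AB", 0.0, 0.0, 1.0, 2.0, 0.0, 0.0, 3.0, 0, 0, 2, 4, 0, 0, 6],
]


def build_sample_csv():
    """Return the hypothetical sample as CSV text with the documented header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(HEADER)
    writer.writerows(SAMPLE_ROWS)
    return buffer.getvalue()


def read_entries(csv_text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def first_level(entry):
    """Return the lowest level number with a nonzero raw count, or None."""
    for level, column in zip(LEVELS, RAW_COLUMNS):
        if int(entry[column]) != 0:
            return level
    return None


def is_multiword(entry):
    """Multi-word expressions are written with underscores between words."""
    return "_" in entry["word"]


entries = read_entries(build_sample_csv())
for entry in entries:
    print(entry["word"], entry["POS"], first_level(entry), is_multiword(entry))
```

To use the sketch on the actual resource, the in-memory sample would be
replaced by the text of the distributed CSV file; whether that file uses
exactly these header spellings should be checked against the file itself.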

Data source(s)*

      The words are collected from 247 easy language books published by
      NyponVilja förlag. The books were OCR-scanned from PDF format and
      preprocessed by the authors. Unfortunately, the book dataset is not
      publicly available for copyright reasons.

Data collection method(s)*
See [1]

Data selection and filtering*
See [1]

Data preprocessing*
See [1]

Data labeling*
See "Format"

Annotator characteristics
-

IV. ETHICS AND CAVEATS

Ethical considerations

      The books contain words that, when taken out of context, can be seen as
      offensive. The authors have manually removed such entries but cannot
      guarantee that the resource is completely devoid of offensive words.

Things to watch out for
-

V. ABOUT DOCUMENTATION

Data last updated*
20220909

Which changes have been made, compared to the previous version*

      This version contains more entries than described in the original paper,
      for two reasons: 1) An increased number of books available as source
      material (from 247 to 280). 2) An updated method for filtering out bad
      entries caused by erroneous OCR readings of the source PDFs. In
      practice, the number of entries (unique words) in the resource is
      significantly larger (more than double) in this version, since entries
      that appear only once in the source material are no longer discarded.
      However, the total frequency counts for all entries differ by only
      around 2% between this updated version and the paper version.

Access to previous versions
-

This document created*
20221219, Daniel Holmer (daniel.holmer@liu.se)

This document last updated*
20230608, Aleksandrs Berdicevskis (aleksandrs.berdicevskis@gu.se)

Where to look for further details
See [1] and https://gitlab.liu.se/danho69/nyllex/

VI. OTHER

Related projects

References

      [1] Daniel Holmer and Evelina Rennes. 2022. NyLLex: A Novel Resource of
      Swedish Words Annotated with Reading Proficiency Level. In Proceedings of
      the Thirteenth Language Resources and Evaluation Conference, pages
      1326–1331, Marseille, France. European Language Resources Association.
      https://aclanthology.org/2022.lrec-1.141.pdf
      [2] Thomas François, Elena Volodina, Ildikó Pilán, and Anaïs Tack. 2016.
      SVALex: a CEFR-graded lexical resource for Swedish foreign and second
      language learners. In Proceedings of LREC 2016, Slovenia.
      [3] Elena Volodina, Ildikó Pilán, Lorena Llozhi, Baptiste Degryse, and
      Thomas François. 2016. SweLLex: Second language learners’ productive
      vocabulary. In Proceedings of the joint workshop on NLP for Computer
      Assisted Language Learning and NLP for Language Acquisition, pages 76–84,
      Umeå, Sweden. LiU Electronic Press.</r:Content>
    </r:Abstract>
    <r:Coverage>
      <r:TopicalCoverage>
        <r:URN>urn:ddi:se.researchdata:doi-10-23695-gp75-6148.TopicalCoverage:2.0</r:URN>
        <r:Subject xml:lang="en" controlledVocabularyID="10208" controlledVocabularyName="Standard för svensk indelning av forskningsämnen 2025">Natural Language Processing</r:Subject>
        <r:Subject xml:lang="sv" controlledVocabularyID="10208" controlledVocabularyName="Standard för svensk indelning av forskningsämnen 2025">Språkbehandling och datorlingvistik</r:Subject>
      </r:TopicalCoverage>
      <r:SpatialCoverage />
    </r:Coverage>
    <a:Archive>
      <r:URN>urn:ddi:se.researchdata:doi-10-23695-gp75-6148.Archive:2.0</r:URN>
      <a:ArchiveSpecific>
        <a:Item>
          <a:Access>
            <r:URN>urn:ddi:se.researchdata:doi-10-23695-gp75-6148.Archive-ArchiveSpecificType-AccessType:2.0</r:URN>
            <a:TypeOfAccess controlledVocabularyName="info:eu-repo-Access-Terms vocabulary"></a:TypeOfAccess>
          </a:Access>
          <a:DataFileQuantity>0</a:DataFileQuantity>
        </a:Item>
      </a:ArchiveSpecific>
    </a:Archive>
  </s:StudyUnit>
</ddi:DDIInstance>