NyLLex v2
https://doi.org/10.23695/GP75-6148
I. IDENTIFYING INFORMATION
Title*
NyLLex v 2.0
Subtitle
A Novel Resource of Swedish Words Annotated with Reading Proficiency Level
Created by*
Daniel Holmer (daniel.holmer@liu.se), Evelina Rennes
(evelina.rennes@liu.se)
License(s)*
CC BY 4.0
Abstract*
NyLLex is a lexical resource derived from books published by Sweden's
largest publisher of easy language texts. The entries are annotated with
frequency counts distributed over six reading proficiency levels.
Funded by*
Vetenskapsrådet (2020-03580)
Cite as
[1]
Related datasets
[2], [3]
II. USAGE
Key applications
Text complexity analysis
Intended task(s)/usage(s)
(1) Lexical analysis of easy language texts. (2) Lexical simplification
Recommended evaluation measures
-
Dataset function(s)
-
Recommended split(s)
-
III. DATA
Primary data*
Words (text)
Language*
Swedish
Dataset in numbers*
14983 entries
Nature of the content*
Each entry in the resource contains a word, its part-of-speech tag
(SUC-style), and a number of frequencies over different readability
levels. Multi-word expressions are denoted by multiple words linked by
underscores.
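The underscore convention for multi-word expressions can be handled with a small helper. This is a minimal sketch, assuming that an underscore never occurs inside a single-word entry; the function names and the example entry "till_exempel" are illustrative and not taken from the resource.

```python
def is_multiword(word):
    """Return True if the entry looks like a multi-word expression."""
    # Assumption: underscores only ever link the parts of a multi-word entry.
    return "_" in word


def split_multiword(word):
    """Split an entry into its component words (one item for single words)."""
    return word.split("_")


# Illustrative entry only, not confirmed to be in the resource.
print(split_multiword("till_exempel"))
```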
Format*
Comma-separated values (CSV) with the following columns:
word: a word in its lemma form
POS: a part-of-speech tag in the SUC-format
level1_freq - level6_freq (six headers): the dispersed frequency of the
word in the given reading proficiency level
total_freq: the adjusted frequency for the word across all reading
proficiency levels
n_level1 - n_level6 (six headers): raw frequency of the word in the given
reading proficiency level
n_total: raw frequency for the word across all reading proficiency levels
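The column layout above could be read as follows. This is a sketch under stated assumptions: the file has a header row using exactly the column names listed above, is comma-delimited, and stores the raw counts as whole numbers. The helper names and the sample rows are hypothetical; the sample values are made up and do not come from the resource.

```python
import csv
import io

LEVELS = range(1, 7)  # the six reading proficiency levels

# Column names as listed in the Format description.
HEADER = (
    ["word", "POS"]
    + [f"level{i}_freq" for i in LEVELS]
    + ["total_freq"]
    + [f"n_level{i}" for i in LEVELS]
    + ["n_total"]
)


def load_entries(handle):
    """Read rows from an open CSV file into dicts with numeric fields."""
    entries = []
    for row in csv.DictReader(handle):
        entry = {"word": row["word"], "POS": row["POS"]}
        for level in LEVELS:
            entry[f"level{level}_freq"] = float(row[f"level{level}_freq"])
            # Assumption: raw counts are whole numbers, possibly written as "3.0".
            entry[f"n_level{level}"] = int(float(row[f"n_level{level}"]))
        entry["total_freq"] = float(row["total_freq"])
        entry["n_total"] = int(float(row["n_total"]))
        entries.append(entry)
    return entries


def first_attested_level(entry):
    """Return the lowest level with a non-zero raw count, or None."""
    for level in LEVELS:
        if entry[f"n_level{level}"] > 0:
            return level
    return None


# Made-up sample rows, for illustration only.
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow(HEADER)
writer.writerow(["hund", "NN", 1.5, 2.0, 0, 0, 0, 0, 3.5, 3, 4, 0, 0, 0, 0, 7])
writer.writerow(["i_dag", "AB", 0, 0, 1.0, 0, 0, 0, 1.0, 0, 0, 2, 0, 0, 0, 2])
sample.seek(0)

for entry in load_entries(sample):
    print(entry["word"], entry["POS"], first_attested_level(entry))
```

To read the actual file, pass an open handle instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded CSV.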
Data source(s)*
The words are collected from 247 easy language books published by
NyponVilja förlag. The books were OCR-scanned from PDF-format and
preprocessed by the authors. Unfortunately, the book dataset is not
publicly available due to copyright reasons.
Data collection method(s)*
See [1]
Data selection and filtering*
See [1]
Data preprocessing*
See [1]
Data labeling*
See "Format"
Annotator characteristics
-
IV. ETHICS AND CAVEATS
Ethical considerations
The books contain words that, when taken out of context, can be seen as
offensive. The authors have manually removed such entries but cannot
guarantee that the resource is completely devoid of offensive words.
Things to watch out for
-
V. ABOUT DOCUMENTATION
Data last updated*
20220909
Which changes have been made, compared to the previous version*
This version contains more entries than described in the original paper.
This is due to two reasons: 1) An increased number of books available for
the source material (from 247 to 280). 2) An updated method to filter out
bad entries due to erroneous OCR readings from the source PDFs. In
practice, this means that the number of entries (unique words) of the
resource is significantly larger (more than double the number of entries)
in this version, since entries that only appear once in the source
material are no longer discarded. However, for the total frequency counts
for all entries, the difference between this updated version and the paper
version is only around 2%.
Access to previous versions
-
This document created*
20221219, Daniel Holmer (daniel.holmer@liu.se)
This document last updated*
20230608, Aleksandrs Berdicevskis (aleksandrs.berdicevskis@gu.se)
Where to look for further details
See [1] and https://gitlab.liu.se/danho69/nyllex
VI. OTHER
Related projects
References
[1]. Daniel Holmer and Evelina Rennes. 2022. NyLLex: A Novel Resource of
Swedish Words Annotated with Reading Proficiency Level. In Proceedings of
the Thirteenth Language Resources and Evaluation Conference, pages
1326–1331, Marseille, France. European Language Resources Association.
https://aclanthology.org/2022.lrec-1.141.pdf
[2]. Thomas François, Elena Volodina, Ildikó Pilán, and Anaïs Tack. 2016.
SVALex: a CEFR-graded lexical resource for Swedish foreign and second
language learners. In Proceedings of LREC 2016, Slovenia.
[3]. Elena Volodina, Ildikó Pilán, Lorena Llozhi, Baptiste Degryse, and
Thomas François. 2016. SweLLex: Second language learners' productive
vocabulary. In Proceedings of the joint workshop on NLP for Computer
Assisted Language Learning and NLP for Language Acquisition, pages 76–84,
Umeå, Sweden. LiU Electronic Press.
