NyLLex v2

doi:10.23695/GP75-6148

https://doi.org/10.23695/GP75-6148

I. IDENTIFYING INFORMATION

Title*: NyLLex v 2.0
Subtitle: A Novel Resource of Swedish Words Annotated with Reading Proficiency Level
Created by*: Daniel Holmer (daniel.holmer@liu.se), Evelina Rennes (evelina.rennes@liu.se)
License(s)*: CC BY 4.0
Abstract*: NyLLex is a lexical resource derived from books published by Sweden's largest publisher of easy language texts. The entries are annotated with frequency counts distributed over six reading proficiency levels.
Funded by*: Vetenskapsrådet (2020-03580)
Cite as: [1]
Related datasets: [2], [3]

II. USAGE

Key applications: Text complexity analysis
Intended task(s)/usage(s): (1) Lexical analysis of easy language texts. (2) Lexical simplification.
Recommended evaluation measures: -
Dataset function(s): -
Recommended split(s): -

III. DATA

Primary data*: Words (text)
Language*: Swedish
Dataset in numbers*: 14983 entries
Nature of the content*: Each entry in the resource contains a word, its part-of-speech tag (SUC-style), and a number of frequencies over different readability levels. Multi-word expressions are denoted by multiple words linked by underscores.
Format*: Comma-separated values (CSV) with the following columns:
  word: a word in its lemma form
  POS: a part-of-speech tag in the SUC-format
  level1_freq - level6_freq (six headers): the dispersed frequency of the word in the given reading proficiency level
  total_freq: the adjusted frequency for the word across all reading proficiency levels
  n_level1 - n_level6 (six headers): raw frequency of the word in the given reading proficiency level
  n_total: raw frequency for the word across all reading proficiency levels
Data source(s)*: The words are collected from 247 easy language books published by NyponVilja förlag. The books were OCR-scanned from PDF-format and preprocessed by the authors. Unfortunately, the book dataset is not publicly available due to copyright reasons.
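As a rough illustration of the column layout described under "Format", the Python sketch below parses rows of that shape and derives two simple properties per entry: the lowest reading proficiency level at which a word is attested, and whether the entry is a multi-word expression. It is not the authors' code. It assumes that the CSV has a header row using exactly the column names listed above and that the raw counts are stored as integers. The function names are hypothetical, and the two sample rows are invented values for demonstration, not real NyLLex entries.

```python
import csv
import io

LEVELS = range(1, 7)  # six reading proficiency levels, as listed under "Format"

# Invented sample rows (NOT real NyLLex data), using the documented column names.
SAMPLE = """word,POS,level1_freq,level2_freq,level3_freq,level4_freq,level5_freq,level6_freq,total_freq,n_level1,n_level2,n_level3,n_level4,n_level5,n_level6,n_total
hund,NN,12.5,10.0,8.0,7.5,6.0,5.0,49.0,4,3,3,2,2,1,15
till_exempel,AB,0.0,0.0,2.0,3.0,4.5,5.5,15.0,0,0,1,2,3,4,10
"""


def read_entries(stream):
    """Parse NyLLex-style CSV rows into dicts with numeric frequency fields."""
    entries = []
    for row in csv.DictReader(stream):
        entry = {"word": row["word"], "POS": row["POS"]}
        for level in LEVELS:
            entry[f"level{level}_freq"] = float(row[f"level{level}_freq"])
            # Assumption: raw counts are written as plain integers.
            entry[f"n_level{level}"] = int(row[f"n_level{level}"])
        entry["total_freq"] = float(row["total_freq"])
        entry["n_total"] = int(row["n_total"])
        entries.append(entry)
    return entries


def lowest_attested_level(entry):
    """Return the lowest level with a non-zero raw count, or None if there is none."""
    for level in LEVELS:
        if entry[f"n_level{level}"] > 0:
            return level
    return None


def is_multiword(entry):
    """Multi-word expressions are written as words linked by underscores."""
    return "_" in entry["word"]


entries = read_entries(io.StringIO(SAMPLE))
for entry in entries:
    print(entry["word"], entry["POS"], lowest_attested_level(entry), is_multiword(entry))
```

To run the same functions on the downloaded resource, the `io.StringIO(SAMPLE)` stream could be replaced with a file opened via `open(path, newline="", encoding="utf-8")`, where `path` points to the local copy of the CSV; the encoding is an assumption.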
Data collection method(s)*: See [1]
Data selection and filtering*: See [1]
Data preprocessing*: See [1]
Data labeling*: See "Format"
Annotator characteristics: -

IV. ETHICS AND CAVEATS

Ethical considerations: The books contain words that, when taken out of context, can be seen as offensive. The authors have manually removed such entries, but cannot guarantee that the resource is completely devoid of offensive words.
Things to watch out for: -

V. ABOUT DOCUMENTATION

Data last updated*: 20220909
Which changes have been made, compared to the previous version*: This version contains more entries than described in the original paper, for two reasons: (1) an increased number of books available as source material (from 247 to 280); (2) an updated method for filtering out bad entries caused by erroneous OCR readings of the source PDFs. In practice, this means that the number of entries (unique words) in the resource is significantly larger in this version (more than double), since entries that appear only once in the source material are no longer discarded. However, for the total frequency counts across all entries, the difference between this updated version and the paper version is only around 2%.
Access to previous versions: -
This document created*: 20221219, Daniel Holmer (daniel.holmer@liu.se)
This document last updated*: 20230608, Aleksandrs Berdicevskis (aleksandrs.berdicevskis@gu.se)
Where to look for further details: See [1] and https://gitlab.liu.se/danho69/nyllex

VI. OTHER

Related projects:
References:
[1] Daniel Holmer and Evelina Rennes. 2022. NyLLex: A Novel Resource of Swedish Words Annotated with Reading Proficiency Level. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1326–1331, Marseille, France. European Language Resources Association. https://aclanthology.org/2022.lrec-1.141.pdf
[2] Thomas François, Elena Volodina, Ildikó Pilán, and Anaïs Tack. 2016. SVALex: a CEFR-graded lexical resource for Swedish foreign and second language learners. In Proceedings of LREC 2016, Slovenia.
[3] Elena Volodina, Ildikó Pilán, Lorena Llozhi, Baptiste Degryse, and Thomas François. 2016. SweLLex: Second language learners' productive vocabulary. In Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition, pages 76–84, Umeå, Sweden. LiU Electronic Press.
