<codeBook xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" xmlns="ddi:codebook:2_5">
  <docDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv">DaLAJ-GED-Superlim 2.0</titl>
        <parTitl xml:lang="en">DaLAJ-GED-SuperLim 2.0</parTitl>
        <IDNo agency="SND">doi-10-23695-kxvz-tx42-0</IDNo>
        <IDNo agency="DOI">https://doi.org/10.23695/KXVZ-TX42</IDNo>
      </titlStmt>
      <prodStmt>
        <producer xml:lang="en" abbr="SND">Swedish National Data Service</producer>
        <producer xml:lang="sv" abbr="SND">Svensk nationell datatjänst</producer>
      </prodStmt>
      <holdings URI="https://doi.org/10.23695/KXVZ-TX42">Landing page</holdings>
    </citation>
  </docDscr>
  <stdyDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv">DaLAJ-GED-Superlim 2.0</titl>
        <parTitl xml:lang="en">DaLAJ-GED-SuperLim 2.0</parTitl>
        <IDNo agency="SND">doi-10-23695-kxvz-tx42-0</IDNo>
        <IDNo agency="DOI">https://doi.org/10.23695/KXVZ-TX42</IDNo>
      </titlStmt>
      <rspStmt />
      <prodStmt />
      <distStmt>
        <distrbtr xml:lang="en" abbr="SND" URI="https://snd.se">Swedish National Data Service</distrbtr>
        <distrbtr xml:lang="sv" abbr="SND" URI="https://snd.se">Svensk nationell datatjänst</distrbtr>
        <distDate xml:lang="en" date="2025-01-01" />
      </distStmt>
      <verStmt>
        <version elementVersion="0" elementVersionDate="2025-01-01" />
      </verStmt>
      <holdings URI="https://doi.org/10.23695/KXVZ-TX42">Landing page</holdings>
    </citation>
    <stdyInfo>
      <subject />
      <abstract xml:lang="en" contentType="abstract">I. IDENTIFYING INFORMATION

Title*
Dalaj-ged-superlim v2.0

Subtitle

Created by*
Elena Volodina, Yousuf Ali Mohammed, Språkbanken Text -- University of Gothenburg (name.surname@svenska.gu.se)

Publisher(s)*
Språkbanken Text -- University of Gothenburg

Link(s) / permanent identifier(s)*
https://spraakbanken.gu.se/resurser/superlim

License(s)*
CC-BY 4.0

Abstract*
Dalaj v2.0 is an extension of Dalaj v1 [4], covering 30 error categories used in the SweLL-gold corpus [3]. Dalaj v2 is prepared for several sentence-level classification tasks, including linguistic acceptability (whether a sentence is grammatically correct or incorrect). The dataset contains ca. 20K sentences written by non-native learners of Swedish, manipulated to contain one error per sentence and repeated for every new error. Each learner-written sentence is associated with the writer's proficiency level, as assessed at the essay level, and with the writer's mother tongue. Each incorrect (learner-written) sentence is paired with its manually corrected version; because of the one-error-per-sentence principle, this amounts to ca. 6.5K unique correct sentences, each repeated multiple times. To balance the incorrect sentences with an equivalent number of correct ones, sentences have been extracted from the coursebook corpus COCTAILL [2], keeping the same distribution across beginner, intermediate and advanced levels as among the incorrect sentences. Each COCTAILL sentence carries information about the (approximate) level of the coursebook in which the text is used for teaching. The dataset is split into training, validation and test sets. The test split has been manually proofread. Note that Dalaj-ged-superlim may differ from other versions of Dalaj v2; see Section III for a description of the changes.

Funded by*
Vinnova (grants no. 2020-02523, 2021-04165), Språkbanken Text

Cite as
Currently: [1]

Related datasets
Dalaj v1. Part of the SuperLim collection

II. USAGE

Key applications

Intended task(s)/usage(s)
1. Determine whether a sentence is grammatically correct (the official SuperLim task)

2. Find a text span in need of correction, if there is one

3. Determine the error type

4. Find a text span in need of correction, if there is one, and suggest a correction.

Recommended evaluation measures
Krippendorff's Alpha (the official SuperLim measure), F0.5, accuracy
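
As a rough illustration of the F0.5 measure listed here, the sketch below computes F-beta from raw counts. The function name and the choice of which label counts as the positive class are assumptions made for illustration; this is not the official SuperLim evaluation code.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from counts; a beta below 1 weights precision above recall."""
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With beta=0.5, a false alarm on a correct sentence costs more than a missed error, which is a common choice when flagging learner errors.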

Dataset function(s)
Training, testing

Recommended split(s)
Train, dev, test (provided): 80:10:10. The test set has been manually proofread; train and dev have not.

III. DATA

Primary data*
Text

Language*
Swedish

Dataset in numbers*
train: 35,581 sentences

dev: 4,702 sentences

test: 4,371 sentences

Nature of the content*
Sentences written by second language learners and corrected by experts + correct sentences from course books

Format*
JSONL file with one item per line. Each item contains the following objects: sentence, label (correct or incorrect) and metadata. Metadata include: error span (start and stop; numbering starts at 0; the range is half-open; start=stop denotes an empty span, i.e. a token has been omitted; empty if the sentence is correct); confusion pair (incorrect span and correction; empty if the sentence is correct); error label (empty if the sentence is correct); education level; l1 (native language); data source.
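
To make the half-open span convention concrete, here is a minimal sketch. The key names ("sentence", "start", "stop"), the example sentence and the assumption that indices are character offsets are all hypothetical; check them against the actual files before relying on this.

```python
import json

def error_span_text(sentence, start, stop):
    """Slice a half-open [start, stop) span; start == stop gives ''."""
    return sentence[start:stop]

# Hypothetical item: key names and sentence are illustrative only.
raw = '{"sentence": "Jag bor i en stor hus.", "label": "incorrect", "start": 10, "stop": 12}'
item = json.loads(raw)
flagged = error_span_text(item["sentence"], item["start"], item["stop"])
# flagged == "en"
```

An empty span (start equal to stop) yields an empty string, which matches the description of an omitted token.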

Data source(s)*
SweLL-gold essays

Data selection and filtering*
All SweLL-gold sentences are used, except those containing "unintelligible" markup. Sentences with "consequence" (C) labels were partly deleted and partly converted to descriptive error labels. When preparing the Superlim version, further filtering was applied: all sentences containing * (the origin of this token was unclear), @ at the beginning of a sentence (denotes an omitted token) or $ (unintelligible symbol) were deleted. If @ occurred elsewhere than at the beginning of a sentence, the symbol itself was removed but the sentence was preserved. The annotation was adjusted accordingly.

Data preprocessing*
Sentence order has been randomized, so that full essays cannot be restored. Learner metadata was dropped (except mother tongue and proficiency level). Essay metadata was dropped. In Dalaj2-ged, all punctuation marks had spaces added both before and after; these extra spaces are removed in the Superlim version ("detokenization").

Data labeling*
Acceptability judgment; error identification; error correction; error tags (30 detailed categories), manually assigned

Annotator characteristics
second language experts / linguists

IV. ETHICS AND CAVEATS

Ethical considerations
The SweLL-gold corpus is under GDPR restrictions. Randomized sentences without metadata remove the risk of re-identification and therefore allow the data to be freely shared.

Things to watch out for

V. ABOUT DOCUMENTATION

Data last updated*
20230122

Which changes have been made, compared to the previous version*
Extensive changes, see I and III.

Access to previous versions
NA

This document created*
20230123, Elena Volodina

This document last updated*
20230208, Aleksandrs Berdicevskis

Where to look for further details
forthcoming

Documentation template version*
v1.1

VI. OTHER

Related projects
SweLL

References
[1] Julia Klezl, Yousuf Ali Mohammed and Elena Volodina (2022). Exploring Linguistic Acceptability in Swedish Learners’ Language. Proceedings of the 11th Workshop on Natural Language Processing for Computer-Assisted Language Learning (NLP4CALL 2022), Belgium. NEALT Proceedings Series 47.

[2] Elena Volodina, Ildikó Pilán, Stian Rødven Eide and Hannes Heidarsson (2014). You get what you annotate: a pedagogically annotated corpus of coursebooks for Swedish as a Second Language. Proceedings of the third workshop on NLP for computer-assisted language learning. NEALT Proceedings Series 22 / Linköping Electronic Conference Proceedings 107: 128–144.

[3] Elena Volodina, Lena Granstedt, Arild Matsson, Beáta Megyesi, Ildikó Pilán, Julia Prentice, Dan Rosén, Lisa Rudebeck, Carl-Johan Schenström, Gunlög Sundberg and Mats Wirén (2019). The SweLL Language Learner Corpus: From Design to Annotation. Northern European Journal of Language Technology, Special Issue.

[4] Elena Volodina, Yousuf Ali Mohammed and Julia Klezl (2021). DaLAJ - a dataset for linguistic acceptability judgments for Swedish. Proceedings of the 10th NLP4CALL workshop. Linköping University Electronic Press, Vol. 177:3. An extended version is available on arXiv.</abstract>
      <abstract xml:lang="sv" contentType="abstract">I. IDENTIFYING INFORMATION

Title*
Dalaj-ged-superlim v2.0

Subtitle

Created by*
Elena Volodina, Yousuf Ali Mohammed, Språkbanken Text -- University of Gothenburg (name.surname@svenska.gu.se)

Publisher(s)*
Språkbanken Text -- University of Gothenburg

Link(s) / permanent identifier(s)*
https://spraakbanken.gu.se/resurser/superlim

License(s)*
CC-BY 4.0

Abstract*
Dalaj v2.0 is an extension of Dalaj v1 [4], covering 30 error categories used in the SweLL-gold corpus [3]. Dalaj v2 is prepared for several sentence-level classification tasks, including linguistic acceptability (whether a sentence is grammatically correct or incorrect). The dataset contains ca. 20K sentences written by non-native learners of Swedish, manipulated to contain one error per sentence and repeated for every new error. Each learner-written sentence is associated with the writer's proficiency level, as assessed at the essay level, and with the writer's mother tongue. Each incorrect (learner-written) sentence is paired with its manually corrected version; because of the one-error-per-sentence principle, this amounts to ca. 6.5K unique correct sentences, each repeated multiple times. To balance the incorrect sentences with an equivalent number of correct ones, sentences have been extracted from the coursebook corpus COCTAILL [2], keeping the same distribution across beginner, intermediate and advanced levels as among the incorrect sentences. Each COCTAILL sentence carries information about the (approximate) level of the coursebook in which the text is used for teaching. The dataset is split into training, validation and test sets. The test split has been manually proofread. Note that Dalaj-ged-superlim may differ from other versions of Dalaj v2; see Section III for a description of the changes.

Funded by*
Vinnova (grants no. 2020-02523, 2021-04165), Språkbanken Text

Cite as
Currently: [1]

Related datasets
Dalaj v1. Part of the SuperLim collection

II. USAGE

Key applications

Intended task(s)/usage(s)
1. Determine whether a sentence is grammatically correct (the official SuperLim task)

2. Find a text span in need of correction, if there is one

3. Determine the error type

4. Find a text span in need of correction, if there is one, and suggest a correction.

Recommended evaluation measures
Krippendorff's Alpha (the official SuperLim measure), F0.5, accuracy

Dataset function(s)
Training, testing

Recommended split(s)
Train, dev, test (provided): 80:10:10. The test set has been manually proofread; train and dev have not.

III. DATA

Primary data*
Text

Language*
Swedish

Dataset in numbers*
train: 35,581 sentences

dev: 4,702 sentences

test: 4,371 sentences

Nature of the content*
Sentences written by second language learners and corrected by experts + correct sentences from course books

Format*
JSONL file with one item per line. Each item contains the following objects: sentence, label (correct or incorrect) and metadata. Metadata include: error span (start and stop; numbering starts at 0; the range is half-open; start=stop denotes an empty span, i.e. a token has been omitted; empty if the sentence is correct); confusion pair (incorrect span and correction; empty if the sentence is correct); error label (empty if the sentence is correct); education level; l1 (native language); data source.

Data source(s)*
SweLL-gold essays

Data selection and filtering*
All SweLL-gold sentences are used, except those containing "unintelligible" markup. Sentences with "consequence" (C) labels were partly deleted and partly converted to descriptive error labels. When preparing the Superlim version, further filtering was applied: all sentences containing * (the origin of this token was unclear), @ at the beginning of a sentence (denotes an omitted token) or $ (unintelligible symbol) were deleted. If @ occurred elsewhere than at the beginning of a sentence, the symbol itself was removed but the sentence was preserved. The annotation was adjusted accordingly.

Data preprocessing*
Sentence order has been randomized, so that full essays cannot be restored. Learner metadata was dropped (except mother tongue and proficiency level). Essay metadata was dropped. In Dalaj2-ged, all punctuation marks had spaces added both before and after; these extra spaces are removed in the Superlim version ("detokenization").

Data labeling*
Acceptability judgment; error identification; error correction; error tags (30 detailed categories), manually assigned

Annotator characteristics
second language experts / linguists

IV. ETHICS AND CAVEATS

Ethical considerations
The SweLL-gold corpus is under GDPR restrictions. Randomized sentences without metadata remove the risk of re-identification and therefore allow the data to be freely shared.

Things to watch out for

V. ABOUT DOCUMENTATION

Data last updated*
20230122

Which changes have been made, compared to the previous version*
Extensive changes, see I and III.

Access to previous versions
NA

This document created*
20230123, Elena Volodina

This document last updated*
20230208, Aleksandrs Berdicevskis

Where to look for further details
forthcoming

Documentation template version*
v1.1

VI. OTHER

Related projects
SweLL

References
[1] Julia Klezl, Yousuf Ali Mohammed and Elena Volodina (2022). Exploring Linguistic Acceptability in Swedish Learners’ Language. Proceedings of the 11th Workshop on Natural Language Processing for Computer-Assisted Language Learning (NLP4CALL 2022), Belgium. NEALT Proceedings Series 47.

[2] Elena Volodina, Ildikó Pilán, Stian Rødven Eide and Hannes Heidarsson (2014). You get what you annotate: a pedagogically annotated corpus of coursebooks for Swedish as a Second Language. Proceedings of the third workshop on NLP for computer-assisted language learning. NEALT Proceedings Series 22 / Linköping Electronic Conference Proceedings 107: 128–144.

[3] Elena Volodina, Lena Granstedt, Arild Matsson, Beáta Megyesi, Ildikó Pilán, Julia Prentice, Dan Rosén, Lisa Rudebeck, Carl-Johan Schenström, Gunlög Sundberg and Mats Wirén (2019). The SweLL Language Learner Corpus: From Design to Annotation. Northern European Journal of Language Technology, Special Issue.

[4] Elena Volodina, Yousuf Ali Mohammed and Julia Klezl (2021). DaLAJ - a dataset for linguistic acceptability judgments for Swedish. Proceedings of the 10th NLP4CALL workshop. Linköping University Electronic Press, Vol. 177:3. An extended version is available on arXiv.</abstract>
      <sumDscr />
    </stdyInfo>
    <method>
      <dataColl />
    </method>
    <dataAccs>
      <useStmt>
        <restrctn xml:lang="en">Access to data through an external actor. </restrctn>
        <restrctn xml:lang="sv">Åtkomst till data via extern aktör. </restrctn>
      </useStmt>
    </dataAccs>
    <othrStdyMat />
  </stdyDscr>
</codeBook>