<codeBook xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" xmlns="ddi:codebook:2_5">
  <docDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv">Ordklasstaggningsmodell: Marmot</titl>
        <parTitl xml:lang="en">POS-tagging model: Marmot</parTitl>
        <IDNo agency="SND">doi-10-23695-aryw-nh78-0</IDNo>
        <IDNo agency="DOI">https://doi.org/10.23695/ARYW-NH78</IDNo>
      </titlStmt>
      <prodStmt>
        <producer xml:lang="en" abbr="SND">Swedish National Data Service</producer>
        <producer xml:lang="sv" abbr="SND">Svensk nationell datatjänst</producer>
      </prodStmt>
      <holdings URI="https://doi.org/10.23695/ARYW-NH78">Landing page</holdings>
    </citation>
  </docDscr>
  <stdyDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv">Ordklasstaggningsmodell: Marmot</titl>
        <parTitl xml:lang="en">POS-tagging model: Marmot</parTitl>
        <IDNo agency="SND">doi-10-23695-aryw-nh78-0</IDNo>
        <IDNo agency="DOI">https://doi.org/10.23695/ARYW-NH78</IDNo>
      </titlStmt>
      <rspStmt>
        <AuthEnty xml:lang="en" affiliation="">Språkbanken Text</AuthEnty>
      </rspStmt>
      <prodStmt />
      <distStmt>
        <distrbtr xml:lang="en" abbr="SND" URI="https://snd.se">Swedish National Data Service</distrbtr>
        <distrbtr xml:lang="sv" abbr="SND" URI="https://snd.se">Svensk nationell datatjänst</distrbtr>
        <distDate xml:lang="en" date="2024-01-01" />
      </distStmt>
      <verStmt>
        <version elementVersion="0" elementVersionDate="2024-01-01" />
      </verStmt>
      <holdings URI="https://doi.org/10.23695/ARYW-NH78">Landing page</holdings>
    </citation>
    <stdyInfo>
      <subject />
      <abstract xml:lang="en" contentType="abstract">Model
Marmot is a part-of-speech tagger. For Swedish it performs somewhat worse than state-of-the-art neural models, but it still yields very good results, runs much faster and does not require a GPU. We provide two models for Marmot.
marmot_eval is trained on SUC3 and Talbanken_SBX_dev, using SALDO as the dictionary. The advantage of this model is that it can be evaluated on Talbanken_SBX_test or SIC2. The evaluation results are reported in the table below.

Test set            Exact match  POS    MSD
Talbanken_SBX_test  0.973        0.982  0.988
SIC2                0.921        0.934  0.958

 Read more about the evaluation here.
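
The three scores in the table can be read as token-level accuracies. As a rough sketch only, and assuming that Exact match counts tokens where both POS and MSD are correct (this definition is an assumption, not taken from the evaluation report, and the function name score is hypothetical), such scores could be computed from gold and predicted (POS, MSD) pairs like this:

```python
def score(gold, pred):
    """Return token-level accuracies for exact match, POS and MSD.

    gold and pred are equal-length lists of (pos, msd) pairs.
    Assumption: exact match means both POS and MSD are correct.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    n = len(gold)
    pos_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    msd_ok = sum(g[1] == p[1] for g, p in zip(gold, pred))
    exact_ok = sum(g == p for g, p in zip(gold, pred))
    return {"exact": exact_ok / n, "pos": pos_ok / n, "msd": msd_ok / n}


# Toy data, invented for illustration; not taken from any test set.
gold = [("NN", "UTR.SIN.IND.NOM"), ("VB", "PRS.AKT"), ("AB", "_"), ("NN", "NEU.SIN.IND.NOM")]
pred = [("NN", "UTR.SIN.IND.NOM"), ("VB", "PRT.AKT"), ("AB", "_"), ("JJ", "NEU.SIN.IND.NOM")]
print(score(gold, pred))
```

In the toy data one token has the wrong MSD and another the wrong POS, so the exact-match score is lower than either of the other two, which is the same ordering as in the reported results.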
marmot_full is trained on SUC3 + Talbanken_SBX_test + Talbanken_SBX_dev + SIC2 (with SALDO as the dictionary). Because the test sets are part of its training data, we cannot evaluate the performance of this model, but we expect it to perform at least as well as marmot_eval.
Tagging and training
Download Marmot and the necessary dependencies. Download SALDO (converted to the necessary format) here. Download our scripts from this repository.
The scripts use a tab-separated three-column format (col3): token, POS (without MSD), MSD. To convert CoNLL(-U) to an intermediate two-column format (col2), install Ruby 1.9+ and run ruby conllu_to_tab.rb 2 n, where n is the number of the column you want to use (if you are converting our CoNLL-U files, use 4). Then run ruby convert_col2_to_marmot.rb to convert the resulting col2 file to Marmot's col3 format.
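
To illustrate the difference between the two formats, here is a minimal Python sketch of the col2-to-col3 step. It is not the actual convert_col2_to_marmot.rb. It assumes that the second column holds a dot-separated tag such as NN.UTR.SIN.DEF.NOM, that the POS is the part before the first dot, and that tokens without MSD get the placeholder _ (all three are assumptions, and the function name col2_to_col3 is hypothetical).

```python
def col2_to_col3(lines, empty_msd="_"):
    """Split a combined tag column into separate POS and MSD columns.

    Input lines look like "token TAB POS.MSD"; blank lines (sentence
    boundaries) are passed through unchanged.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            out.append("")
            continue
        token, tag = line.split("\t")
        # Everything before the first dot is taken as POS, the rest as MSD.
        pos, _, msd = tag.partition(".")
        out.append("\t".join([token, pos, msd or empty_msd]))
    return out


# Toy input, invented for illustration.
col2 = ["Hunden\tNN.UTR.SIN.DEF.NOM", "sover\tVB.PRS.AKT", "nu\tAB", ""]
for row in col2_to_col3(col2):
    print(row)
```

Check the output of the real scripts before training, since the placeholder for an empty MSD and the handling of sentence boundaries may differ from this sketch.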

Tagging
To tag a corpus using a pretrained model, run java -cp marmot.jar marmot.morph.cmd.Annotator --model-file model_name.marmot --test-file form-index=0,test_corpus.col1 --pred-file output_name.conll. The output corpus will be in a CoNLL format with a somewhat unusual column order; use convert_marmot_to_conllu.rb to convert it to standard CoNLL-U.
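
As a sketch of how the tagging step might be scripted, the Python snippet below writes tokenized sentences to a one-column file and assembles the Annotator command from the paragraph above as an argument list. It assumes one token per line with a blank line between sentences. The helper names and the example sentences are hypothetical, and the command is only run if marmot.jar and the model file are present in the working directory.

```python
import os
import subprocess


def write_col1(sentences, path):
    """Write one token per line, with a blank line after each sentence."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            for token in sentence:
                f.write(token + "\n")
            f.write("\n")


def annotator_command(model, test_file, pred_file, jar="marmot.jar"):
    """Build the Annotator call described in the text as an argument list."""
    return [
        "java", "-cp", jar, "marmot.morph.cmd.Annotator",
        "--model-file", model,
        "--test-file", "form-index=0," + test_file,
        "--pred-file", pred_file,
    ]


write_col1([["Hunden", "sover", "."], ["Det", "regnar", "."]], "test_corpus.col1")
cmd = annotator_command("model_name.marmot", "test_corpus.col1", "output_name.conll")
print(" ".join(cmd))

# Only invoke Java when the jar and the model are actually available.
if os.path.exists("marmot.jar") and os.path.exists("model_name.marmot"):
    subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.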
Training your own models
Run Marmot: java -Xmx5G -cp marmot.jar marmot.morph.cmd.Trainer -train-file form-index=0,tag-index=1,morph-index=2,corpus.col3 -tag-morph true -model-file model_name.marmot -subtag-separator "." -type-dict saldo_marmot.txt,indexes=[2,3]</abstract>
      <sumDscr />
    </stdyInfo>
    <method>
      <dataColl />
    </method>
    <dataAccs>
      <useStmt>
        <restrctn xml:lang="en">Access to data through an external actor. </restrctn>
        <restrctn xml:lang="sv">Åtkomst till data via extern aktör. </restrctn>
      </useStmt>
    </dataAccs>
    <othrStdyMat />
  </stdyDscr>
</codeBook>