<codeBook xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" xmlns="ddi:codebook:2_5">
  <docDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv"></titl>
        <parTitl xml:lang="en">Gene annotation of Blastobotrys mokoenaii, Blastobotrys illinoisensis, and Blastobotrys malaysiensis</parTitl>
        <IDNo agency="SND">doi-10-17044-scilifelab-28606814-0</IDNo>
        <IDNo agency="DOI">https://doi.org/10.17044/SCILIFELAB.28606814</IDNo>
      </titlStmt>
      <prodStmt>
        <producer xml:lang="en" abbr="SND">Swedish National Data Service</producer>
        <producer xml:lang="sv" abbr="SND">Svensk nationell datatjänst</producer>
      </prodStmt>
      <holdings URI="https://doi.org/10.17044/SCILIFELAB.28606814">Landing page</holdings>
    </citation>
  </docDscr>
  <stdyDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv"></titl>
        <parTitl xml:lang="en">Gene annotation of Blastobotrys mokoenaii, Blastobotrys illinoisensis, and Blastobotrys malaysiensis</parTitl>
        <IDNo agency="SND">doi-10-17044-scilifelab-28606814-0</IDNo>
        <IDNo agency="DOI">https://doi.org/10.17044/SCILIFELAB.28606814</IDNo>
      </titlStmt>
      <rspStmt>
        <AuthEnty xml:lang="en" affiliation="Science for Life Laboratory">Borges Dias, Guilherme</AuthEnty>
      </rspStmt>
      <prodStmt>
        <grantNo xml:lang="en" agency="Novo Nordisk Foundation">NNF22OC0079899</grantNo>
      </prodStmt>
      <distStmt>
        <distrbtr xml:lang="en" abbr="SND" URI="https://snd.se">Swedish National Data Service</distrbtr>
        <distrbtr xml:lang="sv" abbr="SND" URI="https://snd.se">Svensk nationell datatjänst</distrbtr>
        <distDate xml:lang="en" date="2025-03-21" />
      </distStmt>
      <verStmt>
        <version elementVersion="0" elementVersionDate="2025-03-21" />
      </verStmt>
      <holdings URI="https://doi.org/10.17044/SCILIFELAB.28606814">Landing page</holdings>
    </citation>
    <stdyInfo>
      <subject />
      <abstract xml:lang="en" contentType="abstract">This dataset contains the gene annotation data for three species of Blastobotrys yeasts: B. mokoenaii, B. illinoisensis, and B. malaysiensis.

The genome assemblies for B. mokoenaii (NRRL Y-27120) and B. malaysiensis (NRRL Y-6417) were publicly available on the National Center for Biotechnology Information (NCBI) under accessions GCA_003705765.3 and GCA_030558815.1, respectively.

The genome assembly for B. illinoisensis (NRRL YB-1343) was generated by SciLifeLab's National Genomics Infrastructure (NGI) using PacBio long-read data and deposited in the European Nucleotide Archive (ENA) under accession GCA_965113335.1.

File description

- bmokoenaii_annotation.gff
This file contains the gene models predicted for B. mokoenaii (GCA_003705765.3).
- billinoisensis_annotation.gff
This file contains the gene models predicted for B. illinoisensis (GCA_965113335.1).
- bmalaysiensis_annotation.gff
This file contains the gene models predicted for B. malaysiensis (GCA_030558815.1).
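
As a quick sanity check after download, the feature types in any of the GFF files listed above can be tallied with standard shell tools. The sketch below is not part of the dataset: the function name count_gff_features is hypothetical, and it assumes tab-delimited, nine-column GFF records with the feature type in column 3.

```shell
# Hypothetical helper (not shipped with the dataset): count features per type in a GFF file.
count_gff_features() {
  # Skip comment/directive lines and keep only well-formed nine-column records;
  # column 3 holds the feature type (e.g. gene, mRNA, CDS, tRNA, rRNA).
  awk -F'\t' '$0 !~ /^#/ { if (NF == 9) n[$3]++ } END { for (t in n) printf "%s\t%d\n", t, n[t] }' "$1" | sort
}

# Example call, assuming the file is in the current directory:
# count_gff_features bmokoenaii_annotation.gff
```

Each output line is a feature type followed by its count, which makes it easy to confirm that protein-coding genes, tRNAs, and rRNAs are all present in the combined file.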

Gene annotation methods

Repeat Masking
Prior to annotation, a repeat library was built for each species using RepeatModeler2 v2.0.2, and the genomes were soft-masked using RepeatMasker v4.1.5.


$ RepeatModeler -database ${DB} -engine ncbi -pa 16
$ RepeatMasker -dir . -gff -u -no_is -xsmall -e ncbi -lib ${LIBRARY} -pa 16 genome.fasta

Structural Annotation
Structural annotation was performed on the soft-masked genomes using Braker3 v3.0.3 incorporating external evidence in the form of all fungal proteins from OrthoDB v11 (available at https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11).


$ braker.pl --genome="$genome" \
    --prot_seq=${protein} --workingdir=${PWD} \
    --gff3 --threads=16 --verbosity=3 \
    --nocleanup --species=${i}

Functional Annotation

The predicted genes were functionally annotated using the National Bioinformatics Infrastructure Sweden (NBIS) functional_annotation Nextflow pipeline v2.0.0 (https://github.com/NBISweden/pipelines-nextflow). Briefly, this pipeline performs similarity searches between the annotated proteins and the UniProtKB/Swiss-Prot database (downloaded in 2023-12) using the Basic Local Alignment Search Tool (BLAST). It then uses InterProScan to query the proteins against InterPro v59-91 databases and merges the results using AGAT v1.2.0.

tRNAs and rRNAs

Transfer RNA (tRNA) and ribosomal RNA (rRNA) genes were annotated using tRNAscan-SE v2.0.12 and barrnap v0.9, respectively. Other ncRNAs, such as SRP RNA, RNase P RNA, and spliceosomal ncRNAs, were not predicted.

$ tRNAscan-SE -E --gff ${output}_trnas.gff --thread 16 ${genome}.fasta
$ barrnap --kingdom euk --threads 6 ${genome}.fasta &gt; ${output}_rrna.gff

Annotation integration
Finally, the functionally annotated protein-coding genes, tRNAs, and rRNAs were combined into a single GFF file using AGAT v1.2.0.

$ agat_sp_complement_annotations.pl --ref ${protein_coding} --add ${trna} --add ${rrna} --out full_annotation.gff</abstract>
      <sumDscr />
    </stdyInfo>
    <method>
      <dataColl />
    </method>
    <dataAccs>
      <useStmt>
        <restrctn xml:lang="en">Access to data through an external actor. </restrctn>
        <restrctn xml:lang="sv">Åtkomst till data via extern aktör. </restrctn>
      </useStmt>
    </dataAccs>
    <othrStdyMat />
  </stdyDscr>
</codeBook>